Earlier  
Posted Nick Remark
#openstack-nova - 2019-06-26
16:19:42 mriedem dansmith: that was the point of my fix
16:19:52 mriedem to reset to STOPPED if it was STOPPED
16:19:55 dansmith mriedem: right
16:19:57 mriedem well, in part,
16:20:02 mriedem the main point was don't put it in ERROR status
16:21:59 melwitt ok, I think I see. this is ok because the instance is actually ok (other than cosmetic), whereas for the first example, the instance was not ok and was proposed to auto-correct to an ok/healthy state
16:22:27 dansmith the auto-revert actually moved stuff back, IIRC
16:22:37 dansmith not just correcting state, but actual revert
16:22:39 melwitt yeah it did
16:23:09 melwitt I was zooming in on the vm_state part of it, how it appears to an external script like in your example in the comment
16:24:20 melwitt and then I was thinking, is that a problem, if we imagine an external script executing a cold migrate and it fails and the instance stays ACTIVE so the script doesn't know it didn't work. that sort of thing
16:25:08 melwitt I was wondering about that after I read the comments on the old auto-revert patch
16:26:00 dansmith but the difference is,
16:26:14 mriedem the external thing should be waiting for task_state to be None to know the operation is done (or the instance action is finished/error, or the migration status is 'finished' or whatever in this case)
16:26:18 dansmith the merged patch corrected state before it changed from $orig to MIGRATING or whatever, right?
16:26:20 mriedem polling the vm_state in the API is not sufficient
16:26:42 dansmith the auto-revert one has it go into all the migrating states and then pop back
16:27:16 dansmith specifically, potentially pop back to ACTIVE and not have moved, IIRC
16:27:43 melwitt yes, I believe it did prevent an ERROR state that occurred before going to MIGRATING
16:28:22 mriedem i'm getting lost in the "it" references here when talking about separate changes
16:28:32 melwitt heh, sorry. the merged change
16:28:35 mriedem booth's change was,
16:28:51 mriedem resize/cold migrated failed somewhere and somehow, and the instance was set to ERROR status, right?
16:29:14 mriedem and if you tried doing a revertResize API call on that ERROR instance, it would do the revert resize flow to go back from the dest to the source host
16:29:31 dansmith no, it did a full revert I think
16:29:37 mriedem even though what we could have failed on was maybe something in prep_resize or resize_instance before the guest / volumes / networking ever actually *got* to the dest host
16:30:10 dansmith so we get to the dest host, fail, auto-revert back to source, and go back to ACTIVE
16:30:23 dansmith you wait for ACTIVE to mean "success" but really it failed and the instance hasn't resized or move
16:30:37 melwitt yeah, I think it was a full revert on the booth change. i.e. do automatically what a user would have to do, initiate a revert
16:30:37 mriedem oh i see https://review.opendev.org/#/c/462521/12/nova/compute/manager.py@4449
16:30:39 dansmith granted it's been 18 months since I last looked at this
16:30:52 dansmith it's really the opposite of what mriedem's change was doing,
16:31:06 dansmith which was keep it active if we don't start
16:31:22 mriedem or stopped rather than active...
16:31:33 dansmith well, and that's an important piece yeah
16:31:37 mriedem i.e. start resize with a stopped server, prep_resize fails, don't reset to active *because it's stopped*
16:31:44 dansmith right
16:31:53 mriedem eventually the power sync task would stop the instance i think but still
16:32:08 dansmith or restart it when it shouldn't, right?
16:32:17 melwitt yeah, makes sense
16:32:18 dansmith if vm_state is active, it was stopped, power state sync says "hmm, this should be running"
16:32:21 mriedem i don't think that task ever starts anything
16:32:38 mriedem even though people have asked for that in the past
16:32:54 dansmith no? I thought it would for things like post-host-failure recovery
16:32:56 mriedem i believe the reasoning was always, we don't want to turn things on by guessing and then bill the user
16:33:11 dansmith well, billing is unrelated to started or stopped, but okay :)
16:33:26 dansmith it's a complex enough not-really-a-state-machine that I'm sure I'm getting it wrong
16:33:28 mriedem depends on how you do your billing
16:33:35 dansmith regardless, ACTIVE but not running is about as bad
16:33:37 mriedem same - it's been a long time since i loked
16:33:42 mriedem *looked
16:34:09 mriedem anyway, i agree that if i'm doing a resize (and i'm sure tempest would do this), you're waiting for the instance to go to VERIFY_RESIZE with task_state=None,
16:34:23 mriedem it the instance goes back to ACTIVE with task_state=None, i'd wait indefinitely
16:34:28 mriedem unless i've got a timeout,
16:34:43 dansmith especially if you went into RESIZING in between
16:34:46 mriedem or also checking instance actions or migration status (which might be admin-only inof)
16:34:47 mriedem *info
16:35:10 mriedem i personally wouldn't try to track the task_state transitions since that's probably a losing game
16:35:21 mriedem i would just wait for terminal states but yeah
16:35:31 dansmith the thing is, ACTIVE is a terminal state for auto-confirm
16:35:45 mriedem true yeah
16:35:46 dansmith so if it went ACTIVE -> RESIZING -> ACTIVE, you should assume it actually resized and was auto-confirmed
16:35:51 dansmith but with auto-revert,
16:35:55 mriedem i know powervc set auto-confirm to 1 second
16:35:56 dansmith that breaks that behavior
16:36:12 mriedem lbragstad had to fix a few race bugs as a result :)
16:36:15 dansmith with auto-revert, ACTIVE->RESIZING->ACTIVE could mean "it worked" or "it didn't"
16:36:35 mriedem dansmith: yeah, and you wouldn't know unless you checked the migratoin or instance actions, which you as a non-admin might not have access to those details
16:36:42 melwitt yeah, I see
16:36:56 dansmith it turns waiting for a terminal state into a much more complex affair for sure
16:37:02 openstackgerrit Merged openstack/nova master: Replace deprecated with_lockmode with with_for_update https://review.opendev.org/666221
16:37:57 melwitt that's a helpful way to think about it, imagining what a tempest (or func test) would need to do to be able to automate it
16:40:15 mriedem maybe should link this conversation into the abandoned change so we have that when this comes up again in 2 years :)
16:40:41 melwitt yeah, that's a good idea. let me do that now
16:44:45 sean-k-mooney ... i started reading the scroll back and i think on second tought i not going to do that
16:47:29 sean-k-mooney melwitt: the only way for a non admin to deterim if a cold migrate suceeded would be to check the hashed host id before and after
16:48:19 sean-k-mooney for resize they could check the if the flavor is the one they expected
16:48:31 dansmith sean-k-mooney: not really
16:48:43 dansmith oh for a strict migration, yeah
16:48:51 dansmith was going to say, resize to same host breaks that assumption
16:49:02 melwitt sean-k-mooney: could also observe ACTIVE -> RESIZING -> ACTIVE as dansmith described, right? as non-admin
16:49:22 dansmith melwitt: yes
16:49:34 sean-k-mooney you could observe it if you pool but you would not know if it succeded or failed
16:49:43 sean-k-mooney without also checking if the falvor is the old or new one
16:49:58 dansmith sean-k-mooney: you won't go back to active from resizing currently
16:50:28 sean-k-mooney oh ok so that was the change ye were talking about
16:50:36 melwitt sean-k-mooney: if it failed [after going to RESIZING] it would go to ERROR. are you talking hypothetically with the abandoned patch?
16:51:15 sean-k-mooney melwitt: there are case we i though it would auto revert that went back to active
16:51:42 melwitt sean-k-mooney: no, that was the proposal in the abandoned patch
16:52:21 sean-k-mooney ok i might be thinking about live migrate then
16:52:41 sean-k-mooney for live migrate we can fail to migrate but still be in active
16:58:20 sean-k-mooney so ya looking at code earch revert_resize is only ever called form the api which simplifes some things but not others
16:59:28 sean-k-mooney melwitt: do we currently allow you to revert a resize for an instance that is in error because the resize failed
16:59:58 sean-k-mooney so you can go active->resizeing->error->active?
17:00:36 mriedem fwiw, as a non-admin i think you can tell if your resize failed if the instance action "message" is not null /servers/{server_id}/os-instance-actions/{request_id}
17:00:40 mriedem er GET /servers/{server_id}/os-instance-actions/{request_id}
17:00:43 melwitt sean-k-mooney: I think so, based on the abandoned patch. it was proposing to do that automatically (from error)
17:01:37 sean-k-mooney melwitt: ok if we did not you would have to do rest state (which is admin only?) + a hard reboot
17:02:12 melwitt mriedem: you mean failed before resize started right
17:02:29 mriedem no if the operation failed

Earlier   Later