| Posted | Nick | Remark | |
|---|---|---|---|
| #openstack-nova - 2019-06-26 | |||
| 16:30:10 | dansmith | so we get to the dest host, fail, auto-revert back to source, and go back to ACTIVE | |
| 16:30:23 | dansmith | you wait for ACTIVE to mean "success" but really it failed and the instance hasn't resized or moved | |
| 16:30:37 | mriedem | oh i see https://review.opendev.org/#/c/462521/12/nova/compute/manager.py@4449 | |
| 16:30:37 | melwitt | yeah, I think it was a full revert on the booth change. i.e. do automatically what a user would have to do, initiate a revert | |
| 16:30:39 | dansmith | granted it's been 18 months since I last looked at this | |
| 16:30:52 | dansmith | it's really the opposite of what mriedem's change was doing, | |
| 16:31:06 | dansmith | which was keep it active if we don't start | |
| 16:31:22 | mriedem | or stopped rather than active... | |
| 16:31:33 | dansmith | well, and that's an important piece yeah | |
| 16:31:37 | mriedem | i.e. start resize with a stopped server, prep_resize fails, don't reset to active *because it's stopped* | |
| 16:31:44 | dansmith | right | |
| 16:31:53 | mriedem | eventually the power sync task would stop the instance i think but still | |
| 16:32:08 | dansmith | or restart it when it shouldn't, right? | |
| 16:32:17 | melwitt | yeah, makes sense | |
| 16:32:18 | dansmith | if vm_state is active, it was stopped, power state sync says "hmm, this should be running" | |
| 16:32:21 | mriedem | i don't think that task ever starts anything | |
| 16:32:38 | mriedem | even though people have asked for that in the past | |
| 16:32:54 | dansmith | no? I thought it would for things like post-host-failure recovery | |
| 16:32:56 | mriedem | i believe the reasoning was always, we don't want to turn things on by guessing and then bill the user | |
| 16:33:11 | dansmith | well, billing is unrelated to started or stopped, but okay :) | |
| 16:33:26 | dansmith | it's a complex enough not-really-a-state-machine that I'm sure I'm getting it wrong | |
| 16:33:28 | mriedem | depends on how you do your billing | |
| 16:33:35 | dansmith | regardless, ACTIVE but not running is about as bad | |
| 16:33:37 | mriedem | same - it's been a long time since i loked | |
| 16:33:42 | mriedem | *looked | |
| 16:34:09 | mriedem | anyway, i agree that if i'm doing a resize (and i'm sure tempest would do this), you're waiting for the instance to go to VERIFY_RESIZE with task_state=None, | |
| 16:34:23 | mriedem | if the instance goes back to ACTIVE with task_state=None, i'd wait indefinitely | |
| 16:34:28 | mriedem | unless i've got a timeout, | |
| 16:34:43 | dansmith | especially if you went into RESIZING in between | |
| 16:34:46 | mriedem | or also checking instance actions or migration status (which might be admin-only inof) | |
| 16:34:47 | mriedem | *info | |
| 16:35:10 | mriedem | i personally wouldn't try to track the task_state transitions since that's probably a losing game | |
| 16:35:21 | mriedem | i would just wait for terminal states but yeah | |
| 16:35:31 | dansmith | the thing is, ACTIVE is a terminal state for auto-confirm | |
| 16:35:45 | mriedem | true yeah | |
| 16:35:46 | dansmith | so if it went ACTIVE -> RESIZING -> ACTIVE, you should assume it actually resized and was auto-confirmed | |
| 16:35:51 | dansmith | but with auto-revert, | |
| 16:35:55 | mriedem | i know powervc set auto-confirm to 1 second | |
| 16:35:56 | dansmith | that breaks that behavior | |
| 16:36:12 | mriedem | lbragstad had to fix a few race bugs as a result :) | |
| 16:36:15 | dansmith | with auto-revert, ACTIVE->RESIZING->ACTIVE could mean "it worked" or "it didn't" | |
| 16:36:35 | mriedem | dansmith: yeah, and you wouldn't know unless you checked the migration or instance actions, and you as a non-admin might not have access to those details | |
| 16:36:42 | melwitt | yeah, I see | |
| 16:36:56 | dansmith | it turns waiting for a terminal state into a much more complex affair for sure | |
| 16:37:02 | openstackgerrit | Merged openstack/nova master: Replace deprecated with_lockmode with with_for_update https://review.opendev.org/666221 | |
| 16:37:57 | melwitt | that's a helpful way to think about it, imagining what a tempest (or func test) would need to do to be able to automate it | |
| 16:40:15 | mriedem | maybe should link this conversation into the abandoned change so we have that when this comes up again in 2 years :) | |
| 16:40:41 | melwitt | yeah, that's a good idea. let me do that now | |
| 16:44:45 | sean-k-mooney | ... i started reading the scroll back and i think on second thought i'm not going to do that | |
| 16:47:29 | sean-k-mooney | melwitt: the only way for a non admin to determine if a cold migrate succeeded would be to check the hashed host id before and after | |
| 16:48:19 | sean-k-mooney | for resize they could check if the flavor is the one they expected | |
| 16:48:31 | dansmith | sean-k-mooney: not really | |
| 16:48:43 | dansmith | oh for a strict migration, yeah | |
| 16:48:51 | dansmith | was going to say, resize to same host breaks that assumption | |
| 16:49:02 | melwitt | sean-k-mooney: could also observe ACTIVE -> RESIZING -> ACTIVE as dansmith described, right? as non-admin | |
| 16:49:22 | dansmith | melwitt: yes | |
| 16:49:34 | sean-k-mooney | you could observe it if you poll but you would not know if it succeeded or failed | |
| 16:49:43 | sean-k-mooney | without also checking if the flavor is the old or new one | |
| 16:49:58 | dansmith | sean-k-mooney: you won't go back to active from resizing currently | |
| 16:50:28 | sean-k-mooney | oh ok so that was the change ye were talking about | |
| 16:50:36 | melwitt | sean-k-mooney: if it failed [after going to RESIZING] it would go to ERROR. are you talking hypothetically with the abandoned patch? | |
| 16:51:15 | sean-k-mooney | melwitt: there are cases where i thought it would auto-revert and go back to active | |
| 16:51:42 | melwitt | sean-k-mooney: no, that was the proposal in the abandoned patch | |
| 16:52:21 | sean-k-mooney | ok i might be thinking about live migrate then | |
| 16:52:41 | sean-k-mooney | for live migrate we can fail to migrate but still be in active | |
| 16:58:20 | sean-k-mooney | so ya looking at code search revert_resize is only ever called from the api which simplifies some things but not others | |
| 16:59:28 | sean-k-mooney | melwitt: do we currently allow you to revert a resize for an instance that is in error because the resize failed | |
| 16:59:58 | sean-k-mooney | so you can go active->resizing->error->active? | |
| 17:00:36 | mriedem | fwiw, as a non-admin i think you can tell if your resize failed if the instance action "message" is not null /servers/{server_id}/os-instance-actions/{request_id} | |
| 17:00:40 | mriedem | er GET /servers/{server_id}/os-instance-actions/{request_id} | |
| 17:00:43 | melwitt | sean-k-mooney: I think so, based on the abandoned patch. it was proposing to do that automatically (from error) | |
| 17:01:37 | sean-k-mooney | melwitt: ok if we did not you would have to do reset state (which is admin only?) + a hard reboot | |
| 17:02:12 | melwitt | mriedem: you mean failed before resize started right | |
| 17:02:29 | mriedem | no if the operation failed | |
| 17:02:49 | mriedem | if any event in an action (operation) fails, the overall action 'message' is always 'Error': https://github.com/openstack/nova/blob/707deb158996d540111c23afd8c916ea1c18906a/nova/db/sqlalchemy/api.py#L5227 | |
| 17:02:53 | melwitt | sean-k-mooney: if we did not allow revert from error? I don't think reset state + reboot would put everything back properly | |
| 17:02:56 | mriedem | which is actually a bug... | |
| 17:03:18 | mriedem | https://bugs.launchpad.net/nova/+bug/1824420 | |
| 17:03:19 | openstack | Launchpad bug 1824420 in OpenStack Compute (nova) "Live migration succeeds but instance-action-list still has unexpected Error status" [Undecided,Triaged] | |
| 17:03:44 | melwitt | oh | |
| 17:04:30 | mriedem | so before we go down the road of "well the user can track the operation to see if it was auto-reverted on error because of instance actions" let me point out that relying on instance actions that way isn't foolproof because of that bug | |
| 17:04:43 | mriedem | and especially b/c it's a result of failures on hosts and then doing reschedules to other hosts | |
| 17:04:45 | mriedem | which resize can do | |
| 17:04:52 | sean-k-mooney | the instance should become active on the source host but it might not fix the allocation in placement properly | |
| 17:05:24 | mriedem | auto-reverting a failed resize could be all sorts of f'ed up | |
| 17:05:29 | mriedem | because rollbacks are near impossible | |
| 17:05:40 | mriedem | hard to test | |
| 17:05:58 | mriedem | i'm fairly certain our live migration rollback code is also quite janky in several ways | |
| 17:06:03 | mriedem | because we don't test it in the gate | |
| 17:08:07 | sean-k-mooney | just looking at that bug the live migration failed right? | |
| 17:09:07 | sean-k-mooney | so we would expect there to be an error in the instance action log? | |
| 17:10:06 | mriedem | no | |
| 17:10:10 | mriedem | read my comments on the bug | |
| 17:10:29 | sean-k-mooney | maybe i'm misreading it as it's kind of hard to read the initial bug | |
| 17:10:31 | mriedem | a pre-check on one of the candidate dest hosts failed | |
| 17:10:44 | mriedem | which triggers a reschedule to another dest host in the conductor live migration task | |
| 17:10:49 | mriedem | the 2nd host works | |
| 17:11:17 | mriedem | but b/c the pre-check failed on the first dest host the instance action event for that one is error which sets the overall action message to 'Error' | |
| 17:11:36 | sean-k-mooney | ah ok | |
| 17:11:46 | mriedem | iow, actions aren't reschedule aware | |
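
The ambiguity discussed above (ACTIVE -> RESIZING -> ACTIVE meaning either "auto-confirmed" or, under the proposed auto-revert, "failed and rolled back") can be sketched from a non-admin client's point of view. This is only an illustration under stated assumptions: `classify_resize`, `Outcome`, and the flavor names are hypothetical and not part of nova or tempest, the status strings are the labels used in the conversation rather than exact API values, and the flavor comparison is the check sean-k-mooney suggested.

```python
from enum import Enum
from typing import Sequence


class Outcome(Enum):
    """Hypothetical labels for what a client can conclude from polling."""
    IN_PROGRESS = "in_progress"          # nothing terminal observed yet
    PENDING_CONFIRM = "pending_confirm"  # reached VERIFY_RESIZE
    CONFIRMED = "confirmed"              # back to ACTIVE with the new flavor
    REVERTED = "reverted"                # back to ACTIVE with the old flavor
    FAILED = "failed"                    # ended in ERROR
    AMBIGUOUS = "ambiguous"              # cannot tell from status + flavor


def classify_resize(statuses: Sequence[str], old_flavor: str,
                    new_flavor: str, current_flavor: str) -> Outcome:
    """Classify a resize from the statuses a client saw while polling.

    Assumption: `statuses` is the ordered list of statuses observed since
    the resize was requested, using the labels from the discussion above.
    """
    if not statuses:
        return Outcome.IN_PROGRESS
    last = statuses[-1]
    if last == "ERROR":
        return Outcome.FAILED
    if last == "VERIFY_RESIZE":
        return Outcome.PENDING_CONFIRM
    if last == "ACTIVE" and "RESIZING" in statuses[:-1]:
        # ACTIVE after RESIZING is terminal, but with auto-revert it no
        # longer implies success, so fall back to comparing flavors.
        if old_flavor == new_flavor:
            # Cold migration: the flavor does not change, so this check
            # cannot distinguish success from a revert.
            return Outcome.AMBIGUOUS
        if current_flavor == new_flavor:
            return Outcome.CONFIRMED
        if current_flavor == old_flavor:
            return Outcome.REVERTED
        return Outcome.AMBIGUOUS
    return Outcome.IN_PROGRESS


if __name__ == "__main__":
    seen = ["ACTIVE", "RESIZING", "ACTIVE"]
    print(classify_resize(seen, "m1.small", "m1.large", "m1.large"))
    print(classify_resize(seen, "m1.small", "m1.large", "m1.small"))
```

The cold-migration case stays `AMBIGUOUS` on purpose: as noted in the log, the hashed host id could help there, but a resize to the same host would break that assumption as well.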