| Posted | Nick | Remark | |
|---|---|---|---|
| #openstack-nova - 2019-06-26 | | | |
| 16:08:42 | mriedem | the audit command could also take a --consumer option to just investigate what the operator thinks is a problem instance/migration | |
| 16:09:00 | mriedem | note that i added --instance to heal_allocations later for that reason | |
| 16:09:07 | bauzas | yup I saw | |
| 16:09:10 | mriedem | and --dry-run | |
| 16:09:31 | mriedem | depends on what the command will do though, if it's just reporting then you don't need a --dry-run | |
| 16:11:20 | bauzas | I was thinking of just telling the orphaned, but then later adding a --remove option | |
| 16:11:33 | bauzas | *later* | |
| 16:12:35 | bauzas | anyway, needs to go off and run by hot summer nights | |
| 16:13:03 | bauzas | I think I have everything I needed, thanks folks | |
| 16:14:57 | dansmith | hoo boy | |
| 16:15:23 | mriedem | strictly adult cheese, wine and things of that nature | |
| 16:16:35 | melwitt | now, for another fun topic | |
| 16:17:02 | melwitt | mriedem, dansmith: I was reading these comments on an old [unmerged] patch: https://review.opendev.org/#/c/462521/12/releasenotes/notes/resize-auto-revert-6e1648828aba16b2.yaml@5, | |
| 16:17:31 | melwitt | and it made me think of the [recently merged] patch: https://review.opendev.org/633227 again and how it changed ERROR state to ACTIVE (or STOPPED) state. now I'm worried that wasn't an ok thing to do (API change?) | |
| 16:18:12 | melwitt | for a failed cold migration to self | |
| 16:18:21 | mriedem | not the same | |
| 16:18:27 | mriedem | in my change, | |
| 16:18:35 | mriedem | we failed in prep_resize before we actually did anything to the guest | |
| 16:18:46 | mriedem | in that case, putting the instance in ERROR status makes no sense imo | |
| 16:19:08 | mriedem | as i said, the only way you can get it out of error then is to do something like rebuild, hard reboot and/or reset status to ACTIVE, | |
| 16:19:16 | dansmith | and was yours also resetting to ACTIVE if it was actually shutoff? | |
| 16:19:17 | dansmith | I forget | |
| 16:19:26 | mriedem | and if i started a resize or cold migration of a STOPPED instance, then resetting it to ACTIVE isn't what i want, nor is rebuild or hard reboot really | |
| 16:19:42 | mriedem | dansmith: that was the point of my fix | |
| 16:19:52 | mriedem | to reset to STOPPED if it was STOPPED | |
| 16:19:55 | dansmith | mriedem: right | |
| 16:19:57 | mriedem | well, in part, | |
| 16:20:02 | mriedem | the main point was don't put it in ERROR status | |
| 16:21:59 | melwitt | ok, I think I see. this is ok because the instance is actually ok (other than cosmetic), whereas for the first example, the instance was not ok and was proposed to auto-correct to an ok/healthy state | |
| 16:22:27 | dansmith | the auto-revert actually moved stuff back, IIRC | |
| 16:22:37 | dansmith | not just correcting state, but actual revert | |
| 16:22:39 | melwitt | yeah it did | |
| 16:23:09 | melwitt | I was zooming in on the vm_state part of it, how it appears to an external script like in your example in the comment | |
| 16:24:20 | melwitt | and then I was thinking, is that a problem, if we imagine an external script executing a cold migrate and it fails and the instance stays ACTIVE so the script doesn't know it didn't work. that sort of thing | |
| 16:25:08 | melwitt | I was wondering about that after I read the comments on the old auto-revert patch | |
| 16:26:00 | dansmith | but the difference is, | |
| 16:26:14 | mriedem | the external thing should be waiting for task_state to be None to know the operation is done (or the instance action is finished/error, or the migration status is 'finished' or whatever in this case) | |
| 16:26:18 | dansmith | the merged patch corrected state before it changed from $orig to MIGRATING or whatever, right? | |
| 16:26:20 | mriedem | polling the vm_state in the API is not sufficient | |
| 16:26:42 | dansmith | the auto-revert one has it go into all the migrating states and then pop back | |
| 16:27:16 | dansmith | specifically, potentially pop back to ACTIVE and not have moved, IIRC | |
| 16:27:43 | melwitt | yes, I believe it did prevent an ERROR state that occurred before going to MIGRATING | |
| 16:28:22 | mriedem | i'm getting lost in the "it" references here when talking about separate changes | |
| 16:28:32 | melwitt | heh, sorry. the merged change | |
| 16:28:35 | mriedem | booth's change was, | |
| 16:28:51 | mriedem | resize/cold migrate failed somewhere and somehow, and the instance was set to ERROR status, right? | |
| 16:29:14 | mriedem | and if you tried doing a revertResize API call on that ERROR instance, it would do the revert resize flow to go back from the dest to the source host | |
| 16:29:31 | dansmith | no, it did a full revert I think | |
| 16:29:37 | mriedem | even though what we could have failed on was maybe something in prep_resize or resize_instance before the guest / volumes / networking ever actually *got* to the dest host | |
| 16:30:10 | dansmith | so we get to the dest host, fail, auto-revert back to source, and go back to ACTIVE | |
| 16:30:23 | dansmith | you wait for ACTIVE to mean "success" but really it failed and the instance hasn't resized or moved | |
| 16:30:37 | melwitt | yeah, I think it was a full revert on the booth change. i.e. do automatically what a user would have to do, initiate a revert | |
| 16:30:37 | mriedem | oh i see https://review.opendev.org/#/c/462521/12/nova/compute/manager.py@4449 | |
| 16:30:39 | dansmith | granted it's been 18 months since I last looked at this | |
| 16:30:52 | dansmith | it's really the opposite of what mriedem's change was doing, | |
| 16:31:06 | dansmith | which was keep it active if we don't start | |
| 16:31:22 | mriedem | or stopped rather than active... | |
| 16:31:33 | dansmith | well, and that's an important piece yeah | |
| 16:31:37 | mriedem | i.e. start resize with a stopped server, prep_resize fails, don't reset to active *because it's stopped* | |
| 16:31:44 | dansmith | right | |
| 16:31:53 | mriedem | eventually the power sync task would stop the instance i think but still | |
| 16:32:08 | dansmith | or restart it when it shouldn't, right? | |
| 16:32:17 | melwitt | yeah, makes sense | |
| 16:32:18 | dansmith | if vm_state is active, it was stopped, power state sync says "hmm, this should be running" | |
| 16:32:21 | mriedem | i don't think that task ever starts anything | |
| 16:32:38 | mriedem | even though people have asked for that in the past | |
| 16:32:54 | dansmith | no? I thought it would for things like post-host-failure recovery | |
| 16:32:56 | mriedem | i believe the reasoning was always, we don't want to turn things on by guessing and then bill the user | |
| 16:33:11 | dansmith | well, billing is unrelated to started or stopped, but okay :) | |
| 16:33:26 | dansmith | it's a complex enough not-really-a-state-machine that I'm sure I'm getting it wrong | |
| 16:33:28 | mriedem | depends on how you do your billing | |
| 16:33:35 | dansmith | regardless, ACTIVE but not running is about as bad | |
| 16:33:37 | mriedem | same - it's been a long time since i loked | |
| 16:33:42 | mriedem | *looked | |
| 16:34:09 | mriedem | anyway, i agree that if i'm doing a resize (and i'm sure tempest would do this), you're waiting for the instance to go to VERIFY_RESIZE with task_state=None, | |
| 16:34:23 | mriedem | if the instance goes back to ACTIVE with task_state=None, i'd wait indefinitely | |
| 16:34:28 | mriedem | unless i've got a timeout, | |
| 16:34:43 | dansmith | especially if you went into RESIZING in between | |
| 16:34:46 | mriedem | or also checking instance actions or migration status (which might be admin-only inof) | |
| 16:34:47 | mriedem | *info | |
| 16:35:10 | mriedem | i personally wouldn't try to track the task_state transitions since that's probably a losing game | |
| 16:35:21 | mriedem | i would just wait for terminal states but yeah | |
| 16:35:31 | dansmith | the thing is, ACTIVE is a terminal state for auto-confirm | |
| 16:35:45 | mriedem | true yeah | |
| 16:35:46 | dansmith | so if it went ACTIVE -> RESIZING -> ACTIVE, you should assume it actually resized and was auto-confirmed | |
| 16:35:51 | dansmith | but with auto-revert, | |
| 16:35:55 | mriedem | i know powervc set auto-confirm to 1 second | |
| 16:35:56 | dansmith | that breaks that behavior | |
| 16:36:12 | mriedem | lbragstad had to fix a few race bugs as a result :) | |
| 16:36:15 | dansmith | with auto-revert, ACTIVE->RESIZING->ACTIVE could mean "it worked" or "it didn't" | |
| 16:36:35 | mriedem | dansmith: yeah, and you wouldn't know unless you checked the migration or instance actions, and you as a non-admin might not have access to those details | |
| 16:36:42 | melwitt | yeah, I see | |
| 16:36:56 | dansmith | it turns waiting for a terminal state into a much more complex affair for sure | |
| 16:37:02 | openstackgerrit | Merged openstack/nova master: Replace deprecated with_lockmode with with_for_update https://review.opendev.org/666221 | |
| 16:37:57 | melwitt | that's a helpful way to think about it, imagining what a tempest (or func test) would need to do to be able to automate it | |
| 16:40:15 | mriedem | maybe should link this conversation into the abandoned change so we have that when this comes up again in 2 years :) | |
| 16:40:41 | melwitt | yeah, that's a good idea. let me do that now | |
| 16:44:45 | sean-k-mooney | ... i started reading the scrollback and i think on second thought i'm not going to do that | |
| 16:47:29 | sean-k-mooney | melwitt: the only way for a non admin to determine if a cold migrate succeeded would be to check the hashed host id before and after | |
| 16:48:19 | sean-k-mooney | for resize they could check if the flavor is the one they expected | |
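
The checks discussed above (wait for task_state to be None, then look at the flavor or the hashed host id rather than trusting ACTIVE alone) can be sketched roughly as follows. This is an illustrative sketch, not Nova or Tempest code: the function name `classify_move` and the simplified dict keys `status`, `task_state`, `host_id` and `flavor` are assumptions made for this example, and it presumes the "after" snapshot was taken once the resize or cold migrate request had been accepted.

```python
from typing import Mapping, Optional


def classify_move(
    before: Mapping[str, Optional[str]],
    after: Mapping[str, Optional[str]],
    expected_flavor: Optional[str] = None,
) -> str:
    """Guess the outcome of a resize or cold migrate from two server snapshots.

    Hypothetical helper: the snapshot keys are simplified stand-ins for
    whatever a non-admin user can read about a server.
    Returns "pending", "succeeded" or "failed".
    """
    # A non-None task_state means the operation is still in flight.
    if after["task_state"] is not None:
        return "pending"

    status = after["status"]
    if status == "ERROR":
        return "failed"
    if status == "VERIFY_RESIZE":
        # Reached the destination and is waiting for confirm or revert.
        return "succeeded"

    # Back in a steady state (for example ACTIVE) with task_state None.
    # With auto-confirm this can mean success; if the operation failed
    # early or was reverted it means failure. The status alone is ambiguous,
    # so compare something the operation should have changed.
    if expected_flavor is not None:
        # Resize: did the server end up with the flavor that was requested?
        return "succeeded" if after["flavor"] == expected_flavor else "failed"
    # Cold migrate: did the hashed host id change?
    return "succeeded" if after["host_id"] != before["host_id"] else "failed"
```

A caller would poll until the result is no longer "pending", with a timeout. Note that the host id comparison cannot distinguish a failed cold migration from one that was allowed to land on the same host.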