Posted	Nick	Remark
#openstack-nova - 2019-06-26
16:08:42	mriedem	the audit command could also take a --consumer option to just investigate what the operator thinks is a problem instance/migration
16:09:00	mriedem	note that i added --instance to heal_allocations later for that reason
16:09:07	bauzas	yup I saw
16:09:10	mriedem	and --dry-run
16:09:31	mriedem	depends on what the command will do though, if it's just reporting then you don't need a --dry-run
16:11:20	bauzas	I was thinking of just telling the orphaned, but then later adding a --remove option
16:11:33	bauzas	later
16:12:35	bauzas	anyway, needs to go off and run by hot summer nights
16:13:03	bauzas	I think I have everything I needed, thanks folks
16:14:57	dansmith	hoo boy
16:15:23	mriedem	strictly adult cheese, wine and things of that nature
16:16:35	melwitt	now, for another fun topic
16:17:02	melwitt	mriedem, dansmith: I was reading these comments on an old [unmerged] patch: https://review.opendev.org/#/c/462521/12/releasenotes/notes/resize-auto-revert-6e1648828aba16b2.yaml@5,
16:17:31	melwitt	and it made me think of the [recently merged] patch: https://review.opendev.org/633227 again and how it changed ERROR state to ACTIVE (or STOPPED) state. now I'm worried that wasn't an ok thing to do (API change?)
16:18:12	melwitt	for a failed cold migration to self
16:18:21	mriedem	not the same
16:18:27	mriedem	in my change,
16:18:35	mriedem	we failed in prep_resize before we actually did anything to the guest
16:18:46	mriedem	in that case, putting the instance in ERROR status makes no sense imo
16:19:08	mriedem	as i said, the only way you can get it out of error then is to do something like rebuild, hard reboot and/or reset status to ACTIVE,
16:19:16	dansmith	and was yours also resetting to ACTIVE if it was actually shutoff?
16:19:17	dansmith	I forget
16:19:26	mriedem	and if i started a resize or cold migration of a STOPPED instance, then resetting it to ACTIVE isn't what i want, nor is rebuild or hard reboot really
16:19:42	mriedem	dansmith: that was the point of my fix
16:19:52	mriedem	to reset to STOPPED if it was STOPPED
16:19:55	dansmith	mriedem: right
16:19:57	mriedem	well, in part,
16:20:02	mriedem	the main point was don't put it in ERROR status
16:21:59	melwitt	ok, I think I see. this is ok because the instance is actually ok (other than cosmetic), whereas for the first example, the instance was not ok and was proposed to auto-correct to an ok/healthy state
16:22:27	dansmith	the auto-revert actually moved stuff back, IIRC
16:22:37	dansmith	not just correcting state, but actual revert
16:22:39	melwitt	yeah it did
16:23:09	melwitt	I was zooming in on the vm_state part of it, how it appears to an external script like in your example in the comment
16:24:20	melwitt	and then I was thinking, is that a problem, if we imagine an external script executing a cold migrate and it fails and the instance stays ACTIVE so the script doesn't know it didn't work. that sort of thing
16:25:08	melwitt	I was wondering about that after I read the comments on the old auto-revert patch
16:26:00	dansmith	but the difference is,
16:26:14	mriedem	the external thing should be waiting for task_state to be None to know the operation is done (or the instance action is finished/error, or the migration status is 'finished' or whatever in this case)
16:26:18	dansmith	the merged patch corrected state before it changed from $orig to MIGRATING or whatever, right?
16:26:20	mriedem	polling the vm_state in the API is not sufficient
16:26:42	dansmith	the auto-revert one has it go into all the migrating states and then pop back
16:27:16	dansmith	specifically, potentially pop back to ACTIVE and not have moved, IIRC
16:27:43	melwitt	yes, I believe it did prevent an ERROR state that occurred before going to MIGRATING
16:28:22	mriedem	i'm getting lost in the "it" references here when talking about separate changes
16:28:32	melwitt	heh, sorry. the merged change
16:28:35	mriedem	booth's change was,
16:28:51	mriedem	resize/cold migrate failed somewhere and somehow, and the instance was set to ERROR status, right?
16:29:14	mriedem	and if you tried doing a revertResize API call on that ERROR instance, it would do the revert resize flow to go back from the dest to the source host
16:29:31	dansmith	no, it did a full revert I think
16:29:37	mriedem	even though what we could have failed on was maybe something in prep_resize or resize_instance before the guest / volumes / networking ever actually got to the dest host
16:30:10	dansmith	so we get to the dest host, fail, auto-revert back to source, and go back to ACTIVE
16:30:23	dansmith	you wait for ACTIVE to mean "success" but really it failed and the instance hasn't resized or moved
16:30:37	melwitt	yeah, I think it was a full revert on the booth change. i.e. do automatically what a user would have to do, initiate a revert
16:30:37	mriedem	oh i see https://review.opendev.org/#/c/462521/12/nova/compute/manager.py@4449
16:30:39	dansmith	granted it's been 18 months since I last looked at this
16:30:52	dansmith	it's really the opposite of what mriedem's change was doing,
16:31:06	dansmith	which was keep it active if we don't start
16:31:22	mriedem	or stopped rather than active...
16:31:33	dansmith	well, and that's an important piece yeah
16:31:37	mriedem	i.e. start resize with a stopped server, prep_resize fails, don't reset to active because it's stopped
16:31:44	dansmith	right
16:31:53	mriedem	eventually the power sync task would stop the instance i think but still
16:32:08	dansmith	or restart it when it shouldn't, right?
16:32:17	melwitt	yeah, makes sense
16:32:18	dansmith	if vm_state is active, it was stopped, power state sync says "hmm, this should be running"
16:32:21	mriedem	i don't think that task ever starts anything
16:32:38	mriedem	even though people have asked for that in the past
16:32:54	dansmith	no? I thought it would for things like post-host-failure recovery
16:32:56	mriedem	i believe the reasoning was always, we don't want to turn things on by guessing and then bill the user
16:33:11	dansmith	well, billing is unrelated to started or stopped, but okay :)
16:33:26	dansmith	it's a complex enough not-really-a-state-machine that I'm sure I'm getting it wrong
16:33:28	mriedem	depends on how you do your billing
16:33:35	dansmith	regardless, ACTIVE but not running is about as bad
16:33:37	mriedem	same - it's been a long time since i looked
16:34:09	mriedem	anyway, i agree that if i'm doing a resize (and i'm sure tempest would do this), you're waiting for the instance to go to VERIFY_RESIZE with task_state=None,
16:34:23	mriedem	if the instance goes back to ACTIVE with task_state=None, i'd wait indefinitely
16:34:28	mriedem	unless i've got a timeout,
16:34:43	dansmith	especially if you went into RESIZING in between
16:34:46	mriedem	or also checking instance actions or migration status (which might be admin-only info)
16:35:10	mriedem	i personally wouldn't try to track the task_state transitions since that's probably a losing game
16:35:21	mriedem	i would just wait for terminal states but yeah
16:35:31	dansmith	the thing is, ACTIVE is a terminal state for auto-confirm
16:35:45	mriedem	true yeah
16:35:46	dansmith	so if it went ACTIVE -> RESIZING -> ACTIVE, you should assume it actually resized and was auto-confirmed
16:35:51	dansmith	but with auto-revert,
16:35:55	mriedem	i know powervc set auto-confirm to 1 second
16:35:56	dansmith	that breaks that behavior
16:36:12	mriedem	lbragstad had to fix a few race bugs as a result :)
16:36:15	dansmith	with auto-revert, ACTIVE->RESIZING->ACTIVE could mean "it worked" or "it didn't"
16:36:35	mriedem	dansmith: yeah, and you wouldn't know unless you checked the migration or instance actions, and you as a non-admin might not have access to those details
16:36:42	melwitt	yeah, I see
16:36:56	dansmith	it turns waiting for a terminal state into a much more complex affair for sure
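
The "wait for task_state to be None, then look at the terminal state" approach described above can be sketched roughly as follows. This is only an illustration, not code from nova or tempest: `wait_for_terminal` and the `get_server` callable are hypothetical names, and the sketch assumes the server dict carries the `status` and `OS-EXT-STS:task_state` fields of a server show response.

```python
import time

# Statuses treated as "the operation has settled" once task_state is None.
# This set is an assumption for illustration.
TERMINAL = {"VERIFY_RESIZE", "ACTIVE", "SHUTOFF", "ERROR"}


def wait_for_terminal(get_server, timeout=300.0, interval=2.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll until task_state is None and status is terminal, or time out.

    get_server is a hypothetical callable returning the server as a dict.
    """
    deadline = clock() + timeout
    while True:
        server = get_server()
        settled = server.get("OS-EXT-STS:task_state") is None
        if settled and server["status"] in TERMINAL:
            return server
        if clock() >= deadline:
            raise TimeoutError("server did not reach a terminal state")
        sleep(interval)
```

A caller doing a resize would then compare the returned `status` against VERIFY_RESIZE. As the discussion points out, getting ACTIVE back is ambiguous on its own (auto-confirm versus a failure before anything moved), so the timeout is there to avoid waiting indefinitely and the result still needs a second check.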
16:37:02	openstackgerrit	Merged openstack/nova master: Replace deprecated with_lockmode with with_for_update https://review.opendev.org/666221
16:37:57	melwitt	that's a helpful way to think about it, imagining what a tempest (or func test) would need to do to be able to automate it
16:40:15	mriedem	maybe should link this conversation into the abandoned change so we have that when this comes up again in 2 years :)
16:40:41	melwitt	yeah, that's a good idea. let me do that now
16:44:45	sean-k-mooney	... i started reading the scroll back and i think on second thought i'm not going to do that
16:47:29	sean-k-mooney	melwitt: the only way for a non admin to determine if a cold migrate succeeded would be to check the hashed host id before and after
16:48:19	sean-k-mooney	for resize they could check if the flavor is the one they expected
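
The non-admin checks suggested in the last two remarks might look something like the sketch below. The helper names `host_changed` and `flavor_matches` are hypothetical, and the sketch assumes server dicts with the `hostId` and `flavor` fields of a server show response, captured before and after the operation. The exact keys inside `flavor` depend on the API microversion, so the expected values are passed in as a dict rather than hard-coded.

```python
def host_changed(before, after):
    """True if the hashed host id differs between the two snapshots."""
    return before["hostId"] != after["hostId"]


def flavor_matches(server, expected):
    """True if every expected key/value is present in the server's flavor.

    expected is a caller-supplied dict, e.g. {"vcpus": 4, "ram": 8192};
    which keys exist is an assumption about the microversion in use.
    """
    flavor = server.get("flavor", {})
    return all(flavor.get(key) == value for key, value in expected.items())
```

Note that the host id comparison cannot tell a failed cold migration apart from one that landed back on the same host, so it is a heuristic rather than proof.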