Posted	Nick	Remark
#openstack-nova - 2019-06-26
16:29:37	mriedem	even though what we could have failed on was maybe something in prep_resize or resize_instance before the guest / volumes / networking ever actually got to the dest host
16:30:10	dansmith	so we get to the dest host, fail, auto-revert back to source, and go back to ACTIVE
16:30:23	dansmith	you wait for ACTIVE to mean "success" but really it failed and the instance hasn't resized or moved
16:30:37	mriedem	oh i see https://review.opendev.org/#/c/462521/12/nova/compute/manager.py@4449
16:30:37	melwitt	yeah, I think it was a full revert on the booth change. i.e. do automatically what a user would have to do, initiate a revert
16:30:39	dansmith	granted it's been 18 months since I last looked at this
16:30:52	dansmith	it's really the opposite of what mriedem's change was doing,
16:31:06	dansmith	which was keep it active if we don't start
16:31:22	mriedem	or stopped rather than active...
16:31:33	dansmith	well, and that's an important piece yeah
16:31:37	mriedem	i.e. start resize with a stopped server, prep_resize fails, don't reset to active because it's stopped
16:31:44	dansmith	right
16:31:53	mriedem	eventually the power sync task would stop the instance i think but still
16:32:08	dansmith	or restart it when it shouldn't, right?
16:32:17	melwitt	yeah, makes sense
16:32:18	dansmith	if vm_state is active, it was stopped, power state sync says "hmm, this should be running"
16:32:21	mriedem	i don't think that task ever starts anything
16:32:38	mriedem	even though people have asked for that in the past
16:32:54	dansmith	no? I thought it would for things like post-host-failure recovery
16:32:56	mriedem	i believe the reasoning was always, we don't want to turn things on by guessing and then bill the user
16:33:11	dansmith	well, billing is unrelated to started or stopped, but okay :)
16:33:26	dansmith	it's a complex enough not-really-a-state-machine that I'm sure I'm getting it wrong
16:33:28	mriedem	depends on how you do your billing
16:33:35	dansmith	regardless, ACTIVE but not running is about as bad
16:33:37	mriedem	same - it's been a long time since i looked
16:34:09	mriedem	anyway, i agree that if i'm doing a resize (and i'm sure tempest would do this), you're waiting for the instance to go to VERIFY_RESIZE with task_state=None,
16:34:23	mriedem	if the instance goes back to ACTIVE with task_state=None, i'd wait indefinitely
16:34:28	mriedem	unless i've got a timeout,
16:34:43	dansmith	especially if you went into RESIZING in between
16:34:46	mriedem	or also checking instance actions or migration status (which might be admin-only info)
16:35:10	mriedem	i personally wouldn't try to track the task_state transitions since that's probably a losing game
16:35:21	mriedem	i would just wait for terminal states but yeah
16:35:31	dansmith	the thing is, ACTIVE is a terminal state for auto-confirm
16:35:45	mriedem	true yeah
16:35:46	dansmith	so if it went ACTIVE -> RESIZING -> ACTIVE, you should assume it actually resized and was auto-confirmed
16:35:51	dansmith	but with auto-revert,
16:35:55	mriedem	i know powervc set auto-confirm to 1 second
16:35:56	dansmith	that breaks that behavior
16:36:12	mriedem	lbragstad had to fix a few race bugs as a result :)
16:36:15	dansmith	with auto-revert, ACTIVE->RESIZING->ACTIVE could mean "it worked" or "it didn't"
16:36:35	mriedem	dansmith: yeah, and you wouldn't know unless you checked the migration or instance actions, and you as a non-admin might not have access to those details
16:36:42	melwitt	yeah, I see
16:36:56	dansmith	it turns waiting for a terminal state into a much more complex affair for sure
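The ambiguity discussed above can be sketched as a small classifier over the statuses a poller observed. This is only an illustration: `classify_resize` and its `auto_revert` flag are hypothetical names, and the status labels follow the wording used in this conversation rather than exact API strings.

```python
from typing import Iterable


def classify_resize(statuses: Iterable[str], auto_revert: bool = False) -> str:
    """Classify a resize from the ordered statuses a poller observed.

    auto_revert models the behavior proposed in the abandoned change
    (hypothetical flag, not a real nova option).
    """
    seen_resizing = False
    for status in statuses:
        if status == "RESIZING":
            seen_resizing = True
        elif status == "VERIFY_RESIZE":
            return "resized, awaiting confirm"
        elif status == "ERROR":
            return "failed"
        elif status == "ACTIVE" and seen_resizing:
            # With auto-confirm only, ACTIVE after RESIZING means success.
            # With auto-revert, the same sequence could also be a failure.
            return "ambiguous" if auto_revert else "resized, auto-confirmed"
    return "still waiting"


print(classify_resize(["ACTIVE", "RESIZING", "ACTIVE"]))
print(classify_resize(["ACTIVE", "RESIZING", "ACTIVE"], auto_revert=True))
```

With auto-revert in play, the poller would need a second signal (flavor, migration record, or instance action) to resolve the "ambiguous" case.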
16:37:02	openstackgerrit	Merged openstack/nova master: Replace deprecated with_lockmode with with_for_update https://review.opendev.org/666221
16:37:57	melwitt	that's a helpful way to think about it, imagining what a tempest (or func test) would need to do to be able to automate it
16:40:15	mriedem	maybe should link this conversation into the abandoned change so we have that when this comes up again in 2 years :)
16:40:41	melwitt	yeah, that's a good idea. let me do that now
16:44:45	sean-k-mooney	... i started reading the scrollback and i think on second thought i'm not going to do that
16:47:29	sean-k-mooney	melwitt: the only way for a non admin to determine if a cold migrate succeeded would be to check the hashed host id before and after
16:48:19	sean-k-mooney	for resize they could check if the flavor is the one they expected
16:48:31	dansmith	sean-k-mooney: not really
16:48:43	dansmith	oh for a strict migration, yeah
16:48:51	dansmith	was going to say, resize to same host breaks that assumption
16:49:02	melwitt	sean-k-mooney: could also observe ACTIVE -> RESIZING -> ACTIVE as dansmith described, right? as non-admin
16:49:22	dansmith	melwitt: yes
16:49:34	sean-k-mooney	you could observe it if you poll but you would not know if it succeeded or failed
16:49:43	sean-k-mooney	without also checking if the flavor is the old or new one
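A minimal sketch of the non-admin checks suggested here, assuming the caller has already captured the hashed host id and the flavor from the server details before and after the operation. The snapshot keys `host_id` and `flavor` and both function names are hypothetical.

```python
def cold_migrate_moved(before: dict, after: dict) -> bool:
    """True if the hashed host id changed between the two snapshots."""
    return before["host_id"] != after["host_id"]


def resize_applied(after: dict, expected_flavor: str) -> bool:
    """True if the server now reports the flavor the user asked for.

    Unlike the host id check, this still works for a resize to the same host.
    """
    return after["flavor"] == expected_flavor


before = {"host_id": "abc123", "flavor": "m1.small"}
after = {"host_id": "abc123", "flavor": "m1.medium"}
print(cold_migrate_moved(before, after), resize_applied(after, "m1.medium"))
```

The example snapshot is a same-host resize, which is the case where comparing host ids alone would wrongly suggest nothing happened.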
16:49:58	dansmith	sean-k-mooney: you won't go back to active from resizing currently
16:50:28	sean-k-mooney	oh ok so that was the change ye were talking about
16:50:36	melwitt	sean-k-mooney: if it failed [after going to RESIZING] it would go to ERROR. are you talking hypothetically with the abandoned patch?
16:51:15	sean-k-mooney	melwitt: there are cases where i thought it would auto revert and go back to active
16:51:42	melwitt	sean-k-mooney: no, that was the proposal in the abandoned patch
16:52:21	sean-k-mooney	ok i might be thinking about live migrate then
16:52:41	sean-k-mooney	for live migrate we can fail to migrate but still be in active
16:58:20	sean-k-mooney	so ya looking at code search revert_resize is only ever called from the api which simplifies some things but not others
16:59:28	sean-k-mooney	melwitt: do we currently allow you to revert a resize for an instance that is in error because the resize failed
16:59:58	sean-k-mooney	so you can go active->resizing->error->active?
17:00:36	mriedem	fwiw, as a non-admin i think you can tell if your resize failed if the instance action "message" is not null /servers/{server_id}/os-instance-actions/{request_id}
17:00:40	mriedem	er GET /servers/{server_id}/os-instance-actions/{request_id}
17:00:43	melwitt	sean-k-mooney: I think so, based on the abandoned patch. it was proposing to do that automatically (from error)
17:01:37	sean-k-mooney	melwitt: ok if we did not you would have to do reset state (which is admin only?) + a hard reboot
17:02:12	melwitt	mriedem: you mean failed before resize started right
17:02:29	mriedem	no if the operation failed
17:02:49	mriedem	if any event in an action (operation) fails, the overall action 'message' is always 'Error': https://github.com/openstack/nova/blob/707deb158996d540111c23afd8c916ea1c18906a/nova/db/sqlalchemy/api.py#L5227
17:02:53	melwitt	sean-k-mooney: if we did not allow revert from error? I don't think reset state + reboot would put everything back properly
17:02:56	mriedem	which is actually a bug...
17:03:18	mriedem	https://bugs.launchpad.net/nova/+bug/1824420
17:03:19	openstack	Launchpad bug 1824420 in OpenStack Compute (nova) "Live migration succeeds but instance-action-list still has unexpected Error status" [Undecided,Triaged]
17:03:44	melwitt	oh
17:04:30	mriedem	so before we go down the road of "well the user can track the operation to see if it was auto-reverted on error because of instance actions" let me point out that relying on instance actions that way isn't foolproof because of that bug
17:04:43	mriedem	and especially b/c it's a result of failures on hosts and then doing reschedules to other hosts
17:04:45	mriedem	which resize can do
17:04:52	sean-k-mooney	the instance should become active on the source host but it might not fix the allocation in placement properly
17:05:24	mriedem	auto-reverting a failed resize could be all sorts of f'ed up
17:05:29	mriedem	because rollbacks are near impossible
17:05:40	mriedem	hard to test
17:05:58	mriedem	i'm fairly certain our live migration rollback code is also quite janky in several ways
17:06:03	mriedem	because we don't test it in the gate
17:08:07	sean-k-mooney	just looking at that bug the live migration failed right?
17:09:07	sean-k-mooney	so we would expect there to be an error in the instance action log?
17:10:06	mriedem	no
17:10:10	mriedem	read my comments on the bug
17:10:29	sean-k-mooney	maybe i'm misreading it as it's kind of hard to read the initial bug
17:10:31	mriedem	a pre-check on one of the candidate dest hosts failed
17:10:44	mriedem	which triggers a reschedule to another dest host in the conductor live migration task
17:10:49	mriedem	the 2nd host works
17:11:17	mriedem	but b/c the pre-check failed on the first dest host the instance action event for that one is error which sets the overall action message to 'Error'
17:11:36	sean-k-mooney	ah ok
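The false positive described above can be modeled in a few lines. This is a simplified sketch of the behavior as explained in this conversation, not nova's actual code: `overall_message` and `looks_failed` are hypothetical helpers, and events are reduced to their result strings.

```python
from typing import List, Optional


def overall_message(event_results: List[str]) -> Optional[str]:
    """Simplified model: any event with result 'Error' marks the whole action."""
    return "Error" if any(r == "Error" for r in event_results) else None


def looks_failed(message: Optional[str]) -> bool:
    """The naive non-admin check: a non-null action message means failure."""
    return message is not None


# Pre-check fails on the first candidate host, the reschedule to the
# second host succeeds, yet the action as a whole still reads as failed.
events = ["Error", "Success"]
print(looks_failed(overall_message(events)))
```

Under this model the naive check reports a failure even though the operation ended successfully on the second host, which is why the action message alone is not a reliable signal.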