Posted Nick Remark
#openstack-nova - 2019-06-26
16:34:46 mriedem or also checking instance actions or migration status (which might be admin-only inof)
16:34:47 mriedem *info
16:35:10 mriedem i personally wouldn't try to track the task_state transitions since that's probably a losing game
16:35:21 mriedem i would just wait for terminal states but yeah
16:35:31 dansmith the thing is, ACTIVE is a terminal state for auto-confirm
16:35:45 mriedem true yeah
16:35:46 dansmith so if it went ACTIVE -> RESIZING -> ACTIVE, you should assume it actually resized and was auto-confirmed
16:35:51 dansmith but with auto-revert,
16:35:55 mriedem i know powervc set auto-confirm to 1 second
16:35:56 dansmith that breaks that behavior
16:36:12 mriedem lbragstad had to fix a few race bugs as a result :)
16:36:15 dansmith with auto-revert, ACTIVE->RESIZING->ACTIVE could mean "it worked" or "it didn't"
16:36:35 mriedem dansmith: yeah, and you wouldn't know unless you checked the migration or instance actions, and you as a non-admin might not have access to those details
16:36:42 melwitt yeah, I see
16:36:56 dansmith it turns waiting for a terminal state into a much more complex affair for sure
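A rough sketch of the ambiguity dansmith describes above, assuming a poller that only sees the server status. The function name, the `auto_revert` flag, and the status labels (taken from the chat, not from the Nova API) are illustrative assumptions.

```python
def interpret_resize(statuses, auto_revert=False):
    """Guess the outcome of a resize from the observed status sequence.

    Hypothetical helper: status labels follow the conversation above,
    and auto_revert models the proposal in the abandoned change.
    """
    if not statuses or "RESIZING" not in statuses:
        return "not-started"
    final = statuses[-1]
    if final == "ERROR":
        return "failed"
    if final == "ACTIVE":
        # With auto-confirm only, returning to ACTIVE implies success.
        # With auto-revert as well, the same sequence could be a rollback.
        return "ambiguous" if auto_revert else "resized"
    return "in-progress"
```

With `auto_revert=True`, the caller would need some other signal (such as the flavor or the instance actions) to tell the two outcomes apart.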
16:37:02 openstackgerrit Merged openstack/nova master: Replace deprecated with_lockmode with with_for_update https://review.opendev.org/666221
16:37:57 melwitt that's a helpful way to think about it, imagining what a tempest (or func test) would need to do to be able to automate it
16:40:15 mriedem maybe should link this conversation into the abandoned change so we have that when this comes up again in 2 years :)
16:40:41 melwitt yeah, that's a good idea. let me do that now
16:44:45 sean-k-mooney ... i started reading the scroll back and i think on second thought i'm not going to do that
16:47:29 sean-k-mooney melwitt: the only way for a non admin to determine if a cold migrate succeeded would be to check the hashed host id before and after
16:48:19 sean-k-mooney for resize they could check if the flavor is the one they expected
16:48:31 dansmith sean-k-mooney: not really
16:48:43 dansmith oh for a strict migration, yeah
16:48:51 dansmith was going to say, resize to same host breaks that assumption
16:49:02 melwitt sean-k-mooney: could also observe ACTIVE -> RESIZING -> ACTIVE as dansmith described, right? as non-admin
16:49:22 dansmith melwitt: yes
16:49:34 sean-k-mooney you could observe it if you poll but you would not know if it succeeded or failed
16:49:43 sean-k-mooney without also checking if the flavor is the old or new one
16:49:58 dansmith sean-k-mooney: you won't go back to active from resizing currently
16:50:28 sean-k-mooney oh ok so that was the change ye were talking about
16:50:36 melwitt sean-k-mooney: if it failed [after going to RESIZING] it would go to ERROR. are you talking hypothetically with the abandoned patch?
16:51:15 sean-k-mooney melwitt: there are cases where i thought it would auto revert and go back to active
16:51:42 melwitt sean-k-mooney: no, that was the proposal in the abandoned patch
16:52:21 sean-k-mooney ok i might be thinking about live migrate then
16:52:41 sean-k-mooney for live migrate we can fail to migrate but still be in active
16:58:20 sean-k-mooney so ya looking at code search revert_resize is only ever called from the api which simplifies some things but not others
16:59:28 sean-k-mooney melwitt: do we currently allow you to revert a resize for an instance that is in error because the resize failed
16:59:58 sean-k-mooney so you can go active->resizing->error->active?
17:00:36 mriedem fwiw, as a non-admin i think you can tell if your resize failed if the instance action "message" is not null /servers/{server_id}/os-instance-actions/{request_id}
17:00:40 mriedem er GET /servers/{server_id}/os-instance-actions/{request_id}
17:00:43 melwitt sean-k-mooney: I think so, based on the abandoned patch. it was proposing to do that automatically (from error)
17:01:37 sean-k-mooney melwitt: ok if we did not you would have to do reset state (which is admin only?) + a hard reboot
17:02:12 melwitt mriedem: you mean failed before resize started right
17:02:29 mriedem no if the operation failed
17:02:49 mriedem if any event in an action (operation) fails, the overall action 'message' is always 'Error': https://github.com/openstack/nova/blob/707deb158996d540111c23afd8c916ea1c18906a/nova/db/sqlalchemy/api.py#L5227
17:02:53 melwitt sean-k-mooney: if we did not allow revert from error? I don't think reset state + reboot would put everything back properly
17:02:56 mriedem which is actually a bug...
17:03:18 mriedem https://bugs.launchpad.net/nova/+bug/1824420
17:03:19 openstack Launchpad bug 1824420 in OpenStack Compute (nova) "Live migration succeeds but instance-action-list still has unexpected Error status" [Undecided,Triaged]
17:03:44 melwitt oh
17:04:30 mriedem so before we go down the road of "well the user can track the operation to see if it was auto-reverted on error because of instance actions" let me point out that relying on instance actions that way isn't foolproof because of that bug
17:04:43 mriedem and especially b/c it's a result of failures on hosts and then doing reschedules to other hosts
17:04:45 mriedem which resize can do
17:04:52 sean-k-mooney the instance should become active on the source host but it might not fix the allocation in placement properly
17:05:24 mriedem auto-reverting a failed resize could be all sorts of f'ed up
17:05:29 mriedem because rollbacks are near impossible
17:05:40 mriedem hard to test
17:05:58 mriedem i'm fairly certain our live migration rollback code is also quite janky in several ways
17:06:03 mriedem because we don't test it in the gate
17:08:07 sean-k-mooney just looking at that bug the live migration failed right?
17:09:07 sean-k-mooney so we would expect there to be an error in the instance action log?
17:10:06 mriedem no
17:10:10 mriedem read my comments on the bug
17:10:29 sean-k-mooney maybe i'm misreading it as it's kind of hard to read the initial bug
17:10:31 mriedem a pre-check on one of the candidate dest hosts failed
17:10:44 mriedem which triggers a reschedule to another dest host in the conductor live migration task
17:10:49 mriedem the 2nd host works
17:11:17 mriedem but b/c the pre-check failed on the first dest host the instance action event for that one is error which sets the overall action message to 'Error'
17:11:36 sean-k-mooney ah ok
17:11:46 mriedem iow, actions aren't reschedule aware
17:11:56 mriedem or what is a non-fatal error
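To illustrate why the overall action message is not reschedule-aware, here is a toy model of the behavior mriedem describes: any failed event marks the whole action as 'Error', even if a later attempt succeeds. The data shapes and the function name are assumptions for illustration, not Nova's actual code.

```python
def overall_message(events):
    """Return 'Error' if any event failed, else None.

    Toy model of the behavior described above; events are
    (name, result) tuples invented for this example.
    """
    for _name, result in events:
        if result == "Error":
            return "Error"
    return None


# Pre-check fails on the first destination host, then the
# reschedule to a second host succeeds.
events = [
    ("compute_check_can_live_migrate_destination", "Error"),
    ("compute_check_can_live_migrate_destination", "Success"),
    ("compute_live_migration", "Success"),  # event name assumed for illustration
]
```

In this model the migration as a whole worked, yet `overall_message(events)` still reports 'Error', which is the situation in the bug linked above.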
17:12:05 sean-k-mooney right
17:12:27 sean-k-mooney should we be logging the prechecks as events?
17:12:59 sean-k-mooney i was not expecting to see compute_check_can_live_migrate_destination events
17:13:25 mriedem hard to say
17:13:33 mriedem if you configure nova to not have alternate hosts for reschedules
17:13:46 mriedem then you'd likely want to know it's that dest pre-check that failed right?
17:14:18 sean-k-mooney maybe or just that you had no valid hosts?
17:14:29 sean-k-mooney / exhausted the list of alternates
17:14:57 sean-k-mooney i think we would still log the failure right
17:15:02 openstackgerrit Merged openstack/nova master: Remove orphaned comment from _get_group_details https://review.opendev.org/667135
17:15:07 mriedem sure, if you set https://docs.openstack.org/nova/latest/configuration/config.html#DEFAULT.migrate_max_retries to 0 for no retries, or you don't have any alternate hosts
17:15:44 mriedem idk, anyway, it's tangential to the auto-revert failed resize thing mel was asking about
17:15:45 sean-k-mooney for me this feels like we are leaking an implementation detail as an event
17:16:04 sean-k-mooney ya it is
17:16:11 mriedem instance action events are basically entirely leaked implementation details :)
17:16:17 mriedem the event names come from the methods they decorate
17:16:27 mriedem there is no guarantee on api stability for those things
17:17:17 sean-k-mooney ok personally i would prefer not to decorate that function
17:17:45 sean-k-mooney but as you said it's tangential to melwitt's topic
17:36:44 openstackgerrit Lee Yarwood proposed openstack/nova master: libvirt: Add a rbd_connect_timeout configurable https://review.opendev.org/667421
17:36:57 openstackgerrit Eric Fried proposed openstack/nova-specs master: grammar fix for show-server-numa-topology spec https://review.opendev.org/667487
17:38:12 Nick_A any idea why metadata would send all /24 routes in a region to each instance? http://paste.openstack.org/show/y0lE42EA59yhnu7G1KnY/
17:38:25 openstackgerrit Matt Riedemann proposed openstack/nova master: Fix AttributeError in RT._update_usage_from_migration https://review.opendev.org/667687
17:38:26 openstackgerrit Matt Riedemann proposed openstack/nova master: Fix RT init arg order in test_unsupported_move_type https://review.opendev.org/667688
17:48:03 openstackgerrit Ghanshyam Mann proposed openstack/nova master: Multiple API cleanup changes https://review.opendev.org/666889
17:59:52 sean-k-mooney yonglihe: efried just reviewing https://review.opendev.org/#/c/648912/14 but why are we looking up instance by name?
18:01:58 efried sean-k-mooney: I haven't the foggiest. I'm involved here in an administrative capacity :)
18:02:43 sean-k-mooney the domain xml has had the instance uuid stored in the uuid field for several releases now so i'm wondering why we would use the instance domain name instead
