Earlier  
Posted Nick Remark
#openstack-nova - 2019-06-26
16:52:41 sean-k-mooney for live migrate we can fail to migrate but still be in active
16:58:20 sean-k-mooney so ya looking at code earch revert_resize is only ever called form the api which simplifes some things but not others
16:59:28 sean-k-mooney melwitt: do we currently allow you to revert a resize for an instance that is in error because the resize failed
16:59:58 sean-k-mooney so you can go active->resizeing->error->active?
17:00:36 mriedem fwiw, as a non-admin i think you can tell if your resize failed if the instance action "message" is not null /servers/{server_id}/os-instance-actions/{request_id}
17:00:40 mriedem er GET /servers/{server_id}/os-instance-actions/{request_id}
17:00:43 melwitt sean-k-mooney: I think so, based on the abandoned patch. it was proposing to do that automatically (from error)
17:01:37 sean-k-mooney melwitt: ok if we did not you would have to do rest state (which is admin only?) + a hard reboot
17:02:12 melwitt mriedem: you mean failed before resize started right
17:02:29 mriedem no if the operation failed
17:02:49 mriedem if any event in an action (operation) fails, the overall action 'message' is always 'Error': https://github.com/openstack/nova/blob/707deb158996d540111c23afd8c916ea1c18906a/nova/db/sqlalchemy/api.py#L5227
17:02:53 melwitt sean-k-mooney: if we did not allow revert from error? I don't think reset state + reboot would put everything back properly
17:02:56 mriedem which is actually a bug...
17:03:18 mriedem https://bugs.launchpad.net/nova/+bug/1824420
17:03:19 openstack Launchpad bug 1824420 in OpenStack Compute (nova) "Live migration succeeds but instance-action-list still has unexpected Error status" [Undecided,Triaged]
17:03:44 melwitt oh
17:04:30 mriedem so before we go down the road of "well the user can track the operatoin to see if it was auto-reverted on error because of instance actions" let me point out that relying on instance actions that way isn't fool proof because of that bug
17:04:43 mriedem and especially b/c it's a result of failures on hosts and then doing reschedules to other hosts
17:04:45 mriedem which resize can do
17:04:52 sean-k-mooney the instace should become active on the source host but it might not fix the allocation in placment properly
17:05:24 mriedem auto-reverting a failed resize could be all sorts of f'ed up
17:05:29 mriedem because rollbacks are near impossible
17:05:40 mriedem hard to test
17:05:58 mriedem i'm fairly certain our live migration rollback code is also quite janky in several ways
17:06:03 mriedem because we don't test it in the gate
17:08:07 sean-k-mooney just looking at that bug the live migration failed right?
17:09:07 sean-k-mooney so we would exepct there to be an error in the instance action log?
17:10:06 mriedem no
17:10:10 mriedem read my comments on the bug
17:10:29 sean-k-mooney maybe im missreading it as its kind of hard to read the initilal bug
17:10:31 mriedem a pre-check on one of the candidate dest hosts failed
17:10:44 mriedem which triggers a reschedule to another dest host in the conductor live migration task
17:10:49 mriedem the 2nd host works
17:11:17 mriedem but b/c the pre-check failed on the first dest host the instance action event for that one is error which sets the overall action message to 'Error'
17:11:36 sean-k-mooney ah ok
17:11:46 mriedem iow, actions aren't reschedule aware
17:11:56 mriedem or what is a non-fatal error
17:12:05 sean-k-mooney right
17:12:27 sean-k-mooney should we be loging the prechecks as events?
17:12:59 sean-k-mooney i was not exepcting to see compute_check_can_live_migrate_destination events
17:13:25 mriedem hard to say
17:13:33 mriedem if you configure nova to not have alternate hosts for reschedules
17:13:46 mriedem then you'd likely want to know it's that dest pre-check that failed right?
17:14:18 sean-k-mooney maybe or jsut that you had no valid hosts?
17:14:29 sean-k-mooney / exausted teh list of alternates
17:14:57 sean-k-mooney i thikn we would still log the failure right
17:15:02 openstackgerrit Merged openstack/nova master: Remove orphaned comment from _get_group_details https://review.opendev.org/667135
17:15:07 mriedem sure, if you set https://docs.openstack.org/nova/latest/configuration/config.html#DEFAULT.migrate_max_retries to 0 for now retries, or you don't have any alternate hosts
17:15:44 mriedem idk, anyway, it's tangential to the auto-revert failed resize thing mel was asking about
17:15:45 sean-k-mooney for me this feels like we are leaking an implemenation detail as an event
17:16:04 sean-k-mooney ya it is
17:16:11 mriedem instance action events are basically entirely leaked implementation details :)
17:16:17 mriedem the event names come from the methods they decorate
17:16:27 mriedem there is no guarantee on api stability for those things
17:17:17 sean-k-mooney ok personally i would prefer not to decorate that function
17:17:45 sean-k-mooney but as you said its tangental to melwitt's topic
17:36:44 openstackgerrit Lee Yarwood proposed openstack/nova master: libvirt: Add a rbd_connect_timeout configurable https://review.opendev.org/667421
17:36:57 openstackgerrit Eric Fried proposed openstack/nova-specs master: grammar fix for show-server-numa-topology spec https://review.opendev.org/667487
17:38:12 Nick_A any idea why metadata would send all /24 routes in a region to each instance? http://paste.openstack.org/show/y0lE42EA59yhnu7G1KnY/
17:38:25 openstackgerrit Matt Riedemann proposed openstack/nova master: Fix AttributeError in RT._update_usage_from_migration https://review.opendev.org/667687
17:38:26 openstackgerrit Matt Riedemann proposed openstack/nova master: Fix RT init arg order in test_unsupported_move_type https://review.opendev.org/667688
17:48:03 openstackgerrit Ghanshyam Mann proposed openstack/nova master: Multiple API cleanup changes https://review.opendev.org/666889
17:59:52 sean-k-mooney yonglihe: efried just reviewing https://review.opendev.org/#/c/648912/14 but why are we looking up instance by name?
18:01:58 efried sean-k-mooney: I haven't the foggiest. I'm involved here in an administrative capacity :)
18:02:43 sean-k-mooney the domain xml has the instance uuid stored in the uuid field for several release now so im wondering why we would use the instance domain name instead
18:03:43 melwitt sean-k-mooney: been meaning to get to that review. after what we briefly discussed at the ptg, that change would be best to piggyback onto the existing periodic for handling deleted instances. I don't know why it would be looking up instances by name
18:03:45 efried k, hopefully yonglihe will be able to answer on the review. Thanks sean-k-mooney.
18:04:20 sean-k-mooney melwitt: it is piggybacking on that task
18:04:42 melwitt but I see a lot of new methods
18:04:42 sean-k-mooney efried: ok im trying to find where it gets teh list of suspected instances
18:05:16 sean-k-mooney melwitt: ya there are. im not sure if they are all needed.
18:05:36 melwitt yeah, I would think there should be none new needed
18:06:18 sean-k-mooney well we need a new method to query the driver for the instance that are runnin on the host but not in the db
18:07:07 sean-k-mooney and then we can call the old meethods that implement the policy. e.g. reap or log or do nothing whatever you have set in the config
18:07:31 melwitt why? there's a self._get_instances_on_driver method already
18:08:08 sean-k-mooney that is a good question :)
18:08:46 sean-k-mooney i have only properly looked at https://review.opendev.org/#/c/648912/14 whic is doing driver change but i need to look at how that is being used in https://review.opendev.org/#/c/627765/28
18:08:51 melwitt anyway, just saying skimming those patches I don't see why they're so large
18:09:29 melwitt or rather, I expected not such a large patch for this
18:13:16 sean-k-mooney melwitt: this seam oververly complex https://review.opendev.org/#/c/627765/28/nova/compute/manager.py@8884
18:14:32 sean-k-mooney also _destroy_orphan_instances is not a greate name for that since it might not destroy anything .
18:15:02 sean-k-mooney melwitt: i was wrong however its adding a new periodici task not extending the existing tasks
18:41:20 melwitt sean-k-mooney: yeah, that's what I had thought. it should get much simpler if it's changed to extend the existing periodic. and I think the suggestion at the ptg from dansmith was to add another enumerated choice to the existing conf option that is something like "reap-unknown" which will reap both known deleted and unknown guests. and otherwise just log unknowns
18:43:20 openstackgerrit Merged openstack/nova master: Fix update_provider_tree signature in reference docs https://review.opendev.org/667408
18:43:52 openstackgerrit Eric Fried proposed openstack/nova-specs master: grammar fix for show-server-numa-topology spec https://review.opendev.org/667487
18:44:18 mriedem sean-k-mooney: it's named _destroy* because it's similar to the existing _destroy_running_instances
18:44:21 mriedem which might also not destroy anything
18:44:38 sean-k-mooney mriedem: ah ok
18:45:12 sean-k-mooney my main issue with this is it will only really work for the libvirt dirver
18:45:17 sean-k-mooney and the fake driver
18:53:52 mriedem yup
18:54:01 mriedem i believe i noted something along those lines last time i did a deep review on it
18:54:07 openstackgerrit Miguel Ángel Herranz Trillo proposed openstack/nova master: Fix type error on call to mount device https://review.opendev.org/659780
18:54:13 mriedem i seem to remember why they needed to lookup by name at the time, and it was libvirt-specific
18:54:35 sean-k-mooney ya im looking at it again
18:55:02 sean-k-mooney its because we can have libvirt domains that are for instance that nova created but have been deleted form the db
18:55:40 sean-k-mooney so to destory the running guest they need to use the libvirt domain name
18:55:58 mriedem having said all that, i'm not opposed to starting with something that could eventually be implemented by other drivers
18:56:07 mriedem as long as it's graceful about other drivers not implementing the necessary interface
18:56:31 sean-k-mooney ya it handels the not implmetned excetion correctly in the manager

Earlier   Later