| Posted | Nick | Remark | |
|---|---|---|---|
| #openstack-nova - 2019-06-26 | |||
| 16:58:20 | sean-k-mooney | so ya looking at code earch revert_resize is only ever called form the api which simplifes some things but not others | |
| 16:59:28 | sean-k-mooney | melwitt: do we currently allow you to revert a resize for an instance that is in error because the resize failed | |
| 16:59:58 | sean-k-mooney | so you can go active->resizeing->error->active? | |
| 17:00:36 | mriedem | fwiw, as a non-admin i think you can tell if your resize failed if the instance action "message" is not null /servers/{server_id}/os-instance-actions/{request_id} | |
| 17:00:40 | mriedem | er GET /servers/{server_id}/os-instance-actions/{request_id} | |
| 17:00:43 | melwitt | sean-k-mooney: I think so, based on the abandoned patch. it was proposing to do that automatically (from error) | |
| 17:01:37 | sean-k-mooney | melwitt: ok if we did not you would have to do rest state (which is admin only?) + a hard reboot | |
| 17:02:12 | melwitt | mriedem: you mean failed before resize started right | |
| 17:02:29 | mriedem | no if the operation failed | |
| 17:02:49 | mriedem | if any event in an action (operation) fails, the overall action 'message' is always 'Error': https://github.com/openstack/nova/blob/707deb158996d540111c23afd8c916ea1c18906a/nova/db/sqlalchemy/api.py#L5227 | |
| 17:02:53 | melwitt | sean-k-mooney: if we did not allow revert from error? I don't think reset state + reboot would put everything back properly | |
| 17:02:56 | mriedem | which is actually a bug... | |
| 17:03:18 | mriedem | https://bugs.launchpad.net/nova/+bug/1824420 | |
| 17:03:19 | openstack | Launchpad bug 1824420 in OpenStack Compute (nova) "Live migration succeeds but instance-action-list still has unexpected Error status" [Undecided,Triaged] | |
| 17:03:44 | melwitt | oh | |
| 17:04:30 | mriedem | so before we go down the road of "well the user can track the operatoin to see if it was auto-reverted on error because of instance actions" let me point out that relying on instance actions that way isn't fool proof because of that bug | |
| 17:04:43 | mriedem | and especially b/c it's a result of failures on hosts and then doing reschedules to other hosts | |
| 17:04:45 | mriedem | which resize can do | |
| 17:04:52 | sean-k-mooney | the instace should become active on the source host but it might not fix the allocation in placment properly | |
| 17:05:24 | mriedem | auto-reverting a failed resize could be all sorts of f'ed up | |
| 17:05:29 | mriedem | because rollbacks are near impossible | |
| 17:05:40 | mriedem | hard to test | |
| 17:05:58 | mriedem | i'm fairly certain our live migration rollback code is also quite janky in several ways | |
| 17:06:03 | mriedem | because we don't test it in the gate | |
| 17:08:07 | sean-k-mooney | just looking at that bug the live migration failed right? | |
| 17:09:07 | sean-k-mooney | so we would exepct there to be an error in the instance action log? | |
| 17:10:06 | mriedem | no | |
| 17:10:10 | mriedem | read my comments on the bug | |
| 17:10:29 | sean-k-mooney | maybe im missreading it as its kind of hard to read the initilal bug | |
| 17:10:31 | mriedem | a pre-check on one of the candidate dest hosts failed | |
| 17:10:44 | mriedem | which triggers a reschedule to another dest host in the conductor live migration task | |
| 17:10:49 | mriedem | the 2nd host works | |
| 17:11:17 | mriedem | but b/c the pre-check failed on the first dest host the instance action event for that one is error which sets the overall action message to 'Error' | |
| 17:11:36 | sean-k-mooney | ah ok | |
| 17:11:46 | mriedem | iow, actions aren't reschedule aware | |
| 17:11:56 | mriedem | or what is a non-fatal error | |
| 17:12:05 | sean-k-mooney | right | |
| 17:12:27 | sean-k-mooney | should we be loging the prechecks as events? | |
| 17:12:59 | sean-k-mooney | i was not exepcting to see compute_check_can_live_migrate_destination events | |
| 17:13:25 | mriedem | hard to say | |
| 17:13:33 | mriedem | if you configure nova to not have alternate hosts for reschedules | |
| 17:13:46 | mriedem | then you'd likely want to know it's that dest pre-check that failed right? | |
| 17:14:18 | sean-k-mooney | maybe or jsut that you had no valid hosts? | |
| 17:14:29 | sean-k-mooney | / exausted teh list of alternates | |
| 17:14:57 | sean-k-mooney | i thikn we would still log the failure right | |
| 17:15:02 | openstackgerrit | Merged openstack/nova master: Remove orphaned comment from _get_group_details https://review.opendev.org/667135 | |
| 17:15:07 | mriedem | sure, if you set https://docs.openstack.org/nova/latest/configuration/config.html#DEFAULT.migrate_max_retries to 0 for now retries, or you don't have any alternate hosts | |
| 17:15:44 | mriedem | idk, anyway, it's tangential to the auto-revert failed resize thing mel was asking about | |
| 17:15:45 | sean-k-mooney | for me this feels like we are leaking an implemenation detail as an event | |
| 17:16:04 | sean-k-mooney | ya it is | |
| 17:16:11 | mriedem | instance action events are basically entirely leaked implementation details :) | |
| 17:16:17 | mriedem | the event names come from the methods they decorate | |
| 17:16:27 | mriedem | there is no guarantee on api stability for those things | |
| 17:17:17 | sean-k-mooney | ok personally i would prefer not to decorate that function | |
| 17:17:45 | sean-k-mooney | but as you said its tangental to melwitt's topic | |
| 17:36:44 | openstackgerrit | Lee Yarwood proposed openstack/nova master: libvirt: Add a rbd_connect_timeout configurable https://review.opendev.org/667421 | |
| 17:36:57 | openstackgerrit | Eric Fried proposed openstack/nova-specs master: grammar fix for show-server-numa-topology spec https://review.opendev.org/667487 | |
| 17:38:12 | Nick_A | any idea why metadata would send all /24 routes in a region to each instance? http://paste.openstack.org/show/y0lE42EA59yhnu7G1KnY/ | |
| 17:38:25 | openstackgerrit | Matt Riedemann proposed openstack/nova master: Fix AttributeError in RT._update_usage_from_migration https://review.opendev.org/667687 | |
| 17:38:26 | openstackgerrit | Matt Riedemann proposed openstack/nova master: Fix RT init arg order in test_unsupported_move_type https://review.opendev.org/667688 | |
| 17:48:03 | openstackgerrit | Ghanshyam Mann proposed openstack/nova master: Multiple API cleanup changes https://review.opendev.org/666889 | |
| 17:59:52 | sean-k-mooney | yonglihe: efried just reviewing https://review.opendev.org/#/c/648912/14 but why are we looking up instance by name? | |
| 18:01:58 | efried | sean-k-mooney: I haven't the foggiest. I'm involved here in an administrative capacity :) | |
| 18:02:43 | sean-k-mooney | the domain xml has the instance uuid stored in the uuid field for several release now so im wondering why we would use the instance domain name instead | |
| 18:03:43 | melwitt | sean-k-mooney: been meaning to get to that review. after what we briefly discussed at the ptg, that change would be best to piggyback onto the existing periodic for handling deleted instances. I don't know why it would be looking up instances by name | |
| 18:03:45 | efried | k, hopefully yonglihe will be able to answer on the review. Thanks sean-k-mooney. | |
| 18:04:20 | sean-k-mooney | melwitt: it is piggybacking on that task | |
| 18:04:42 | melwitt | but I see a lot of new methods | |
| 18:04:42 | sean-k-mooney | efried: ok im trying to find where it gets teh list of suspected instances | |
| 18:05:16 | sean-k-mooney | melwitt: ya there are. im not sure if they are all needed. | |
| 18:05:36 | melwitt | yeah, I would think there should be none new needed | |
| 18:06:18 | sean-k-mooney | well we need a new method to query the driver for the instance that are runnin on the host but not in the db | |
| 18:07:07 | sean-k-mooney | and then we can call the old meethods that implement the policy. e.g. reap or log or do nothing whatever you have set in the config | |
| 18:07:31 | melwitt | why? there's a self._get_instances_on_driver method already | |
| 18:08:08 | sean-k-mooney | that is a good question :) | |
| 18:08:46 | sean-k-mooney | i have only properly looked at https://review.opendev.org/#/c/648912/14 whic is doing driver change but i need to look at how that is being used in https://review.opendev.org/#/c/627765/28 | |
| 18:08:51 | melwitt | anyway, just saying skimming those patches I don't see why they're so large | |
| 18:09:29 | melwitt | or rather, I expected not such a large patch for this | |
| 18:13:16 | sean-k-mooney | melwitt: this seam oververly complex https://review.opendev.org/#/c/627765/28/nova/compute/manager.py@8884 | |
| 18:14:32 | sean-k-mooney | also _destroy_orphan_instances is not a greate name for that since it might not destroy anything . | |
| 18:15:02 | sean-k-mooney | melwitt: i was wrong however its adding a new periodici task not extending the existing tasks | |
| 18:41:20 | melwitt | sean-k-mooney: yeah, that's what I had thought. it should get much simpler if it's changed to extend the existing periodic. and I think the suggestion at the ptg from dansmith was to add another enumerated choice to the existing conf option that is something like "reap-unknown" which will reap both known deleted and unknown guests. and otherwise just log unknowns | |
| 18:43:20 | openstackgerrit | Merged openstack/nova master: Fix update_provider_tree signature in reference docs https://review.opendev.org/667408 | |
| 18:43:52 | openstackgerrit | Eric Fried proposed openstack/nova-specs master: grammar fix for show-server-numa-topology spec https://review.opendev.org/667487 | |
| 18:44:18 | mriedem | sean-k-mooney: it's named _destroy* because it's similar to the existing _destroy_running_instances | |
| 18:44:21 | mriedem | which might also not destroy anything | |
| 18:44:38 | sean-k-mooney | mriedem: ah ok | |
| 18:45:12 | sean-k-mooney | my main issue with this is it will only really work for the libvirt dirver | |
| 18:45:17 | sean-k-mooney | and the fake driver | |
| 18:53:52 | mriedem | yup | |
| 18:54:01 | mriedem | i believe i noted something along those lines last time i did a deep review on it | |
| 18:54:07 | openstackgerrit | Miguel Ángel Herranz Trillo proposed openstack/nova master: Fix type error on call to mount device https://review.opendev.org/659780 | |
| 18:54:13 | mriedem | i seem to remember why they needed to lookup by name at the time, and it was libvirt-specific | |
| 18:54:35 | sean-k-mooney | ya im looking at it again | |
| 18:55:02 | sean-k-mooney | its because we can have libvirt domains that are for instance that nova created but have been deleted form the db | |
| 18:55:40 | sean-k-mooney | so to destory the running guest they need to use the libvirt domain name | |
| 18:55:58 | mriedem | having said all that, i'm not opposed to starting with something that could eventually be implemented by other drivers | |
| 18:56:07 | mriedem | as long as it's graceful about other drivers not implementing the necessary interface | |
| 18:56:31 | sean-k-mooney | ya it handels the not implmetned excetion correctly in the manager | |
| 18:56:58 | sean-k-mooney | im think of asking them to use the uuid instead of the name however | |