| Posted | Nick | Remark | |
|---|---|---|---|
| #openstack-nova - 2019-06-26 | | | |
| 15:12:45 | dansmith | we store some stuff in nwinfo that isn't anywhere else, IIRC, like which ports we created vs. the user, so that has to be persisted somewhere if we were going to use memcache | |
| 15:12:49 | dansmith | ...yeah ;) | |
| 15:13:02 | dansmith | what problem is being solved here? | |
| 15:13:22 | mriedem | i don't think that overhauling to use an external cache service and restarting the compute is the giant hammer we really need for what we're trying to solve | |
| 15:13:29 | sean-k-mooney | nothing at the moment; reworking it is unrelated to what we are trying to fix | |
| 15:13:40 | openstackgerrit | Eric Fried proposed openstack/nova master: Clean up orphan instances virt driver https://review.opendev.org/648912 | |
| 15:13:40 | openstackgerrit | Eric Fried proposed openstack/nova master: clean up orphan instances https://review.opendev.org/627765 | |
| 15:13:44 | mriedem | so this is a....thought exercise? | |
| 15:13:49 | efried | sean-k-mooney, gibi: Would y'all please have another look at these --^ | |
| 15:13:50 | sean-k-mooney | yes | |
| 15:14:11 | sean-k-mooney | it's on my todo list to figure out if it makes sense to even do | |
| 15:14:13 | gibi | efried: I have it open | |
| 15:15:17 | efried | thanks gibi | |
| 15:15:25 | efried | thanks sean-k-mooney | |
| 15:15:34 | efried | sean-k-mooney: fyi it's apparently a thing stx cares about | |
| 15:15:51 | efried | thus presumably it "makes sense" in some capacity :) | |
| 15:17:05 | mriedem | efried: hyperv ci is happy with the update_provider_tree patch https://review.opendev.org/#/c/667417/ | |
| 15:17:17 | efried | mriedem: thanks for the reminder | |
| 15:17:49 | mriedem | efried: fwiw that cleanup orphan instances thing is also something that the public cloud SIG (and huawei public cloud ops) care about as well, which i was initially reviewing it awhile back | |
| 15:18:01 | mriedem | *why i was | |
| 15:18:34 | mriedem | the concern at the last ptg was how much duplication there was with the existing periodic to cleanup running deleted (but not orphaned) instances | |
| 15:20:24 | efried | okay, thanks for that background. | |
| 15:21:13 | mriedem | something something live migration fails and you've got untracked guests on the host consuming resources (which aren't tracked obviously) so then trying to schedule things to those hosts fails b/c you're out of resources | |
| 15:21:39 | efried | sounds like we need a patch to clean up those orphaned instances | |
| 15:22:31 | mriedem | i'm sure lots of operators have already just written scripts to detect and clean those types of thing sup | |
| 15:22:33 | mriedem | *up | |
| 15:22:38 | mriedem | but yeah it's better to have it native probably | |
| 15:42:22 | efried | mriedem: We don't have a way to prove the xen one is being hit, do we? (update_provider_tree) | |
| 15:42:25 | efried | since their CI is dead? | |
| 15:43:54 | efried | mriedem: also, if you haven't already, there should be a note to the ML warning of this (and another before we remove the code path, obvsly) | |
| 15:44:06 | efried | ...for oot folk | |
| 15:44:43 | mriedem | sorry was just doing tech support with my mom | |
| 15:44:53 | efried | (I know nova_powervm is copacetic fwiw) | |
| 15:45:09 | mriedem | i was waiting to send the oot ML email until we were more sure about what i've proposed | |
| 15:45:22 | mriedem | and idk about the xen one if their CI is dead, though it's pretty damn basic | |
| 15:45:25 | mriedem | just a port of get_inventory | |
| 15:46:05 | bauzas | efried: mriedem: heh, the reportclient doesn't of course support all placement API queries, so I wonder whether I should add something like "get_resource_providers()" method in the reportclient just for nova-manage caller, or calling directly the Placement API | |
| 15:46:12 | bauzas | thoughts on that ? | |
| 15:47:03 | efried | bauzas: If it's something simple like GET /resource_providers (you really want all of them?) then yeah, just call SchedulerReportClient.get() | |
| 15:47:17 | bauzas | zactly | |
| 15:47:57 | efried | sfine | |
| 15:48:06 | bauzas | efried: but then I don't have a safe_connect connection | |
| 15:48:14 | mriedem | if you're not going to page, you could be listing 14K providers in the case of cern... | |
| 15:48:15 | efried | bauzas: We don't want @safe_connect | |
| 15:48:19 | efried | ever, anywhere | |
| 15:48:30 | efried | Handle ksa.ClientException at the caller instead. | |
| 15:48:45 | efried | And if you see @safe_connect anywhere in your travels and want to kill it and do that ^, I will buy your drivks. | |
| 15:48:47 | efried | drinks | |
| 15:48:58 | efried | true story | |
| 15:49:08 | bauzas | it's 40°C here, I'm all for a drink | |
| 15:49:15 | efried | bauzas: what are you trying to do with the master list? | |
| 15:49:32 | bauzas | efried: looking up all allocations to see whether they're orphaned | |
| 15:49:39 | bauzas | mriedem: ah shit, excellent point | |
| 15:50:08 | mriedem | you could instead page the compute nodes in the cells and hit this api https://developer.openstack.org/api-ref/placement/?expanded=#list-resource-provider-allocations | |
| 15:50:13 | bauzas | we could possibly need to look at all allocations per resource provider, which would be given by a list of compute services (which is paginated AFAIK) | |
| 15:50:31 | bauzas | heh, jinxed | |
| 15:50:32 | mriedem | compute service != compute node == resource provider | |
| 15:50:43 | bauzas | shit, typo, nodes indeed | |
| 15:50:52 | bauzas | tell me about my Kilo bp | |
| 15:51:38 | mriedem | so once you get the allocations for a given provider, what are you going to do? | |
| 15:51:50 | mriedem | check if an instance (or migration) exists with the given consumer uuid? | |
| 15:51:55 | mriedem | and if not, consider the allocation orphaned? | |
| 15:52:11 | mriedem | iff the allocation has resources that nova "owns" like VCPU | |
| 15:52:26 | mriedem | without consumer types in the allocations response we have to rely on the resource class | |
| 15:52:58 | bauzas | exactly this, I was about to say which resource classes where nova-related | |
| 15:53:07 | bauzas | were* | |
| 15:53:53 | efried | ugh, relying on resource class... | |
| 15:54:04 | efried | this is where the concept of provider owner would be handy. | |
| 15:54:17 | bauzas | yeah I know | |
| 15:54:32 | efried | hopefully we're not allowing allocations from different owners against the same provider anywhere | |
| 15:54:39 | bauzas | we could also add an argument asking for the resource class we wanna check | |
| 15:54:49 | efried | no, we shouldn't do it by resource class | |
| 15:55:02 | efried | because same resource class may be managed by different owners in different providers | |
| 15:55:20 | efried | think VF (nova-PCI vs cyborg vs neutron) | |
| 15:55:53 | efried | but we (need to make sure we) have a rule that a provider as a whole is only managed by a single owner. | |
| 15:56:25 | bauzas | hmmm | |
| 15:57:10 | bauzas | actually, I'm checking consumer_id | |
| 15:57:39 | bauzas | so I guess all resource providers corresponding to compute nodes (and children associated) should have allocations against consumer_id that | |
| 15:57:53 | bauzas | that *is* either a migration object or a nova instance | |
| 15:58:05 | bauzas | even cyborg, right? | |
| 15:58:31 | openstackgerrit | Nate Johnston proposed openstack/nova stable/stein: [DNM] Test change to check for port/instance project mismatch https://review.opendev.org/667663 | |
| 15:59:20 | bauzas | efried: ^? | |
| 16:00:45 | efried | bauzas: If what you're looking to do is clean up allocations against orphaned instances, I think it's legit to remove all the allocations associated with that consumer, even if they're on providers you don't own. That's symmetrical with what we do when we schedule (we claim all of those atomically from nova). | |
| 16:00:51 | efried | and | |
| 16:01:14 | efried | if there's an allocation against a compute node RP, you can legitimately assume it's in that category | |
| 16:01:15 | efried | but | |
| 16:01:33 | efried | that will break eventually if we ever have resourceless roots | |
| 16:01:34 | efried | because | |
| 16:01:48 | efried | you can *not* assume that all children of the compute node RP *also* belong to nova. | |
| 16:01:51 | bauzas | baby steps here :) | |
| 16:02:06 | efried | yeah, just leave a note/todo I guess. | |
| 16:02:21 | bauzas | at least if I can support nested rps, it would be cool | |
| 16:02:58 | bauzas | because eg. VGPU allocations are still made *against* a consumer which is an instance, yeepee | |
| 16:03:54 | bauzas | but, that would mean I would look at all resource providers, not only the ones Nova owns | |
| 16:03:56 | efried | yeah, it would be 1) compute node => 2) compute node RP => 3) allocations against that RP => 4) consumer for that allocation => 5) filter down to orphan consumers => 6) allocations for those consumers => 7) delete all of those | |
| 16:03:59 | efried | no | |
| 16:04:00 | bauzas | and here comes pagination... | |
| 16:04:28 | efried | with the limitation noted above (stops working for resourceless roots, which we're a long way off of), the above process will get you there. | |
| 16:04:36 | efried | Step 1 done by paginating from the nova API. | |
| 16:04:50 | bauzas | cool then | |
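
The seven-step flow efried outlines at 16:03:56 could be sketched roughly as below. This is only an illustration under stated assumptions, not the nova-manage implementation: the function name and the three injected callables are hypothetical stand-ins. `list_rp_allocations` would wrap something like the list-resource-provider-allocations API mriedem linked, `consumer_exists` would check for a nova instance or migration with that UUID, and `delete_consumer_allocations` would remove every allocation held by the consumer. It also assumes the compute node UUID doubles as its resource provider UUID, and it carries the limitation noted in the discussion (it would stop working with resourceless roots).

```python
from typing import Callable, Dict, Iterable, List


def cleanup_orphaned_allocations(
    compute_node_uuids: Iterable[str],
    list_rp_allocations: Callable[[str], Dict[str, dict]],
    consumer_exists: Callable[[str], bool],
    delete_consumer_allocations: Callable[[str], None],
) -> List[str]:
    """Delete allocations whose consumer is unknown to nova.

    All three callables are hypothetical hooks supplied by the caller;
    none of them is a real nova or placement client method.
    """
    orphans = set()
    # Steps 1-2: walk the (already paginated) compute nodes, treating each
    # compute node UUID as its resource provider UUID (an assumption).
    for rp_uuid in compute_node_uuids:
        # Steps 3-4: allocations against that provider, keyed by consumer UUID.
        for consumer_uuid in list_rp_allocations(rp_uuid):
            # Step 5: keep only consumers with no matching instance/migration.
            if not consumer_exists(consumer_uuid):
                orphans.add(consumer_uuid)
    # Steps 6-7: remove everything the orphan holds, including allocations
    # on nested or non-nova providers, mirroring the atomic claim at
    # scheduling time.
    for consumer_uuid in sorted(orphans):
        delete_consumer_allocations(consumer_uuid)
    return sorted(orphans)
```

Passing the lookups in as callables keeps the sketch independent of the reportclient, so the caller can decide how to handle client errors (for example catching `ksa.ClientException` itself rather than relying on `@safe_connect`) and how to page through compute nodes.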