| Posted | Nick | Remark | |
|---|---|---|---|
| #openstack-nova - 2019-06-26 | | | |
| 15:42:22 | efried | mriedem: We don't have a way to prove the xen one is being hit, do we? (update_provider_tree) | |
| 15:42:25 | efried | since their CI is dead? | |
| 15:43:54 | efried | mriedem: also, if you haven't already, there should be a note to the ML warning of this (and another before we remove the code path, obvsly) | |
| 15:44:06 | efried | ...for oot folk | |
| 15:44:43 | mriedem | sorry was just doing tech support with my mom | |
| 15:44:53 | efried | (I know nova_powervm is copacetic fwiw) | |
| 15:45:09 | mriedem | i was waiting to send the oot ML email until we were more sure about what i've proposed | |
| 15:45:22 | mriedem | and idk about the xen one if their CI is dead, though it's pretty damn basic | |
| 15:45:25 | mriedem | just a port of get_inventory | |
| 15:46:05 | bauzas | efried: mriedem: heh, the reportclient doesn't of course support all placement API queries, so I wonder whether I should add something like "get_resource_providers()" method in the reportclient just for nova-manage caller, or calling directly the Placement API | |
| 15:46:12 | bauzas | thoughts on that ? | |
| 15:47:03 | efried | bauzas: If it's something simple like GET /resource_providers (you really want all of them?) then yeah, just call SchedulerReportClient.get() | |
| 15:47:17 | bauzas | zactly | |
| 15:47:57 | efried | sfine | |
| 15:48:06 | bauzas | efried: but then I don't have a safe_connect connection | |
| 15:48:14 | mriedem | if you're not going to page, you could be listing 14K providers in the case of cern... | |
| 15:48:15 | efried | bauzas: We don't want @safe_connect | |
| 15:48:19 | efried | ever, anywhere | |
| 15:48:30 | efried | Handle ksa.ClientException at the caller instead. | |
| 15:48:45 | efried | And if you see @safe_connect anywhere in your travels and want to kill it and do that ^, I will buy your drivks. | |
| 15:48:47 | efried | drinks | |
| 15:48:58 | efried | true story | |
| 15:49:08 | bauzas | it's 40°C here, I'm all for a drink | |
| 15:49:15 | efried | bauzas: what are you trying to do with the master list? | |
| 15:49:32 | bauzas | efried: looking up all allocations to see whether they're orphaned | |
| 15:49:39 | bauzas | mriedem: ah shit, excellent point | |
| 15:50:08 | mriedem | you could instead page the compute nodes in the cells and hit this api https://developer.openstack.org/api-ref/placement/?expanded=#list-resource-provider-allocations | |
| 15:50:13 | bauzas | we could possibly need to look at all allocations per resource provider, which would be given by a list of compute services (which is paginated AFAIK) | |
| 15:50:31 | bauzas | heh, jinxed | |
| 15:50:32 | mriedem | compute service != compute node == resource provider | |
| 15:50:43 | bauzas | shit, typo, nodes indeed | |
| 15:50:52 | bauzas | tell me about my Kilo bp | |
| 15:51:38 | mriedem | so once you get the allocations for a given provider, what are you going to do? | |
| 15:51:50 | mriedem | check if an instance (or migration) exists with the given consumer uuid? | |
| 15:51:55 | mriedem | and if not, consider the allocation orphaned? | |
| 15:52:11 | mriedem | iff the allocation has resources that nova "owns" like VCPU | |
| 15:52:26 | mriedem | without consumer types in the allocations response we have to rely on the resource class | |
| 15:52:58 | bauzas | exactly this, I was about to say which resource classes where nova-related | |
| 15:53:07 | bauzas | were* | |
| 15:53:53 | efried | ugh, relying on resource class... | |
| 15:54:04 | efried | this is where the concept of provider owner would be handy. | |
| 15:54:17 | bauzas | yeah I know | |
| 15:54:32 | efried | hopefully we're not allowing allocations from different owners against the same provider anywhere | |
| 15:54:39 | bauzas | we could also add an argument asking for the resource class we wanna check | |
| 15:54:49 | efried | no, we shouldn't do it by resource class | |
| 15:55:02 | efried | because same resource class may be managed by different owners in different providers | |
| 15:55:20 | efried | think VF (nova-PCI vs cyborg vs neutron) | |
| 15:55:53 | efried | but we (need to make sure we) have a rule that a provider as a whole is only managed by a single owner. | |
| 15:56:25 | bauzas | hmmm | |
| 15:57:10 | bauzas | actually, I'm checking consumer_id | |
| 15:57:39 | bauzas | so I guess all resource providers corresponding to compute nodes (and children associated) should have allocations against consumer_id that | |
| 15:57:53 | bauzas | that *is* either a migration object or a nova instance | |
| 15:58:05 | bauzas | even cyborg, right? | |
| 15:58:31 | openstackgerrit | Nate Johnston proposed openstack/nova stable/stein: [DNM] Test change to check for port/instance project mismatch https://review.opendev.org/667663 | |
| 15:59:20 | bauzas | efried: ^? | |
| 16:00:45 | efried | bauzas: If what you're looking to do is clean up allocations against orphaned instances, I think it's legit to remove all the allocations associated with that consumer, even if they're on providers you don't own. That's symmetrical with what we do when we schedule (we claim all of those atomically from nova). | |
| 16:00:51 | efried | and | |
| 16:01:14 | efried | if there's an allocation against a compute node RP, you can legitimately assume it's in that category | |
| 16:01:15 | efried | but | |
| 16:01:33 | efried | that will break eventually if we ever have resourceless roots | |
| 16:01:34 | efried | because | |
| 16:01:48 | efried | you can *not* assume that all children of the compute node RP *also* belong to nova. | |
| 16:01:51 | bauzas | baby steps here :) | |
| 16:02:06 | efried | yeah, just leave a note/todo I guess. | |
| 16:02:21 | bauzas | at least if I can support nested rps, it would be cool | |
| 16:02:58 | bauzas | because eg. VGPU allocations are still made *against* a consumer which is an instance, yeepee | |
| 16:03:54 | bauzas | but, that would mean I would look at all resource providers, not only the ones Nova owns | |
| 16:03:56 | efried | yeah, it would be 1) compute node => 2) compute node RP => 3) allocations against that RP => 4) consumer for that allocation => 5) filter down to orphan consumers => 6) allocations for those consumers => 7) delete all of those | |
| 16:03:59 | efried | no | |
| 16:04:00 | bauzas | and here comes pagination... | |
| 16:04:28 | efried | with the limitation noted above (stops working for resourceless roots, which we're a long way off of), the above process will get you there. | |
| 16:04:36 | efried | Step 1 done by paginating from the nova API. | |
| 16:04:50 | bauzas | cool then | |
| 16:05:04 | efried | this is in a nova-manage type utility? | |
| 16:05:14 | efried | So we don't care that it'll take FOREVER to run at cern? | |
| 16:05:30 | bauzas | a nova-manage placement audit thing | |
| 16:05:38 | efried | mm | |
| 16:05:39 | bauzas | so a cron job basically | |
| 16:05:49 | bauzas | marker and the likes | |
| 16:06:00 | efried | mm | |
| 16:06:04 | bauzas | zactly like heal_allocations | |
| 16:06:13 | efried | sure would be nice to find a way to make it more efficient then. | |
| 16:06:31 | mriedem | heal_allocations doesn't have a marker | |
| 16:06:38 | efried | but: make it, make it right, make it fast | |
| 16:06:41 | mriedem | it has a limit of things to process | |
| 16:06:55 | mriedem | nor does heal_allocations deal with nested allocations | |
| 16:08:42 | mriedem | the audit command could also take a --consumer option to just investigate what the operator thinks is a problem instance/migration | |
| 16:09:00 | mriedem | note that i added --instance to heal_allocations later for that reason | |
| 16:09:07 | bauzas | yup I saw | |
| 16:09:10 | mriedem | and --dry-run | |
| 16:09:31 | mriedem | depends on what the command will do though, if it's just reporting then you don't need a --dry-run | |
| 16:11:20 | bauzas | I was thinking of just telling the orphaned, but then later adding a --remove option | |
| 16:11:33 | bauzas | *later* | |
| 16:12:35 | bauzas | anyway, needs to go off and run by hot summer nights | |
| 16:13:03 | bauzas | I think I have everything I needed, thanks folks | |
| 16:14:57 | dansmith | hoo boy | |
| 16:15:23 | mriedem | strictly adult cheese, wine and things of that nature | |
| 16:16:35 | melwitt | now, for another fun topic | |
| 16:17:02 | melwitt | mriedem, dansmith: I was reading these comments on an old [unmerged] patch: https://review.opendev.org/#/c/462521/12/releasenotes/notes/resize-auto-revert-6e1648828aba16b2.yaml@5, | |
| 16:17:31 | melwitt | and it made me think of the [recently merged] patch: https://review.opendev.org/633227 again and how it changed ERROR state to ACTIVE (or STOPPED) state. now I'm worried that wasn't an ok thing to do (API change?) | |
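
The orphaned-allocation audit that efried outlines at 16:03:56 (compute node RP, then allocations against that RP, then their consumers, then the orphans among them, then deletion) can be sketched roughly as below. This is a minimal illustration, not the eventual nova-manage command: the function name and the three injected callables are hypothetical stand-ins for "list allocations on a provider", "does an instance or migration with this UUID exist", and "delete all allocations for a consumer". The per-provider mapping is assumed to be keyed by consumer UUID. Paging over compute nodes (step 1) is left to the caller, and the resourceless-roots limitation noted in the log still applies.

```python
from typing import Callable, Dict, Iterable, List, Optional

# Assumed shape of one provider's allocations, keyed by consumer UUID:
#   {consumer_uuid: {"resources": {resource_class: amount}}}
ProviderAllocations = Dict[str, Dict[str, Dict[str, int]]]


def audit_orphaned_allocations(
    compute_node_rp_uuids: Iterable[str],
    get_provider_allocations: Callable[[str], ProviderAllocations],
    consumer_exists: Callable[[str], bool],
    delete_allocations: Optional[Callable[[str], None]] = None,
) -> List[str]:
    """Return consumer UUIDs that hold allocations but no longer exist.

    All three callables are hypothetical hooks supplied by the caller.
    With delete_allocations=None the audit only reports, which matches
    the "report first, add --remove later" plan from the discussion.
    """
    checked = set()
    orphans = []
    for rp_uuid in compute_node_rp_uuids:
        # Steps 3-4: allocations against this provider and their consumers.
        for consumer_uuid in get_provider_allocations(rp_uuid):
            if consumer_uuid in checked:
                continue  # a consumer can appear on several providers
            checked.add(consumer_uuid)
            # Step 5: keep only consumers with no instance or migration.
            if not consumer_exists(consumer_uuid):
                orphans.append(consumer_uuid)

    if delete_allocations is not None:
        # Steps 6-7: remove every allocation held by each orphan consumer,
        # including those on providers other than the compute node RP.
        for consumer_uuid in orphans:
            delete_allocations(consumer_uuid)
    return sorted(orphans)
```

Passing the lookups in as callables keeps the sketch independent of how the report client or the cell databases are actually queried, and a `--consumer`-style option would amount to checking a single UUID instead of walking every provider.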