Posted Nick Remark
#openstack-nova - 2019-06-26
15:44:43 mriedem sorry was just doing tech support with my mom
15:44:53 efried (I know nova_powervm is copacetic fwiw)
15:45:09 mriedem i was waiting to send the oot ML email until we were more sure about what i've proposed
15:45:22 mriedem and idk about the xen one if their CI is dead, though it's pretty damn basic
15:45:25 mriedem just a port of get_inventory
15:46:05 bauzas efried: mriedem: heh, the reportclient doesn't of course support all placement API queries, so I wonder whether I should add something like "get_resource_providers()" method in the reportclient just for nova-manage caller, or calling directly the Placement API
15:46:12 bauzas thoughts on that ?
15:47:03 efried bauzas: If it's something simple like GET /resource_providers (you really want all of them?) then yeah, just call SchedulerReportClient.get()
15:47:17 bauzas zactly
15:47:57 efried sfine
15:48:06 bauzas efried: but then I don't have a safe_connect connection
15:48:14 mriedem if you're not going to page, you could be listing 14K providers in the case of cern...
15:48:15 efried bauzas: We don't want @safe_connect
15:48:19 efried ever, anywhere
15:48:30 efried Handle ksa.ClientException at the caller instead.
15:48:45 efried And if you see @safe_connect anywhere in your travels and want to kill it and do that ^, I will buy your drivks.
15:48:47 efried drinks
15:48:58 efried true story
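
A minimal sketch of what efried suggests above: let the client call raise, and catch the error at the call site rather than wrapping the method in a decorator that swallows it. Everything here is a stand-in. `ClientException` only mimics the keystoneauth1 exception base class, and `FakeReportClient` and `list_providers_or_exit` are hypothetical names, not nova code.

```python
class ClientException(Exception):
    """Stand-in for keystoneauth1's ClientException (assumption, not the real class)."""


class FakeReportClient:
    """Hypothetical client whose get() raises instead of quietly returning None."""

    def __init__(self, fail=False):
        self.fail = fail

    def get(self, url):
        if self.fail:
            raise ClientException("placement unreachable")
        # Canned response; a real call would go to the Placement API.
        return {"resource_providers": [{"uuid": "rp-1"}]}


def list_providers_or_exit(client):
    """The caller decides what a failure means: here, a non-zero return code."""
    try:
        body = client.get("/resource_providers")
    except ClientException as exc:
        print("Placement API call failed: %s" % exc)
        return 1, []
    return 0, body["resource_providers"]
```

With this shape a command-line caller can turn the failure into an exit code and a message, which a decorator that logs and returns None would hide.
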
15:49:08 bauzas it's 40°C here, I'm all for a drink
15:49:15 efried bauzas: what are you trying to do with the master list?
15:49:32 bauzas efried: looking up all allocations to see whether they're orphaned
15:49:39 bauzas mriedem: ah shit, excellent point
15:50:08 mriedem you could instead page the compute nodes in the cells and hit this api https://developer.openstack.org/api-ref/placement/?expanded=#list-resource-provider-allocations
15:50:13 bauzas we could possibly need to look at all allocations per resource provider, which would be given by a list of compute services (which is paginated AFAIK)
15:50:31 bauzas heh, jinxed
15:50:32 mriedem compute service != compute node == resource provider
15:50:43 bauzas shit, typo, nodes indeed
15:50:52 bauzas tell me about my Kilo bp
15:51:38 mriedem so once you get the allocations for a given provider, what are you going to do?
15:51:50 mriedem check if an instance (or migration) exists with the given consumer uuid?
15:51:55 mriedem and if not, consider the allocation orphaned?
15:52:11 mriedem iff the allocation has resources that nova "owns" like VCPU
15:52:26 mriedem without consumer types in the allocations response we have to rely on the resource class
15:52:58 bauzas exactly this, I was about to say which resource classes where nova-related
15:53:07 bauzas were*
15:53:53 efried ugh, relying on resource class...
15:54:04 efried this is where the concept of provider owner would be handy.
15:54:17 bauzas yeah I know
15:54:32 efried hopefully we're not allowing allocations from different owners against the same provider anywhere
15:54:39 bauzas we could also add an argument asking for the resource class we wanna check
15:54:49 efried no, we shouldn't do it by resource class
15:55:02 efried because same resource class may be managed by different owners in different providers
15:55:20 efried think VF (nova-PCI vs cyborg vs neutron)
15:55:53 efried but we (need to make sure we) have a rule that a provider as a whole is only managed by a single owner.
15:56:25 bauzas hmmm
15:57:10 bauzas actually, I'm checking consumer_id
15:57:39 bauzas so I guess all resource providers corresponding to compute nodes (and children associated) should have allocations against consumer_id that
15:57:53 bauzas that *is* either a migration object or a nova instance
15:58:05 bauzas even cyborg, right?
15:58:31 openstackgerrit Nate Johnston proposed openstack/nova stable/stein: [DNM] Test change to check for port/instance project mismatch https://review.opendev.org/667663
15:59:20 bauzas efried: ^?
16:00:45 efried bauzas: If what you're looking to do is clean up allocations against orphaned instances, I think it's legit to remove all the allocations associated with that consumer, even if they're on providers you don't own. That's symmetrical with what we do when we schedule (we claim all of those atomically from nova).
16:00:51 efried and
16:01:14 efried if there's an allocation against a compute node RP, you can legitimately assume it's in that category
16:01:15 efried but
16:01:33 efried that will break eventually if we ever have resourceless roots
16:01:34 efried because
16:01:48 efried you can *not* assume that all children of the compute node RP *also* belong to nova.
16:01:51 bauzas baby steps here :)
16:02:06 efried yeah, just leave a note/todo I guess.
16:02:21 bauzas at least if I can support nested rps, it would be cool
16:02:58 bauzas because eg. VGPU allocations are still made *against* a consumer which is an instance, yeepee
16:03:54 bauzas but, that would mean I would look at all resource providers, not only the ones Nova owns
16:03:56 efried yeah, it would be 1) compute node => 2) compute node RP => 3) allocations against that RP => 4) consumer for that allocation => 5) filter down to orphan consumers => 6) allocations for those consumers => 7) delete all of those
16:03:59 efried no
16:04:00 bauzas and here comes pagination...
16:04:28 efried with the limitation noted above (stops working for resourceless roots, which we're a long way off of), the above process will get you there.
16:04:36 efried Step 1 done by paginating from the nova API.
16:04:50 bauzas cool then
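
The seven steps efried lists above can be sketched roughly as follows, using plain dicts in place of the nova and Placement APIs. The function names and the data layout are hypothetical; in a real tool, steps 3 and 6 would presumably map to GET /resource_providers/{uuid}/allocations and GET /allocations/{consumer_uuid}, and step 7 to DELETE /allocations/{consumer_uuid}.

```python
def find_orphaned_consumers(compute_node_rps, allocations_by_rp, known_consumers):
    """Steps 2-5: consumers with allocations on a compute node RP that
    match no instance or migration nova knows about."""
    orphans = set()
    for rp_uuid in compute_node_rps:
        for consumer in allocations_by_rp.get(rp_uuid, {}):
            if consumer not in known_consumers:
                orphans.add(consumer)
    return orphans


def allocations_to_delete(orphans, allocations_by_rp):
    """Steps 6-7: every allocation held by an orphan, on any provider,
    including providers nova does not own."""
    doomed = []
    for rp_uuid, allocs in allocations_by_rp.items():
        for consumer, resources in allocs.items():
            if consumer in orphans:
                doomed.append((rp_uuid, consumer, resources))
    return sorted(doomed, key=lambda item: item[:2])


# Made-up data: one compute node RP, one child RP, one unrelated RP.
compute_node_rps = ["cn-1"]
allocations_by_rp = {
    "cn-1": {"inst-a": {"VCPU": 2}, "gone-b": {"VCPU": 1}},
    "gpu-child": {"gone-b": {"VGPU": 1}},
    "other-rp": {"vol-c": {"DISK_GB": 10}},
}
known_consumers = {"inst-a"}  # instances and migrations that still exist

orphans = find_orphaned_consumers(compute_node_rps, allocations_by_rp, known_consumers)
doomed = allocations_to_delete(orphans, allocations_by_rp)
```

Note that "vol-c" is left alone: it never shows up on a compute node RP, so it is never considered. That is also the limitation mentioned above, since a resourceless root would have no allocations to start from.
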
16:05:04 efried this is in a nova-manage type utility?
16:05:14 efried So we don't care that it'll take FOREVER to run at cern?
16:05:30 bauzas a nova-manage placement audit thing
16:05:38 efried mm
16:05:39 bauzas so a cron job basically
16:05:49 bauzas marker and the likes
16:06:00 efried mm
16:06:04 bauzas zactly like heal_allocations
16:06:13 efried sure would be nice to find a way to make it more efficient then.
16:06:31 mriedem heal_allocations doesn't have a marker
16:06:38 efried but: make it, make it right, make it fast
16:06:41 mriedem it has a limit of things to process
16:06:55 mriedem nor does heal_allocations deal with nested allocations
16:08:42 mriedem the audit command could also take a --consumer option to just investigate what the operator thinks is a problem instance/migration
16:09:00 mriedem note that i added --instance to heal_allocations later for that reason
16:09:07 bauzas yup I saw
16:09:10 mriedem and --dry-run
16:09:31 mriedem depends on what the command will do though, if it's just reporting then you don't need a --dry-run
16:11:20 bauzas I was thinking of just telling the orphaned, but then later adding a --remove option
16:11:33 bauzas *later*
16:12:35 bauzas anyway, needs to go off and run by hot summer nights
16:13:03 bauzas I think I have everything I needed, thanks folks
16:14:57 dansmith hoo boy
16:15:23 mriedem strictly adult cheese, wine and things of that nature
16:16:35 melwitt now, for another fun topic
16:17:02 melwitt mriedem, dansmith: I was reading these comments on an old [unmerged] patch: https://review.opendev.org/#/c/462521/12/releasenotes/notes/resize-auto-revert-6e1648828aba16b2.yaml@5,
16:17:31 melwitt and it made me think of the [recently merged] patch: https://review.opendev.org/633227 again and how it changed ERROR state to ACTIVE (or STOPPED) state. now I'm worried that wasn't an ok thing to do (API change?)
16:18:12 melwitt for a failed cold migration to self
16:18:21 mriedem not the same
16:18:27 mriedem in my change,
16:18:35 mriedem we failed in prep_resize before we actually did anything to the guest
