| Posted | Nick | Remark | |
|---|---|---|---|
| #openstack-nova - 2019-06-26 | |||
| 19:09:11 | sean-k-mooney | its tricky however as from the domain we dont know what nova create or not but i think we should at lesast try to do some cleanup | |
| 19:12:52 | sean-k-mooney | anyway im going to get something to eat | |
| 19:13:05 | sean-k-mooney | o/ | |
| 19:19:29 | mriedem | there is meta in the domain that tells us if nova created it | |
| 19:19:42 | sean-k-mooney | yes there is | |
| 19:19:48 | mriedem | for libvirt anyway | |
| 19:20:05 | sean-k-mooney | which i asked yonglihe to check when deleteing the domian | |
| 19:20:13 | sean-k-mooney | in the libvirt case that is | |
| 19:20:48 | efried | mriedem: +1 on the ML note, thanks for that | |
| 19:21:24 | mriedem | np | |
| 19:22:36 | sean-k-mooney | that is what https://review.opendev.org/#/c/648912/14/nova/virt/libvirt/driver.py@9695 is doing and its used to filer the domains here https://review.opendev.org/#/c/627765/28/nova/compute/manager.py@8886 | |
| 19:59:33 | efried | mriedem: took me three days to go through all the specs, but http://lists.openstack.org/pipermail/openstack-discuss/2019-June/007381.html | |
| 20:10:01 | mriedem | efried: huh https://review.opendev.org/#/q/project:openstack/nova-specs+status:open+path:%255Especs/train/approved/.*+reviewedby:self seems to not work properly, it's only showing me https://review.opendev.org/#/c/631154/ but i've clearly commented on https://review.opendev.org/#/c/648686/ as well | |
| 20:10:17 | mriedem | maybe reviewedby is only on the latest PS? | |
| 20:10:55 | mriedem | aha | |
| 20:10:56 | mriedem | https://review.opendev.org/#/q/project:openstack/nova-specs+status:open+path:%255Especs/train/approved/.*+reviewer:self | |
| 20:11:00 | mriedem | reviewer:self, not reviewedby:self | |
| 20:14:14 | efried | whoops, thanks | |
| 21:31:07 | melwitt | mnaser, imacdonn: hi, as responders to the ML thread awhile back, I have a spec up for showing server status as UNKNOWN if host status is UNKNOWN that has been receiving some reviews. your reviews would be helpful for deciding whether it goes forward https://review.opendev.org/666181 | |
| 21:33:10 | mriedem | dansmith: you may want to drop your +2 to a +1 or 0 https://review.opendev.org/#/c/457886/ until i get the ceph job results on it | |
| 21:41:44 | openstackgerrit | Miguel Ángel Herranz Trillo proposed openstack/nova master: Fix type error on call to mount device https://review.opendev.org/659780 | |
| 21:44:11 | mriedem | dansmith: nvm, lyarwood already had a patch up to test that | |
| 22:14:40 | mriedem | efried: are you ok with me just pushing up this test change and +2ing? https://review.opendev.org/#/c/659780/3/nova/tests/unit/virt/disk/mount/test_nbd.py | |
| 22:14:51 | efried | ... | |
| 22:15:58 | efried | mriedem: yes, and I'll +A | |
| 22:16:52 | openstackgerrit | Matt Riedemann proposed openstack/nova master: Fix type error on call to mount device https://review.opendev.org/659780 | |
| 22:17:37 | mriedem | done | |
| 22:18:58 | efried | +A | |
| 22:25:39 | mnaser | melwitt: left a comment thanks :D | |
| 22:43:44 | melwitt | thanks | |
| #openstack-nova - 2019-06-27 | |||
| 13:13:51 | gibi | mriedem: the previous patches just refactored the heal allocation code path | |
| 13:14:10 | gibi | to make it easy to add the port healing | |
| 13:15:30 | gibi | mriedem: the basic structure of final patch | |
| 13:18:00 | gibi | mriedem: during _heal_allocations_for_instance first we get if the instance has ports that needs healing | |
| 13:18:26 | gibi | this _get_port_allocations_to_heal is the majority of the added code there | |
| 13:19:01 | gibi | as it detect is there is any port that needs healing and generates the allocation fragment for that port | |
| 13:19:33 | mriedem | ok i'm assuming just checking if the port has a resource request (if hitting neutron directly) or checking the info cache for a bound port with the allocatoin in the binding profile? | |
| 13:19:52 | gibi | we are hitting neutron directly | |
| 13:20:02 | mriedem | hmm, that's kind of expensive isn't it? | |
| 13:20:19 | mriedem | why can't we use the info cache to tell if the vif has a port allocation? | |
| 13:20:49 | mriedem | or is that some kind of chicken and egg issue? or we don't trust the cache? | |
| 13:20:55 | gibi | we need to hit neutron to see if there is resource request on the port | |
| 13:21:00 | gibi | that is not part of the vif cache | |
| 13:21:23 | mriedem | sure but the binding profile in the vif cache has a thing that tells us if it had a resource request right? | |
| 13:21:44 | gibi | mriedem: now, the binding profile in the cache only could tell us if there is allocation for the port | |
| 13:21:52 | openstackgerrit | Eric Fried proposed openstack/nova-specs master: WIP: Physical TPM passthrough https://review.opendev.org/667926 | |
| 13:21:57 | gibi | but if there is no allocation then we don't know if there is a need for such allocation | |
| 13:22:48 | mriedem | so let's say the binding profile in the vif cache doesn't have that allocation marker, but we find a port with a resource request - we're going to heal the allocation by putting allocations to placement for that instance right? | |
| 13:23:03 | mriedem | are we also going to update the cache to indicate it's now got an allocatoin? | |
| 13:23:14 | gibi | mriedem: right, and also updating neutron with the rp uuid | |
| 13:23:27 | gibi | mriedem: hm, we are not updating the info cache | |
| 13:23:29 | gibi | :/ | |
| 13:24:16 | mriedem | yeah, so i'm wondering about the scenario that our cache is busted, we go tell placement there is an allocation, but then what happens when we teardown that port/server in the compute service? are we basing that on the cache? b/c if so, we'll leak the allocation correcT? | |
| 13:24:25 | openstackgerrit | Eric Fried proposed openstack/nova master: WIP: Physical TPM passthrough https://review.opendev.org/667928 | |
| 13:25:07 | mriedem | similarly are we only creating the port allocation in heal_allocations if the port is bound? | |
| 13:25:13 | gibi | mriedem: we cannot leak allocation during delete as delete deletes everything for the consumer | |
| 13:25:28 | mriedem | gibi: what about when we detach the port though? | |
| 13:25:31 | mriedem | but don't delete the server | |
| 13:26:31 | gibi | I hope that code path works based on the allocation key in the binding profile, but let me check | |
| 13:26:57 | mriedem | https://github.com/openstack/nova/blob/7b769ad403751268c60b095f722437cbed692071/nova/network/neutronv2/api.py#L1672 | |
| 13:27:00 | mriedem | looks like it does... | |
| 13:27:08 | gibi | mriedem: in the other hand, is there any way to trigger a info_cache refresh or we need to do it manually? | |
| 13:28:33 | mriedem | there is a periodic task in the compute service that will forcefully rebuild the cache from info in neutron, | |
| 13:28:44 | mriedem | but i don't think that code is doing anything with the resource request / allocation information yet | |
| 13:28:55 | mriedem | might have been something i brought up when reviewing the series in stein | |
| 13:29:30 | mriedem | this is the call from the compute task https://github.com/openstack/nova/blob/7b769ad403751268c60b095f722437cbed692071/nova/compute/manager.py#L7445 | |
| 13:30:20 | mriedem | this is the neutronv2 api method that builds the vif object in the cache https://github.com/openstack/nova/blob/7b769ad403751268c60b095f722437cbed692071/nova/network/neutronv2/api.py#L2842 | |
| 13:30:35 | mriedem | we copy the binding:profile directly off the port https://github.com/openstack/nova/blob/7b769ad403751268c60b095f722437cbed692071/nova/network/neutronv2/api.py#L2887 | |
| 13:31:23 | mriedem | maybe that's enough? the 'allocation' is in the binding:profile yeah? and that maps to the resource provider uuid on which the allocation belongs | |
| 13:31:53 | gibi | copying binding:profile is enough as it has the 'allocaton' key | |
| 13:32:06 | gibi | and that points to the RP | |
| 13:32:10 | gibi | we allocate from | |
| 13:32:54 | gibi | mriedem: this is where the heal port allocation makes sure that neutron binding:profile is updated https://review.opendev.org/#/c/637955/28/nova/cmd/manage.py@1857 | |
| 13:33:19 | openstackgerrit | Merged openstack/nova stable/stein: libvirt: Rework 'EBUSY' (SIGKILL) error handling code path https://review.opendev.org/667389 | |
| 13:34:22 | mriedem | ok so going back to my original question, cache vs neutron directly, i guess heal_allocations is going to the source of truth first (neutron) and working based on that, which is more fool-proof if possibly less performant | |
| 13:34:53 | mriedem | if we healed based on the cache, and the cache was stale or missing information somehow, then we would fail to heal allocations for the ports in that busted cache | |
| 13:35:16 | gibi | mriedem: there is no way to figure out what to heal if we don't hit neutron. the info cache only good to see if there is allocation or not for a port, but only neutron knows if the port needs allocation | |
| 13:35:19 | mriedem | with the way you have it, heal_allocations would heal the allocations and the port binding profile and the _heal_instance_info_cache task in the compute service running that instance would heal the cache | |
| 13:35:31 | mriedem | sure, i realize that, | |
| 13:35:40 | mriedem | i was thinking of using the cache to pre-filter the list of ports we're asking neutron about | |
| 13:36:05 | mriedem | i.e. if we're iterating over 50 intances that's 50 calls to neutron | |
| 13:36:21 | gibi | mriedem: you mean if we see allocaton key in the cache then we assume that such port doesn't need to be healed? | |
| 13:36:22 | mriedem | i do like that you added an option to skip this though | |
| 13:37:04 | mriedem | gibi: something like that | |
| 13:37:41 | mriedem | if we would have stored some flag in the vif cache originally that indicated the port had resource requests (not necessarily the resource request itself), heal_allocations could have just checked that flag in the cache and then call neutron for that port | |
| 13:38:28 | mriedem | and if ifs and buts were candy and nuts we'd all have a very fine christmas | |
| 13:38:55 | gibi | mriedem: these vifs are created _before_ there was any logic in nova about bandwidth resource. So I don't see how to have a flag in nova about it | |
| 13:39:20 | mriedem | oh right this is also for healing the case of ports attached with qos before stein... | |
| 13:39:22 | mriedem | i forgot about that | |
| 13:39:41 | gibi | this is the case exaclty that the port was attached without allocation | |
| 13:39:54 | mriedem | anyway, i'm trying to optimize something that we just want to work on the first go, so i can drop it | |
| 13:39:57 | gibi | because neutron had support way earlier for bandiwdth | |
| 13:40:50 | mriedem | the other good thing is that heal_allocations now has the option to heal a single instance so if an operator is trying to fix one specific instance (b/c of some user ticket or something) then they can just target that one rather than all instances to heal 1 | |
| 13:41:40 | gibi | yeah, I remember I had to adapt to that improvement at some point | |
| 13:42:52 | gibi | one extra complexity to note. as we need to update both placement and neutron the update cannot be made atomic | |
| 13:43:04 | mriedem | and i'm assuming the earlier patches to refactor the code didn't really require changes to the functional tests the validate the interface (unlike you'd need with unit tests) | |
| 13:43:31 | mriedem | i wrote all of the heal_allocations stuff with functional tests b/c i didn't trust unit tests for testing the interface | |
| 13:43:46 | gibi | mriedem: yes, no functional test was harmed in the refactoring process | |
| 13:43:52 | mriedem | if we fail to update one (placement or neutron) we could rollback the other... | |
| 13:44:03 | mriedem | heh, "no functional test was harmed in the making of this patch" | |