| Posted | Nick | Remark | |
|---|---|---|---|
| #openstack-nova - 2019-06-26 | |||
| 22:14:51 | efried | ... | |
| 22:15:58 | efried | mriedem: yes, and I'll +A | |
| 22:16:52 | openstackgerrit | Matt Riedemann proposed openstack/nova master: Fix type error on call to mount device https://review.opendev.org/659780 | |
| 22:17:37 | mriedem | done | |
| 22:18:58 | efried | +A | |
| 22:25:39 | mnaser | melwitt: left a comment thanks :D | |
| 22:43:44 | melwitt | thanks | |
| #openstack-nova - 2019-06-27 | |||
| 13:13:51 | gibi | mriedem: the previous patches just refactored the heal allocation code path | |
| 13:14:10 | gibi | to make it easy to add the port healing | |
| 13:15:30 | gibi | mriedem: the basic structure of final patch | |
| 13:18:00 | gibi | mriedem: during _heal_allocations_for_instance first we get if the instance has ports that needs healing | |
| 13:18:26 | gibi | this _get_port_allocations_to_heal is the majority of the added code there | |
| 13:19:01 | gibi | as it detect is there is any port that needs healing and generates the allocation fragment for that port | |
| 13:19:33 | mriedem | ok i'm assuming just checking if the port has a resource request (if hitting neutron directly) or checking the info cache for a bound port with the allocatoin in the binding profile? | |
| 13:19:52 | gibi | we are hitting neutron directly | |
| 13:20:02 | mriedem | hmm, that's kind of expensive isn't it? | |
| 13:20:19 | mriedem | why can't we use the info cache to tell if the vif has a port allocation? | |
| 13:20:49 | mriedem | or is that some kind of chicken and egg issue? or we don't trust the cache? | |
| 13:20:55 | gibi | we need to hit neutron to see if there is resource request on the port | |
| 13:21:00 | gibi | that is not part of the vif cache | |
| 13:21:23 | mriedem | sure but the binding profile in the vif cache has a thing that tells us if it had a resource request right? | |
| 13:21:44 | gibi | mriedem: now, the binding profile in the cache only could tell us if there is allocation for the port | |
| 13:21:52 | openstackgerrit | Eric Fried proposed openstack/nova-specs master: WIP: Physical TPM passthrough https://review.opendev.org/667926 | |
| 13:21:57 | gibi | but if there is no allocation then we don't know if there is a need for such allocation | |
| 13:22:48 | mriedem | so let's say the binding profile in the vif cache doesn't have that allocation marker, but we find a port with a resource request - we're going to heal the allocation by putting allocations to placement for that instance right? | |
| 13:23:03 | mriedem | are we also going to update the cache to indicate it's now got an allocatoin? | |
| 13:23:14 | gibi | mriedem: right, and also updating neutron with the rp uuid | |
| 13:23:27 | gibi | mriedem: hm, we are not updating the info cache | |
| 13:23:29 | gibi | :/ | |
| 13:24:16 | mriedem | yeah, so i'm wondering about the scenario that our cache is busted, we go tell placement there is an allocation, but then what happens when we teardown that port/server in the compute service? are we basing that on the cache? b/c if so, we'll leak the allocation correcT? | |
| 13:24:25 | openstackgerrit | Eric Fried proposed openstack/nova master: WIP: Physical TPM passthrough https://review.opendev.org/667928 | |
| 13:25:07 | mriedem | similarly are we only creating the port allocation in heal_allocations if the port is bound? | |
| 13:25:13 | gibi | mriedem: we cannot leak allocation during delete as delete deletes everything for the consumer | |
| 13:25:28 | mriedem | gibi: what about when we detach the port though? | |
| 13:25:31 | mriedem | but don't delete the server | |
| 13:26:31 | gibi | I hope that code path works based on the allocation key in the binding profile, but let me check | |
| 13:26:57 | mriedem | https://github.com/openstack/nova/blob/7b769ad403751268c60b095f722437cbed692071/nova/network/neutronv2/api.py#L1672 | |
| 13:27:00 | mriedem | looks like it does... | |
| 13:27:08 | gibi | mriedem: in the other hand, is there any way to trigger a info_cache refresh or we need to do it manually? | |
| 13:28:33 | mriedem | there is a periodic task in the compute service that will forcefully rebuild the cache from info in neutron, | |
| 13:28:44 | mriedem | but i don't think that code is doing anything with the resource request / allocation information yet | |
| 13:28:55 | mriedem | might have been something i brought up when reviewing the series in stein | |
| 13:29:30 | mriedem | this is the call from the compute task https://github.com/openstack/nova/blob/7b769ad403751268c60b095f722437cbed692071/nova/compute/manager.py#L7445 | |
| 13:30:20 | mriedem | this is the neutronv2 api method that builds the vif object in the cache https://github.com/openstack/nova/blob/7b769ad403751268c60b095f722437cbed692071/nova/network/neutronv2/api.py#L2842 | |
| 13:30:35 | mriedem | we copy the binding:profile directly off the port https://github.com/openstack/nova/blob/7b769ad403751268c60b095f722437cbed692071/nova/network/neutronv2/api.py#L2887 | |
| 13:31:23 | mriedem | maybe that's enough? the 'allocation' is in the binding:profile yeah? and that maps to the resource provider uuid on which the allocation belongs | |
| 13:31:53 | gibi | copying binding:profile is enough as it has the 'allocaton' key | |
| 13:32:06 | gibi | and that points to the RP | |
| 13:32:10 | gibi | we allocate from | |
| 13:32:54 | gibi | mriedem: this is where the heal port allocation makes sure that neutron binding:profile is updated https://review.opendev.org/#/c/637955/28/nova/cmd/manage.py@1857 | |
| 13:33:19 | openstackgerrit | Merged openstack/nova stable/stein: libvirt: Rework 'EBUSY' (SIGKILL) error handling code path https://review.opendev.org/667389 | |
| 13:34:22 | mriedem | ok so going back to my original question, cache vs neutron directly, i guess heal_allocations is going to the source of truth first (neutron) and working based on that, which is more fool-proof if possibly less performant | |
| 13:34:53 | mriedem | if we healed based on the cache, and the cache was stale or missing information somehow, then we would fail to heal allocations for the ports in that busted cache | |
| 13:35:16 | gibi | mriedem: there is no way to figure out what to heal if we don't hit neutron. the info cache only good to see if there is allocation or not for a port, but only neutron knows if the port needs allocation | |
| 13:35:19 | mriedem | with the way you have it, heal_allocations would heal the allocations and the port binding profile and the _heal_instance_info_cache task in the compute service running that instance would heal the cache | |
| 13:35:31 | mriedem | sure, i realize that, | |
| 13:35:40 | mriedem | i was thinking of using the cache to pre-filter the list of ports we're asking neutron about | |
| 13:36:05 | mriedem | i.e. if we're iterating over 50 intances that's 50 calls to neutron | |
| 13:36:21 | gibi | mriedem: you mean if we see allocaton key in the cache then we assume that such port doesn't need to be healed? | |
| 13:36:22 | mriedem | i do like that you added an option to skip this though | |
| 13:37:04 | mriedem | gibi: something like that | |
| 13:37:41 | mriedem | if we would have stored some flag in the vif cache originally that indicated the port had resource requests (not necessarily the resource request itself), heal_allocations could have just checked that flag in the cache and then call neutron for that port | |
| 13:38:28 | mriedem | and if ifs and buts were candy and nuts we'd all have a very fine christmas | |
| 13:38:55 | gibi | mriedem: these vifs are created _before_ there was any logic in nova about bandwidth resource. So I don't see how to have a flag in nova about it | |
| 13:39:20 | mriedem | oh right this is also for healing the case of ports attached with qos before stein... | |
| 13:39:22 | mriedem | i forgot about that | |
| 13:39:41 | gibi | this is the case exaclty that the port was attached without allocation | |
| 13:39:54 | mriedem | anyway, i'm trying to optimize something that we just want to work on the first go, so i can drop it | |
| 13:39:57 | gibi | because neutron had support way earlier for bandiwdth | |
| 13:40:50 | mriedem | the other good thing is that heal_allocations now has the option to heal a single instance so if an operator is trying to fix one specific instance (b/c of some user ticket or something) then they can just target that one rather than all instances to heal 1 | |
| 13:41:40 | gibi | yeah, I remember I had to adapt to that improvement at some point | |
| 13:42:52 | gibi | one extra complexity to note. as we need to update both placement and neutron the update cannot be made atomic | |
| 13:43:04 | mriedem | and i'm assuming the earlier patches to refactor the code didn't really require changes to the functional tests the validate the interface (unlike you'd need with unit tests) | |
| 13:43:31 | mriedem | i wrote all of the heal_allocations stuff with functional tests b/c i didn't trust unit tests for testing the interface | |
| 13:43:46 | gibi | mriedem: yes, no functional test was harmed in the refactoring process | |
| 13:43:52 | mriedem | if we fail to update one (placement or neutron) we could rollback the other... | |
| 13:44:03 | mriedem | heh, "no functional test was harmed in the making of this patch" | |
| 13:44:37 | mriedem | ok this has been a useful discussion before i dive in | |
| 13:44:39 | mriedem | thanks | |
| 13:45:06 | gibi | mriedem: yes, rollback is one option. I went for just printing what failed and how to update neutron manually | |
| 13:46:05 | gibi | mriedem: if you think rollback is better then I have to temprarly store the original allocation of the instance before the heal, and put that back if neutron update fails | |
| 13:50:05 | openstackgerrit | Ghanshyam Mann proposed openstack/nova master: Multiple API cleanup changes https://review.opendev.org/666889 | |
| 13:51:14 | mriedem | we could also retry the port updates with a backoff loop, like 3 retries or something if it's a temporary network issue | |
| 13:51:19 | mriedem | could be follow ups though | |
| 13:51:35 | mriedem | i'll keep it in mind when i review and leave a comment | |
| 13:54:42 | gibi | mriedem: ack | |
| 14:02:36 | mriedem | nova meeting happening | |
| 14:10:21 | mriedem | gibi: also in the back of my mind i've been thinking of adding something to the nova-next job post_test_hook, like creating a server, deleting it's allocatoins in placement and then running heal_allocations just to make sure we have integration test coverage as well, but it's lower priority | |
| 14:11:22 | gibi | mriedem: I did similar manual testing for my patch in devstack and discovered bugs. So I agree that it would be useful | |
| 14:20:27 | jrosser | guilhermesp: this lot https://review.opendev.org/#/q/topic:fix-octavia+(status:open+OR+status:merged) | |
| 14:20:54 | jrosser | guilhermesp: there are some jobs running with some depends-on, but it's difficult to see whats going on | |
| 14:21:34 | jrosser | oops -ECHAN | |
| 14:50:10 | openstackgerrit | Surya Seetharaman proposed openstack/nova stable/stein: Grab fresh power state info from the driver https://review.opendev.org/667948 | |
| 14:54:35 | mriedem | http://status.openstack.org/reviews/#nova sure is fun for abandon fodder | |
| 14:55:05 | openstackgerrit | Vrushali Kamde proposed openstack/nova master: Support filtering of hosts by forbidden aggregates https://review.opendev.org/667952 | |
| 14:58:33 | efried | abandon away | |
| 15:09:07 | openstackgerrit | Surya Seetharaman proposed openstack/nova stable/rocky: Grab fresh power state info from the driver https://review.opendev.org/667955 | |
| 15:24:55 | openstackgerrit | Matt Riedemann proposed openstack/nova master: RT: replace _instance_in_resize_state with _is_trackable_migration https://review.opendev.org/560467 | |
| 15:25:18 | mriedem | efried: another rebase on that one ^ | |
| 15:25:39 | mriedem | dansmith: can you come back on this instance.hidden patch https://review.opendev.org/#/c/631123/ ? | |