Search Results

Posted	Nick	Remark
#openstack-nova - 2019-06-27
13:13:51	gibi	mriedem: the previous patches just refactored the heal allocation code path
13:14:10	gibi	to make it easy to add the port healing
13:15:30	gibi	mriedem: the basic structure of final patch
13:18:00	gibi	mriedem: during _heal_allocations_for_instance first we get if the instance has ports that needs healing
13:18:26	gibi	this _get_port_allocations_to_heal is the majority of the added code there
13:19:01	gibi	as it detect is there is any port that needs healing and generates the allocation fragment for that port
13:19:33	mriedem	ok i'm assuming just checking if the port has a resource request (if hitting neutron directly) or checking the info cache for a bound port with the allocatoin in the binding profile?
13:19:52	gibi	we are hitting neutron directly
13:20:02	mriedem	hmm, that's kind of expensive isn't it?
13:20:19	mriedem	why can't we use the info cache to tell if the vif has a port allocation?
13:20:49	mriedem	or is that some kind of chicken and egg issue? or we don't trust the cache?
13:20:55	gibi	we need to hit neutron to see if there is resource request on the port
13:21:00	gibi	that is not part of the vif cache
13:21:23	mriedem	sure but the binding profile in the vif cache has a thing that tells us if it had a resource request right?
13:21:44	gibi	mriedem: now, the binding profile in the cache only could tell us if there is allocation for the port
13:21:52	openstackgerrit	Eric Fried proposed openstack/nova-specs master: WIP: Physical TPM passthrough https://review.opendev.org/667926
13:21:57	gibi	but if there is no allocation then we don't know if there is a need for such allocation
13:22:48	mriedem	so let's say the binding profile in the vif cache doesn't have that allocation marker, but we find a port with a resource request - we're going to heal the allocation by putting allocations to placement for that instance right?
13:23:03	mriedem	are we also going to update the cache to indicate it's now got an allocatoin?
13:23:14	gibi	mriedem: right, and also updating neutron with the rp uuid
13:23:27	gibi	mriedem: hm, we are not updating the info cache
13:23:29	gibi	:/
13:24:16	mriedem	yeah, so i'm wondering about the scenario that our cache is busted, we go tell placement there is an allocation, but then what happens when we teardown that port/server in the compute service? are we basing that on the cache? b/c if so, we'll leak the allocation correcT?
13:24:25	openstackgerrit	Eric Fried proposed openstack/nova master: WIP: Physical TPM passthrough https://review.opendev.org/667928
13:25:07	mriedem	similarly are we only creating the port allocation in heal_allocations if the port is bound?
13:25:13	gibi	mriedem: we cannot leak allocation during delete as delete deletes everything for the consumer
13:25:28	mriedem	gibi: what about when we detach the port though?
13:25:31	mriedem	but don't delete the server
13:26:31	gibi	I hope that code path works based on the allocation key in the binding profile, but let me check
13:26:57	mriedem	https://github.com/openstack/nova/blob/7b769ad403751268c60b095f722437cbed692071/nova/network/neutronv2/api.py#L1672
13:27:00	mriedem	looks like it does...
13:27:08	gibi	mriedem: in the other hand, is there any way to trigger a info_cache refresh or we need to do it manually?
13:28:33	mriedem	there is a periodic task in the compute service that will forcefully rebuild the cache from info in neutron,
13:28:44	mriedem	but i don't think that code is doing anything with the resource request / allocation information yet
13:28:55	mriedem	might have been something i brought up when reviewing the series in stein
13:29:30	mriedem	this is the call from the compute task https://github.com/openstack/nova/blob/7b769ad403751268c60b095f722437cbed692071/nova/compute/manager.py#L7445
13:30:20	mriedem	this is the neutronv2 api method that builds the vif object in the cache https://github.com/openstack/nova/blob/7b769ad403751268c60b095f722437cbed692071/nova/network/neutronv2/api.py#L2842
13:30:35	mriedem	we copy the binding:profile directly off the port https://github.com/openstack/nova/blob/7b769ad403751268c60b095f722437cbed692071/nova/network/neutronv2/api.py#L2887
13:31:23	mriedem	maybe that's enough? the 'allocation' is in the binding:profile yeah? and that maps to the resource provider uuid on which the allocation belongs
13:31:53	gibi	copying binding:profile is enough as it has the 'allocaton' key
13:32:06	gibi	and that points to the RP
13:32:10	gibi	we allocate from
13:32:54	gibi	mriedem: this is where the heal port allocation makes sure that neutron binding:profile is updated https://review.opendev.org/#/c/637955/28/nova/cmd/manage.py@1857
13:33:19	openstackgerrit	Merged openstack/nova stable/stein: libvirt: Rework 'EBUSY' (SIGKILL) error handling code path https://review.opendev.org/667389
13:34:22	mriedem	ok so going back to my original question, cache vs neutron directly, i guess heal_allocations is going to the source of truth first (neutron) and working based on that, which is more fool-proof if possibly less performant
13:34:53	mriedem	if we healed based on the cache, and the cache was stale or missing information somehow, then we would fail to heal allocations for the ports in that busted cache
13:35:16	gibi	mriedem: there is no way to figure out what to heal if we don't hit neutron. the info cache only good to see if there is allocation or not for a port, but only neutron knows if the port needs allocation
13:35:19	mriedem	with the way you have it, heal_allocations would heal the allocations and the port binding profile and the _heal_instance_info_cache task in the compute service running that instance would heal the cache
13:35:31	mriedem	sure, i realize that,
13:35:40	mriedem	i was thinking of using the cache to pre-filter the list of ports we're asking neutron about
13:36:05	mriedem	i.e. if we're iterating over 50 intances that's 50 calls to neutron
13:36:21	gibi	mriedem: you mean if we see allocaton key in the cache then we assume that such port doesn't need to be healed?
13:36:22	mriedem	i do like that you added an option to skip this though
13:37:04	mriedem	gibi: something like that
13:37:41	mriedem	if we would have stored some flag in the vif cache originally that indicated the port had resource requests (not necessarily the resource request itself), heal_allocations could have just checked that flag in the cache and then call neutron for that port
13:38:28	mriedem	and if ifs and buts were candy and nuts we'd all have a very fine christmas
13:38:55	gibi	mriedem: these vifs are created _before_ there was any logic in nova about bandwidth resource. So I don't see how to have a flag in nova about it
13:39:20	mriedem	oh right this is also for healing the case of ports attached with qos before stein...
13:39:22	mriedem	i forgot about that
13:39:41	gibi	this is the case exaclty that the port was attached without allocation
13:39:54	mriedem	anyway, i'm trying to optimize something that we just want to work on the first go, so i can drop it
13:39:57	gibi	because neutron had support way earlier for bandiwdth
13:40:50	mriedem	the other good thing is that heal_allocations now has the option to heal a single instance so if an operator is trying to fix one specific instance (b/c of some user ticket or something) then they can just target that one rather than all instances to heal 1
13:41:40	gibi	yeah, I remember I had to adapt to that improvement at some point
13:42:52	gibi	one extra complexity to note. as we need to update both placement and neutron the update cannot be made atomic
13:43:04	mriedem	and i'm assuming the earlier patches to refactor the code didn't really require changes to the functional tests the validate the interface (unlike you'd need with unit tests)
13:43:31	mriedem	i wrote all of the heal_allocations stuff with functional tests b/c i didn't trust unit tests for testing the interface
13:43:46	gibi	mriedem: yes, no functional test was harmed in the refactoring process
13:43:52	mriedem	if we fail to update one (placement or neutron) we could rollback the other...
13:44:03	mriedem	heh, "no functional test was harmed in the making of this patch"
13:44:37	mriedem	ok this has been a useful discussion before i dive in
13:44:39	mriedem	thanks
13:45:06	gibi	mriedem: yes, rollback is one option. I went for just printing what failed and how to update neutron manually
13:46:05	gibi	mriedem: if you think rollback is better then I have to temprarly store the original allocation of the instance before the heal, and put that back if neutron update fails
13:50:05	openstackgerrit	Ghanshyam Mann proposed openstack/nova master: Multiple API cleanup changes https://review.opendev.org/666889
13:51:14	mriedem	we could also retry the port updates with a backoff loop, like 3 retries or something if it's a temporary network issue
13:51:19	mriedem	could be follow ups though
13:51:35	mriedem	i'll keep it in mind when i review and leave a comment
13:54:42	gibi	mriedem: ack
14:02:36	mriedem	nova meeting happening
14:10:21	mriedem	gibi: also in the back of my mind i've been thinking of adding something to the nova-next job post_test_hook, like creating a server, deleting it's allocatoins in placement and then running heal_allocations just to make sure we have integration test coverage as well, but it's lower priority
14:11:22	gibi	mriedem: I did similar manual testing for my patch in devstack and discovered bugs. So I agree that it would be useful
14:20:27	jrosser	guilhermesp: this lot https://review.opendev.org/#/q/topic:fix-octavia+(status:open+OR+status:merged)
14:20:54	jrosser	guilhermesp: there are some jobs running with some depends-on, but it's difficult to see whats going on
14:21:34	jrosser	oops -ECHAN
14:50:10	openstackgerrit	Surya Seetharaman proposed openstack/nova stable/stein: Grab fresh power state info from the driver https://review.opendev.org/667948
14:54:35	mriedem	http://status.openstack.org/reviews/#nova sure is fun for abandon fodder
14:55:05	openstackgerrit	Vrushali Kamde proposed openstack/nova master: Support filtering of hosts by forbidden aggregates https://review.opendev.org/667952
14:58:33	efried	abandon away
15:09:07	openstackgerrit	Surya Seetharaman proposed openstack/nova stable/rocky: Grab fresh power state info from the driver https://review.opendev.org/667955
15:24:55	openstackgerrit	Matt Riedemann proposed openstack/nova master: RT: replace _instance_in_resize_state with _is_trackable_migration https://review.opendev.org/560467
15:25:18	mriedem	efried: another rebase on that one ^
15:25:39	mriedem	dansmith: can you come back on this instance.hidden patch https://review.opendev.org/#/c/631123/ ?
15:26:06	dansmith	<insert hidden pun here>
15:26:13	mriedem	dansmith.hidden = True
15:27:40	efried	mriedem: I can't even see where the conflict was on that one.
15:28:46	efried	oh, never mind
15:30:52	dansmith	mriedem: is the hidden check in the api really required? aren't you filtering out hidden instances from the list query?
15:31:35	mriedem	dansmith: i'd have to look again to confirm but i believe there is a window where both are not hidden
15:31:38	mriedem	while swapping over

Earlier Later