Search Results

Posted	Nick	Remark
#openstack-nova - 2019-06-28
14:33:30	mriedem	gibi: i left some comments,
14:34:13	mriedem	but i haven't fully thought through which is worse - the port with the binding:profile.allocation set to something when the allocation doesn't exist in neutron vs the allocation existing in neutron but the port binding profile not mapped to that provider
14:34:26	mriedem	*doesn't exist in placement
14:35:42	gibi	mriedem: if the rollback retry fails the is it OK to ask for the human to help?
14:35:57	gibi	I feel at the end we need the human anyhow
14:37:55	gibi	if we set the allocation key in neutron without having the allocation placemen then we tell neutron to use a resource that is not really allocated. But the physical bandwidth anyhow was used even before we started to heal
14:38:27	mriedem	so the risk there is over-committing the resource right?
14:38:32	mriedem	b/c placement isn't tracking the allocation
14:38:54	gibi	yes, but the overcommit situation can already exists (hence the need of healing)
14:40:04	mriedem	then isn't that better than potentially having the allocations in placement w/o the neutron port binding profile tracking the allocation and if the admin screws up the manual steps, doubling the allocation by re-running the command? iow, it's no different than the situation they could already be in
14:40:32	mriedem	if you tried to run the command again we wouldn't heal that instance / port combo b/c the port would already say it's allocated when really it might not be
14:40:55	mriedem	i agree there is some amount of "we failed our main objective, and we failed to rollback, you need to step in now" if we get there
14:41:28	mriedem	but i would rather we at least try to rollback if possible
14:41:36	sean-k-mooney	i have not been following two closely but how do you determin currently a port needs healing?
14:41:40	mriedem	and it sounds like rolling back the allocation changes is harder since we merged the resources
14:41:59	mriedem	sean-k-mooney: it's a port with a resource_request and doesn't have an allocation set in the binding profile
14:42:17	sean-k-mooney	that could be a problem
14:42:22	mriedem	that makes me think,
14:42:30	mriedem	we should also be making sure the port is actually bound to a host right?
14:42:33	gibi	mriedem: rolling back the allocations can be done by saving what was the original allocation to restore
14:42:34	sean-k-mooney	what about cases where we set the qos policy on a network
14:43:17	mriedem	gibi: ...yeah but that could also get messy right b/c we could lose a race and our generation is off
14:43:20	sean-k-mooney	we only create the allocation if you pass in the port right
14:43:23	mriedem	then what do we do?
14:43:30	mriedem	rollling back the port binding profile allocation field seems easier to me
14:43:40	mriedem	sean-k-mooney: yes
14:43:49	gibi	mriedem: correct, if something else updates the allocation in between then we are rolling back to a wrong allocation
14:43:54	mriedem	we do'nt support creating ports on networks with a qos policy
14:44:11	gibi	mriedem: rolling back the neutron updated seems easy to me too
14:44:11	sean-k-mooney	at all?
14:44:18	gibi	easyier
14:44:25	sean-k-mooney	or we create the ports but dont request the allcotion
14:44:51	sean-k-mooney	because we created the port in the compute node
14:45:14	mriedem	sean-k-mooney: this is the code that determines if we need to heal allocations for the port https://review.opendev.org/#/c/637955/28/nova/cmd/manage.py@1783
14:45:26	mriedem	sean-k-mooney: we fail
14:45:52	mriedem	sean-k-mooney: https://github.com/openstack/nova/blob/master/nova/network/neutronv2/api.py#L468
14:46:26	mriedem	gibi: so i think we agree that rolling back the port binding update is simpler than the allocation
14:46:33	gibi	mriedem: good point about port bound to a host. But can it be a port with device_id=instance_uuid that is not bound?
14:46:35	mriedem	and i'd prefer we include a rollback
14:47:01	mriedem	gibi: "But can it be a port with device_id=instance_uuid that is not bound?" that i'm not sure about
14:47:03	mriedem	sean-k-mooney: ^
14:47:05	sean-k-mooney	... ok was an api breakage on upgrade but i understand why it was done
14:47:12	mriedem	sean-k-mooney: oh i think we can,
14:47:15	mriedem	because of shelve offload
14:47:31	mriedem	a shelved instance still has its ports and volumes
14:47:37	mriedem	but those ports and volumes aren't "bound" to a host
14:47:40	sean-k-mooney	yes shelve offloaded would still have the device id set
14:47:47	gibi	ack
14:47:55	gibi	then I have to check for boundness as well
14:48:01	mriedem	does heal_allocations filter out instances that aren't on a host....
14:48:04	mriedem	it should implicitly,
14:48:08	mriedem	because it's using instance.node
14:48:30	sean-k-mooney	gibi: you can filter py ports with vif_type!=vif-unbound
14:48:32	gibi	yeah it checks instance.node then
14:48:47	gibi	so we don't need an extra vif_type!=vif-unbound check
14:49:17	sean-k-mooney	ok i guess that makes sense
14:49:34	gibi	if we don't know the where the instance runs then we don't know which RP tree need to be targeted with the healing
14:50:10	gibi	mriedem: I don't get your comment at https://review.opendev.org/#/c/668184/1/nova/cmd/manage.py@1891
14:50:11	sean-k-mooney	yep. so we whould never need to heal offloaded instances
14:50:47	mriedem	now i'm not so sure where we unbind the port on shelve offload,
14:50:49	gibi	mriedem: is it about trying the rollback for each port even if one of it fails?
14:50:51	mriedem	because https://github.com/openstack/nova/blob/324da0532f3b59aa16233a93a260d289e55860fb/nova/compute/manager.py#L5168 is a noop for neutron
14:51:12	mriedem	gibi: yes
14:51:22	mriedem	right now you'd stop on the first port update that fails
14:51:31	mriedem	i'm saying we should try to clean all and fail at the end
14:51:51	gibi	mriedem: OK got it, that could be done
14:52:29	mriedem	https://github.com/openstack/nova/blob/324da0532f3b59aa16233a93a260d289e55860fb/nova/network/neutronv2/api.py#L3183
14:52:39	mriedem	so are ports unbound when the instance is shelved offloaded....
14:53:12	sean-k-mooney	maybe in driver.destroy
14:53:15	sean-k-mooney	ill go check
14:53:18	sean-k-mooney	but they might not be
14:53:50	sean-k-mooney	destroy should at least unplug them
14:54:07	mriedem	right
14:54:14	sean-k-mooney	acttuly it cant unbind
14:54:18	mriedem	driver.destroy just does the unplug, the virt drivers never mess with port bindings
14:54:20	sean-k-mooney	because we use destoy in reboot
14:54:28	sean-k-mooney	and we dont ubind there
14:54:57	mriedem	anyway for gibi's heal_allocation changes it doesn't really matter b/c we wouldn't get that far anyway for a shelved offloaded instance b/c we won't try to heal allocations on an instance that doesn't have a host est
14:55:00	mriedem	*set
14:55:46	gibi	ack
14:56:08	sean-k-mooney	right its just something to look into to see if we have a latent bug
14:56:45	sean-k-mooney	i would expect it to happen as part of moving it ot offloaded
14:57:08	sean-k-mooney	but i dont think it would break anything either
14:57:10	gibi	so in summary. I will add the rollback of neutron updates. But I only add a todo to possible rollback-retry
14:57:36	gibi	mriedem: is that works for you ?
14:57:53	gibi	mriedem: or you are a bit more on the paranoid side :)
14:58:50	mriedem	the backoff retry could be a follow up
14:58:58	gibi	mriedem: OK
14:59:00	mriedem	try to stagger these changes as much as possible
14:59:04	mriedem	since this is a large change
14:59:05	sean-k-mooney	oh other edgecase. do we currently allow attaching ports with resouce requests
14:59:13	mriedem	sean-k-mooney: no
14:59:17	sean-k-mooney	cool
14:59:20	gibi	:)
14:59:46	gibi	mriedem: I can split out the helper functions from the current commit if that helps
15:00:16	mriedem	gibi: sure
15:00:39	mriedem	sean-k-mooney: this is where that would fail https://github.com/openstack/nova/blob/324da0532f3b59aa16233a93a260d289e55860fb/nova/compute/api.py#L4408
15:01:25	sean-k-mooney	cool. AttachInterfaceWithQoSPolicyNotSupported is slightly missleading but it makes sense
15:01:46	sean-k-mooney	we allow attaching ports with QoS polices jut not with min bandwith qos policies
15:01:59	sean-k-mooney	e.g. DSCP is fine as is max bandwith

Earlier Later