Earlier  
Posted Nick Remark
#openstack-nova - 2019-06-28
14:33:30 mriedem gibi: i left some comments,
14:34:13 mriedem but i haven't fully thought through which is worse - the port with the binding:profile.allocation set to something when the allocation doesn't exist in neutron vs the allocation existing in neutron but the port binding profile not mapped to that provider
14:34:26 mriedem *doesn't exist in placement
14:35:42 gibi mriedem: if the rollback retry fails the is it OK to ask for the human to help?
14:35:57 gibi I feel at the end we need the human anyhow
14:37:55 gibi if we set the allocation key in neutron without having the allocation placemen then we tell neutron to use a resource that is not really allocated. But the physical bandwidth anyhow was used even before we started to heal
14:38:27 mriedem so the risk there is over-committing the resource right?
14:38:32 mriedem b/c placement isn't tracking the allocation
14:38:54 gibi yes, but the overcommit situation can already exists (hence the need of healing)
14:40:04 mriedem then isn't that better than potentially having the allocations in placement w/o the neutron port binding profile tracking the allocation and if the admin screws up the manual steps, doubling the allocation by re-running the command? iow, it's no different than the situation they could already be in
14:40:32 mriedem if you tried to run the command again we wouldn't heal that instance / port combo b/c the port would already say it's allocated when really it might not be
14:40:55 mriedem i agree there is some amount of "we failed our main objective, and we failed to rollback, you need to step in now" if we get there
14:41:28 mriedem but i would rather we at least *try* to rollback if possible
14:41:36 sean-k-mooney i have not been following two closely but how do you determin currently a port needs healing?
14:41:40 mriedem and it sounds like rolling back the allocation changes is harder since we merged the resources
14:41:59 mriedem sean-k-mooney: it's a port with a resource_request and doesn't have an allocation set in the binding profile
14:42:17 sean-k-mooney that could be a problem
14:42:22 mriedem that makes me think,
14:42:30 mriedem we should also be making sure the port is actually bound to a host right?
14:42:33 gibi mriedem: rolling back the allocations can be done by saving what was the original allocation to restore
14:42:34 sean-k-mooney what about cases where we set the qos policy on a network
14:43:17 mriedem gibi: ...yeah but that could also get messy right b/c we could lose a race and our generation is off
14:43:20 sean-k-mooney we only create the allocation if you pass in the port right
14:43:23 mriedem then what do we do?
14:43:30 mriedem rollling back the port binding profile allocation field seems easier to me
14:43:40 mriedem sean-k-mooney: yes
14:43:49 gibi mriedem: correct, if something else updates the allocation in between then we are rolling back to a wrong allocation
14:43:54 mriedem we do'nt support creating ports on networks with a qos policy
14:44:11 gibi mriedem: rolling back the neutron updated seems easy to me too
14:44:11 sean-k-mooney at all?
14:44:18 gibi easyier
14:44:25 sean-k-mooney or we create the ports but dont request the allcotion
14:44:51 sean-k-mooney because we created the port in the compute node
14:45:14 mriedem sean-k-mooney: this is the code that determines if we need to heal allocations for the port https://review.opendev.org/#/c/637955/28/nova/cmd/manage.py@1783
14:45:26 mriedem sean-k-mooney: we fail
14:45:52 mriedem sean-k-mooney: https://github.com/openstack/nova/blob/master/nova/network/neutronv2/api.py#L468
14:46:26 mriedem gibi: so i think we agree that rolling back the port binding update is simpler than the allocation
14:46:33 gibi mriedem: good point about port bound to a host. But can it be a port with device_id=instance_uuid that is not bound?
14:46:35 mriedem and i'd prefer we include a rollback
14:47:01 mriedem gibi: "But can it be a port with device_id=instance_uuid that is not bound?" that i'm not sure about
14:47:03 mriedem sean-k-mooney: ^
14:47:05 sean-k-mooney ... ok was an api breakage on upgrade but i understand why it was done
14:47:12 mriedem sean-k-mooney: oh i think we can,
14:47:15 mriedem because of shelve offload
14:47:31 mriedem a shelved instance still has its ports and volumes
14:47:37 mriedem but those ports and volumes aren't "bound" to a host
14:47:40 sean-k-mooney yes shelve offloaded would still have the device id set
14:47:47 gibi ack
14:47:55 gibi then I have to check for boundness as well
14:48:01 mriedem does heal_allocations filter out instances that aren't on a host....
14:48:04 mriedem it should implicitly,
14:48:08 mriedem because it's using instance.node
14:48:30 sean-k-mooney gibi: you can filter py ports with vif_type!=vif-unbound
14:48:32 gibi yeah it checks instance.node then
14:48:47 gibi so we don't need an extra vif_type!=vif-unbound check
14:49:17 sean-k-mooney ok i guess that makes sense
14:49:34 gibi if we don't know the where the instance runs then we don't know which RP tree need to be targeted with the healing
14:50:10 gibi mriedem: I don't get your comment at https://review.opendev.org/#/c/668184/1/nova/cmd/manage.py@1891
14:50:11 sean-k-mooney yep. so we whould never need to heal offloaded instances
14:50:47 mriedem now i'm not so sure where we unbind the port on shelve offload,
14:50:49 gibi mriedem: is it about trying the rollback for each port even if one of it fails?
14:50:51 mriedem because https://github.com/openstack/nova/blob/324da0532f3b59aa16233a93a260d289e55860fb/nova/compute/manager.py#L5168 is a noop for neutron
14:51:12 mriedem gibi: yes
14:51:22 mriedem right now you'd stop on the first port update that fails
14:51:31 mriedem i'm saying we should try to clean all and fail at the end
14:51:51 gibi mriedem: OK got it, that could be done
14:52:29 mriedem https://github.com/openstack/nova/blob/324da0532f3b59aa16233a93a260d289e55860fb/nova/network/neutronv2/api.py#L3183
14:52:39 mriedem so are ports unbound when the instance is shelved offloaded....
14:53:12 sean-k-mooney maybe in driver.destroy
14:53:15 sean-k-mooney ill go check
14:53:18 sean-k-mooney but they might not be
14:53:50 sean-k-mooney destroy should at least unplug them
14:54:07 mriedem right
14:54:14 sean-k-mooney acttuly it cant unbind
14:54:18 mriedem driver.destroy just does the unplug, the virt drivers never mess with port bindings
14:54:20 sean-k-mooney because we use destoy in reboot
14:54:28 sean-k-mooney and we dont ubind there
14:54:57 mriedem anyway for gibi's heal_allocation changes it doesn't really matter b/c we wouldn't get that far anyway for a shelved offloaded instance b/c we won't try to heal allocations on an instance that doesn't have a host est
14:55:00 mriedem *set
14:55:46 gibi ack
14:56:08 sean-k-mooney right its just something to look into to see if we have a latent bug
14:56:45 sean-k-mooney i would expect it to happen as part of moving it ot offloaded
14:57:08 sean-k-mooney but i dont think it would break anything either
14:57:10 gibi so in summary. I will add the rollback of neutron updates. But I only add a todo to possible rollback-retry
14:57:36 gibi mriedem: is that works for you ?
14:57:53 gibi mriedem: or you are a bit more on the paranoid side :)
14:58:50 mriedem the backoff retry could be a follow up
14:58:58 gibi mriedem: OK
14:59:00 mriedem try to stagger these changes as much as possible
14:59:04 mriedem since this is a large change
14:59:05 sean-k-mooney oh other edgecase. do we currently allow attaching ports with resouce requests
14:59:13 mriedem sean-k-mooney: no
14:59:17 sean-k-mooney cool
14:59:20 gibi :)
14:59:46 gibi mriedem: I can split out the helper functions from the current commit if that helps
15:00:16 mriedem gibi: sure
15:00:39 mriedem sean-k-mooney: this is where that would fail https://github.com/openstack/nova/blob/324da0532f3b59aa16233a93a260d289e55860fb/nova/compute/api.py#L4408
15:01:25 sean-k-mooney cool. AttachInterfaceWithQoSPolicyNotSupported is slightly missleading but it makes sense
15:01:46 sean-k-mooney we allow attaching ports with QoS polices jut not with min bandwith qos policies
15:01:59 sean-k-mooney e.g. DSCP is fine as is max bandwith

Earlier   Later