Earlier  
Posted Nick Remark
#openstack-nova - 2019-06-28
14:40:55 mriedem i agree there is some amount of "we failed our main objective, and we failed to rollback, you need to step in now" if we get there
14:41:28 mriedem but i would rather we at least *try* to rollback if possible
14:41:36 sean-k-mooney i have not been following two closely but how do you determin currently a port needs healing?
14:41:40 mriedem and it sounds like rolling back the allocation changes is harder since we merged the resources
14:41:59 mriedem sean-k-mooney: it's a port with a resource_request and doesn't have an allocation set in the binding profile
14:42:17 sean-k-mooney that could be a problem
14:42:22 mriedem that makes me think,
14:42:30 mriedem we should also be making sure the port is actually bound to a host right?
14:42:33 gibi mriedem: rolling back the allocations can be done by saving what was the original allocation to restore
14:42:34 sean-k-mooney what about cases where we set the qos policy on a network
14:43:17 mriedem gibi: ...yeah but that could also get messy right b/c we could lose a race and our generation is off
14:43:20 sean-k-mooney we only create the allocation if you pass in the port right
14:43:23 mriedem then what do we do?
14:43:30 mriedem rollling back the port binding profile allocation field seems easier to me
14:43:40 mriedem sean-k-mooney: yes
14:43:49 gibi mriedem: correct, if something else updates the allocation in between then we are rolling back to a wrong allocation
14:43:54 mriedem we do'nt support creating ports on networks with a qos policy
14:44:11 gibi mriedem: rolling back the neutron updated seems easy to me too
14:44:11 sean-k-mooney at all?
14:44:18 gibi easyier
14:44:25 sean-k-mooney or we create the ports but dont request the allcotion
14:44:51 sean-k-mooney because we created the port in the compute node
14:45:14 mriedem sean-k-mooney: this is the code that determines if we need to heal allocations for the port https://review.opendev.org/#/c/637955/28/nova/cmd/manage.py@1783
14:45:26 mriedem sean-k-mooney: we fail
14:45:52 mriedem sean-k-mooney: https://github.com/openstack/nova/blob/master/nova/network/neutronv2/api.py#L468
14:46:26 mriedem gibi: so i think we agree that rolling back the port binding update is simpler than the allocation
14:46:33 gibi mriedem: good point about port bound to a host. But can it be a port with device_id=instance_uuid that is not bound?
14:46:35 mriedem and i'd prefer we include a rollback
14:47:01 mriedem gibi: "But can it be a port with device_id=instance_uuid that is not bound?" that i'm not sure about
14:47:03 mriedem sean-k-mooney: ^
14:47:05 sean-k-mooney ... ok was an api breakage on upgrade but i understand why it was done
14:47:12 mriedem sean-k-mooney: oh i think we can,
14:47:15 mriedem because of shelve offload
14:47:31 mriedem a shelved instance still has its ports and volumes
14:47:37 mriedem but those ports and volumes aren't "bound" to a host
14:47:40 sean-k-mooney yes shelve offloaded would still have the device id set
14:47:47 gibi ack
14:47:55 gibi then I have to check for boundness as well
14:48:01 mriedem does heal_allocations filter out instances that aren't on a host....
14:48:04 mriedem it should implicitly,
14:48:08 mriedem because it's using instance.node
14:48:30 sean-k-mooney gibi: you can filter py ports with vif_type!=vif-unbound
14:48:32 gibi yeah it checks instance.node then
14:48:47 gibi so we don't need an extra vif_type!=vif-unbound check
14:49:17 sean-k-mooney ok i guess that makes sense
14:49:34 gibi if we don't know the where the instance runs then we don't know which RP tree need to be targeted with the healing
14:50:10 gibi mriedem: I don't get your comment at https://review.opendev.org/#/c/668184/1/nova/cmd/manage.py@1891
14:50:11 sean-k-mooney yep. so we whould never need to heal offloaded instances
14:50:47 mriedem now i'm not so sure where we unbind the port on shelve offload,
14:50:49 gibi mriedem: is it about trying the rollback for each port even if one of it fails?
14:50:51 mriedem because https://github.com/openstack/nova/blob/324da0532f3b59aa16233a93a260d289e55860fb/nova/compute/manager.py#L5168 is a noop for neutron
14:51:12 mriedem gibi: yes
14:51:22 mriedem right now you'd stop on the first port update that fails
14:51:31 mriedem i'm saying we should try to clean all and fail at the end
14:51:51 gibi mriedem: OK got it, that could be done
14:52:29 mriedem https://github.com/openstack/nova/blob/324da0532f3b59aa16233a93a260d289e55860fb/nova/network/neutronv2/api.py#L3183
14:52:39 mriedem so are ports unbound when the instance is shelved offloaded....
14:53:12 sean-k-mooney maybe in driver.destroy
14:53:15 sean-k-mooney ill go check
14:53:18 sean-k-mooney but they might not be
14:53:50 sean-k-mooney destroy should at least unplug them
14:54:07 mriedem right
14:54:14 sean-k-mooney acttuly it cant unbind
14:54:18 mriedem driver.destroy just does the unplug, the virt drivers never mess with port bindings
14:54:20 sean-k-mooney because we use destoy in reboot
14:54:28 sean-k-mooney and we dont ubind there
14:54:57 mriedem anyway for gibi's heal_allocation changes it doesn't really matter b/c we wouldn't get that far anyway for a shelved offloaded instance b/c we won't try to heal allocations on an instance that doesn't have a host est
14:55:00 mriedem *set
14:55:46 gibi ack
14:56:08 sean-k-mooney right its just something to look into to see if we have a latent bug
14:56:45 sean-k-mooney i would expect it to happen as part of moving it ot offloaded
14:57:08 sean-k-mooney but i dont think it would break anything either
14:57:10 gibi so in summary. I will add the rollback of neutron updates. But I only add a todo to possible rollback-retry
14:57:36 gibi mriedem: is that works for you ?
14:57:53 gibi mriedem: or you are a bit more on the paranoid side :)
14:58:50 mriedem the backoff retry could be a follow up
14:58:58 gibi mriedem: OK
14:59:00 mriedem try to stagger these changes as much as possible
14:59:04 mriedem since this is a large change
14:59:05 sean-k-mooney oh other edgecase. do we currently allow attaching ports with resouce requests
14:59:13 mriedem sean-k-mooney: no
14:59:17 sean-k-mooney cool
14:59:20 gibi :)
14:59:46 gibi mriedem: I can split out the helper functions from the current commit if that helps
15:00:16 mriedem gibi: sure
15:00:39 mriedem sean-k-mooney: this is where that would fail https://github.com/openstack/nova/blob/324da0532f3b59aa16233a93a260d289e55860fb/nova/compute/api.py#L4408
15:01:25 sean-k-mooney cool. AttachInterfaceWithQoSPolicyNotSupported is slightly missleading but it makes sense
15:01:46 sean-k-mooney we allow attaching ports with QoS polices jut not with min bandwith qos policies
15:01:59 sean-k-mooney e.g. DSCP is fine as is max bandwith
15:02:40 sean-k-mooney its really we dont allow attaching with resource requests.
15:02:42 gibi sean-k-mooney: file a bug, I'm happy to amend the message of AttachInterfaceWithQoSPolicyNotSupported exception with a more specific message
15:04:08 mriedem efried: the reason i didn't jump on https://storyboard.openstack.org/#!/story/2005258 was because of the osc semantic patterns to follow for other osc commands as mentioned in the comments - i gets a bit hairy
15:04:30 mriedem rather than just like a simple --append or --remove
15:04:31 sean-k-mooney gibi: sure i could but we will also be changing that at some point too right
15:04:53 sean-k-mooney e.g. we will eventrually allow it if the host can fulfil the request but like not until U+
15:04:58 gibi sean-k-mooney: based on my current speed of progress with the server move operation patches it will be far in the future
15:05:29 gibi sean-k-mooney: but yes, in theory I would like to get the support for attach interface if the request fitts on the current host
15:05:52 sean-k-mooney well i know that neutorn want to start modeing ip pools in placment and use resouce resquest in the future
15:06:08 sean-k-mooney so i expect it to become more common
15:06:15 gibi yeah

Earlier   Later