| Posted | Nick | Remark | |
|---|---|---|---|
| #openstack-nova - 2019-06-28 | |||
| 14:38:27 | mriedem | so the risk there is over-committing the resource right? | |
| 14:38:32 | mriedem | b/c placement isn't tracking the allocation | |
| 14:38:54 | gibi | yes, but the overcommit situation can already exists (hence the need of healing) | |
| 14:40:04 | mriedem | then isn't that better than potentially having the allocations in placement w/o the neutron port binding profile tracking the allocation and if the admin screws up the manual steps, doubling the allocation by re-running the command? iow, it's no different than the situation they could already be in | |
| 14:40:32 | mriedem | if you tried to run the command again we wouldn't heal that instance / port combo b/c the port would already say it's allocated when really it might not be | |
| 14:40:55 | mriedem | i agree there is some amount of "we failed our main objective, and we failed to rollback, you need to step in now" if we get there | |
| 14:41:28 | mriedem | but i would rather we at least *try* to rollback if possible | |
| 14:41:36 | sean-k-mooney | i have not been following two closely but how do you determin currently a port needs healing? | |
| 14:41:40 | mriedem | and it sounds like rolling back the allocation changes is harder since we merged the resources | |
| 14:41:59 | mriedem | sean-k-mooney: it's a port with a resource_request and doesn't have an allocation set in the binding profile | |
| 14:42:17 | sean-k-mooney | that could be a problem | |
| 14:42:22 | mriedem | that makes me think, | |
| 14:42:30 | mriedem | we should also be making sure the port is actually bound to a host right? | |
| 14:42:33 | gibi | mriedem: rolling back the allocations can be done by saving what was the original allocation to restore | |
| 14:42:34 | sean-k-mooney | what about cases where we set the qos policy on a network | |
| 14:43:17 | mriedem | gibi: ...yeah but that could also get messy right b/c we could lose a race and our generation is off | |
| 14:43:20 | sean-k-mooney | we only create the allocation if you pass in the port right | |
| 14:43:23 | mriedem | then what do we do? | |
| 14:43:30 | mriedem | rollling back the port binding profile allocation field seems easier to me | |
| 14:43:40 | mriedem | sean-k-mooney: yes | |
| 14:43:49 | gibi | mriedem: correct, if something else updates the allocation in between then we are rolling back to a wrong allocation | |
| 14:43:54 | mriedem | we do'nt support creating ports on networks with a qos policy | |
| 14:44:11 | gibi | mriedem: rolling back the neutron updated seems easy to me too | |
| 14:44:11 | sean-k-mooney | at all? | |
| 14:44:18 | gibi | easyier | |
| 14:44:25 | sean-k-mooney | or we create the ports but dont request the allcotion | |
| 14:44:51 | sean-k-mooney | because we created the port in the compute node | |
| 14:45:14 | mriedem | sean-k-mooney: this is the code that determines if we need to heal allocations for the port https://review.opendev.org/#/c/637955/28/nova/cmd/manage.py@1783 | |
| 14:45:26 | mriedem | sean-k-mooney: we fail | |
| 14:45:52 | mriedem | sean-k-mooney: https://github.com/openstack/nova/blob/master/nova/network/neutronv2/api.py#L468 | |
| 14:46:26 | mriedem | gibi: so i think we agree that rolling back the port binding update is simpler than the allocation | |
| 14:46:33 | gibi | mriedem: good point about port bound to a host. But can it be a port with device_id=instance_uuid that is not bound? | |
| 14:46:35 | mriedem | and i'd prefer we include a rollback | |
| 14:47:01 | mriedem | gibi: "But can it be a port with device_id=instance_uuid that is not bound?" that i'm not sure about | |
| 14:47:03 | mriedem | sean-k-mooney: ^ | |
| 14:47:05 | sean-k-mooney | ... ok was an api breakage on upgrade but i understand why it was done | |
| 14:47:12 | mriedem | sean-k-mooney: oh i think we can, | |
| 14:47:15 | mriedem | because of shelve offload | |
| 14:47:31 | mriedem | a shelved instance still has its ports and volumes | |
| 14:47:37 | mriedem | but those ports and volumes aren't "bound" to a host | |
| 14:47:40 | sean-k-mooney | yes shelve offloaded would still have the device id set | |
| 14:47:47 | gibi | ack | |
| 14:47:55 | gibi | then I have to check for boundness as well | |
| 14:48:01 | mriedem | does heal_allocations filter out instances that aren't on a host.... | |
| 14:48:04 | mriedem | it should implicitly, | |
| 14:48:08 | mriedem | because it's using instance.node | |
| 14:48:30 | sean-k-mooney | gibi: you can filter py ports with vif_type!=vif-unbound | |
| 14:48:32 | gibi | yeah it checks instance.node then | |
| 14:48:47 | gibi | so we don't need an extra vif_type!=vif-unbound check | |
| 14:49:17 | sean-k-mooney | ok i guess that makes sense | |
| 14:49:34 | gibi | if we don't know the where the instance runs then we don't know which RP tree need to be targeted with the healing | |
| 14:50:10 | gibi | mriedem: I don't get your comment at https://review.opendev.org/#/c/668184/1/nova/cmd/manage.py@1891 | |
| 14:50:11 | sean-k-mooney | yep. so we whould never need to heal offloaded instances | |
| 14:50:47 | mriedem | now i'm not so sure where we unbind the port on shelve offload, | |
| 14:50:49 | gibi | mriedem: is it about trying the rollback for each port even if one of it fails? | |
| 14:50:51 | mriedem | because https://github.com/openstack/nova/blob/324da0532f3b59aa16233a93a260d289e55860fb/nova/compute/manager.py#L5168 is a noop for neutron | |
| 14:51:12 | mriedem | gibi: yes | |
| 14:51:22 | mriedem | right now you'd stop on the first port update that fails | |
| 14:51:31 | mriedem | i'm saying we should try to clean all and fail at the end | |
| 14:51:51 | gibi | mriedem: OK got it, that could be done | |
| 14:52:29 | mriedem | https://github.com/openstack/nova/blob/324da0532f3b59aa16233a93a260d289e55860fb/nova/network/neutronv2/api.py#L3183 | |
| 14:52:39 | mriedem | so are ports unbound when the instance is shelved offloaded.... | |
| 14:53:12 | sean-k-mooney | maybe in driver.destroy | |
| 14:53:15 | sean-k-mooney | ill go check | |
| 14:53:18 | sean-k-mooney | but they might not be | |
| 14:53:50 | sean-k-mooney | destroy should at least unplug them | |
| 14:54:07 | mriedem | right | |
| 14:54:14 | sean-k-mooney | acttuly it cant unbind | |
| 14:54:18 | mriedem | driver.destroy just does the unplug, the virt drivers never mess with port bindings | |
| 14:54:20 | sean-k-mooney | because we use destoy in reboot | |
| 14:54:28 | sean-k-mooney | and we dont ubind there | |
| 14:54:57 | mriedem | anyway for gibi's heal_allocation changes it doesn't really matter b/c we wouldn't get that far anyway for a shelved offloaded instance b/c we won't try to heal allocations on an instance that doesn't have a host est | |
| 14:55:00 | mriedem | *set | |
| 14:55:46 | gibi | ack | |
| 14:56:08 | sean-k-mooney | right its just something to look into to see if we have a latent bug | |
| 14:56:45 | sean-k-mooney | i would expect it to happen as part of moving it ot offloaded | |
| 14:57:08 | sean-k-mooney | but i dont think it would break anything either | |
| 14:57:10 | gibi | so in summary. I will add the rollback of neutron updates. But I only add a todo to possible rollback-retry | |
| 14:57:36 | gibi | mriedem: is that works for you ? | |
| 14:57:53 | gibi | mriedem: or you are a bit more on the paranoid side :) | |
| 14:58:50 | mriedem | the backoff retry could be a follow up | |
| 14:58:58 | gibi | mriedem: OK | |
| 14:59:00 | mriedem | try to stagger these changes as much as possible | |
| 14:59:04 | mriedem | since this is a large change | |
| 14:59:05 | sean-k-mooney | oh other edgecase. do we currently allow attaching ports with resouce requests | |
| 14:59:13 | mriedem | sean-k-mooney: no | |
| 14:59:17 | sean-k-mooney | cool | |
| 14:59:20 | gibi | :) | |
| 14:59:46 | gibi | mriedem: I can split out the helper functions from the current commit if that helps | |
| 15:00:16 | mriedem | gibi: sure | |
| 15:00:39 | mriedem | sean-k-mooney: this is where that would fail https://github.com/openstack/nova/blob/324da0532f3b59aa16233a93a260d289e55860fb/nova/compute/api.py#L4408 | |
| 15:01:25 | sean-k-mooney | cool. AttachInterfaceWithQoSPolicyNotSupported is slightly missleading but it makes sense | |
| 15:01:46 | sean-k-mooney | we allow attaching ports with QoS polices jut not with min bandwith qos policies | |
| 15:01:59 | sean-k-mooney | e.g. DSCP is fine as is max bandwith | |
| 15:02:40 | sean-k-mooney | its really we dont allow attaching with resource requests. | |
| 15:02:42 | gibi | sean-k-mooney: file a bug, I'm happy to amend the message of AttachInterfaceWithQoSPolicyNotSupported exception with a more specific message | |
| 15:04:08 | mriedem | efried: the reason i didn't jump on https://storyboard.openstack.org/#!/story/2005258 was because of the osc semantic patterns to follow for other osc commands as mentioned in the comments - i gets a bit hairy | |
| 15:04:30 | mriedem | rather than just like a simple --append or --remove | |
| 15:04:31 | sean-k-mooney | gibi: sure i could but we will also be changing that at some point too right | |
| 15:04:53 | sean-k-mooney | e.g. we will eventrually allow it if the host can fulfil the request but like not until U+ | |