CNV-46779: UDN for virt workloads #1673
Conversation
@maiqueb: This pull request references CNV-46779 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
@maiqueb: This pull request references CNV-46779 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target either version "4.18." or "openshift-4.18.", but it targets "CNV v4.18.0" instead.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
This requires an enhancement on passt.

TODO: link the ticket to get an enhancement
@EdDev let me know once we know what we require from passt, and have an open ticket we can track.
Partial review, will continue soon where I left off.
Top notch, thanks
to use the primary network to which the VM is connected for north/south, while
still being able to connect to KAPI and consume Kubernetes DNS.
Missing probes besides KAPI / DNS?
In short, KubeVirt (or a component on behalf
of KubeVirt) creates an `IPAMClaim` resource for each interface in the VM the
Maybe it's worth not mentioning KubeVirt as the one that deploys it, because that is not true and is not going to change.
It's worth saying that a component is responsible for creating ... and mentioning ipam-extensions.
Same for "and KubeVirt instructs the CNI plugin" a few lines below.
I'm trying to protect people who are not KubeVirt savvy from having to understand our shenanigans. And save me from explaining how we got there, what that project is, etc.
I think it doesn't matter much: the point is that whatever component creates the IPAMClaims understands the lifecycle of VMs.
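For readers less familiar with the flow, here is a minimal sketch of what such a per-interface `IPAMClaim` could look like, assuming the k8snetworkplumbingwg `IPAMClaim` CRD; all names and values below are illustrative, not the exact objects the component creates:

```yaml
# Illustrative sketch only - names and values are assumptions.
apiVersion: k8s.cni.cncf.io/v1alpha1
kind: IPAMClaim
metadata:
  name: vm-a.tenantblue          # hypothetical "<vm name>.<logical network>" naming
  namespace: blue-ns
  ownerReferences:               # owned by the VM, so the claim follows the VM lifecycle
    - apiVersion: kubevirt.io/v1
      kind: VirtualMachine
      name: vm-a
      uid: "00000000-0000-0000-0000-000000000000"   # placeholder
spec:
  network: tenantblue            # the logical network this persistent allocation is for
  interface: ovn-udn1            # the pod interface name (see the API discussion further down)
```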
do not require to update the KubeVirt API, which is something we have faced
extreme resistance in the past.

Keep in mind that up to now the IPAMClaims would be created by KubeVirt
Same question as above (I can't link it, because it is in the middle of a "review" so there are no links yet):
KubeVirt is not the one that creates it.
the desired IP addresses for the VM, and when creating the VM, point it to the
respective `IPAMClaim` using an annotation (on the VM).

KubeVirt (or a component on behalf of it) will see the annotation on the VM,
Here it is even an action item, not just something that already exists (so the question is a bit different from the ones above).
The ipam-ext should do it, right? I think it would be better to be precise here and not mention KubeVirt
(even though we have "or a component on behalf ...").
Yes, it would be that component. I just don't understand what anyone gets from having to know there's an extra moving piece.
What is important is that it is something that understands what a VM is.
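To make the flow above concrete, a rough sketch of a VM pointing at a pre-created `IPAMClaim`; the annotation key is purely hypothetical, since the exact API is precisely what is being discussed here:

```yaml
# Hypothetical sketch - the annotation key below does not exist today; it only
# illustrates "point the VM to the respective IPAMClaim via an annotation".
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: vm-a
  namespace: blue-ns
  annotations:
    example.kubevirt.io/ipam-claim: vm-a-primary-udn   # hypothetical key and claim name
spec:
  runStrategy: Always
  template:
    spec:
      domain:
        devices: {}
```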
#### Pre-requirements - configure the primary UDN for a namespace
The user will need to perform the usual steps to configure a primary UDN for
their namespace - that is, to provision a `UserDefinedNetwork` of layer2
Does primary-UDN only support the layer2 topology? I remember the UDN enhancement also talking about layer3.
for the virt integration, we only care about L2.
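For context, a minimal sketch of a layer2 primary `UserDefinedNetwork`, with field names following the UDN API described in the referenced enhancement (names and subnet are illustrative):

```yaml
# Minimal sketch of a layer2 primary UDN; values are illustrative.
apiVersion: k8s.ovn.org/v1
kind: UserDefinedNetwork
metadata:
  name: tenant-blue
  namespace: blue-ns
spec:
  topology: Layer2
  layer2:
    role: Primary                  # acts as the primary network for the namespace
    subnets:
      - "10.100.0.0/16"
    ipam:
      lifecycle: Persistent        # keep VM IPs across live-migration / restart
```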
// The network name for which this persistent allocation was created
Network string `json:"network"`
// The pod interface name for which this allocation was created
Interface string `json:"interface"` // *new* attribute
Maybe it's just my experience, but putting `Network` and `Interface` at the same level kinda suggests that they refer to the VM spec's `Network` and `Interface` fields (which are connected in a 1:1 relationship anyway, AFAIU). They are not the same though, right? The `Interface` field here is talking about the interface inside the pod. Perhaps we should make it more explicit then?
Interface string `json:"interface"` // *new* attribute
PodInterface string `json:"podInterface"` // *new* attribute
It says right there in the comment:

> The pod interface name for which this allocation was created
Live-migration is the bread and butter of any virtualization solution.
OpenShift Virtualization does not currently support live-migration over the
OpenShift default cluster network, for a variety of reasons. The scope of this
enhancement is to define how to use the existent
[primary UDN feature](https://github.com/openshift/enhancements/pull/1623/) to
enable virtualization workloads to migrate over the primary network.
Also, proper virtual machine restart is important, and we are covering it too.
Done. Please resolve if you're OK.
It makes more sense that multiple primary networks will be used in the hosted
cluster in order to provide tenants with better isolation from each other,
without the need for network policy.
There should be no hypershift platform-specific considerations with this
feature.
In case HyperShift moves to primary UDN + passt, the fact that passt does not pass the hostname over DHCP will break HyperShift hosted clusters, since there is an OpenShift check that ensures that nodes are not hostnamed "localhost".
This feature was not requested for HyperShift. I can list this though, but for that, I'd need an open issue to track.
which makes use of the multi requested-chassis OVN feature, in which a single
logical switch port can be bound to multiple chassis (and have traffic
forwarded to them).
Should we also add the passt issues here?
- live migration breaks connections
- no DHCP hostname
The latter is a non-issue for the requirements we have.
The former has been analyzed in the proposal section, and an alternative was provided. I think that's more than enough.
N/A

## Alternatives
An alternative to annotating the virt-launcher pod would be labeling it, so it's easier to filter the IPAMClaim by pod. Now that we are going to reconcile them, that would be possible, but I'm not sure it adds value.
- would currently only work for IPv4. We would need to advertise routes either
from OVN-Kubernetes, or from within the pod (which would require CAP_NET_ADMIN,
which we do **not** want to reintroduce).
- integrating with pod network ecosystem (service mesh, metrics, etc) difficult
Do you remember if the MAC spoofing mechanism was not working with linux bridge, or was it the opposite? I don't know if it would work for passt though.
I don't know what you mean.
There is a MAC spoofing mechanism implemented in the KubeVirt bindings; I don't know if we miss that with passt or not.
No, there's MAC spoofing protection implemented at the bridge CNI.
There's also MAC (and IP) spoofing protection implemented in OVN-K, so we do more. This is orthogonal to the binding used.
This commit adds the summary, motivation, and user stories sections. Signed-off-by: Miguel Duarte Barroso <[email protected]>
…m the subnet range Signed-off-by: Miguel Duarte Barroso <[email protected]>
OpenShift Virtualization does not currently support live-migration over the
OpenShift default cluster network, for a variety of reasons. The scope of this
This is not true. We do support live migration over the pod network with the use of masquerade. I would say we "allow live migration over the cluster network, but it requires a compromise in the form of NAT, which makes the solution unacceptable for many use-cases".
You may argue that the TCP disconnect means it is not "live", but except for that one issue we tick all the boxes of a live migration, and it is called live-migration by developers and users alike. So to avoid any confusion, I suggest changing this sentence.
I argue that it's not only a TCP disconnect, but a TCP disconnect without any chance of ever reconnecting unless the clients go and learn the new IP to contact the VM.
I can compromise on "OpenShift Virtualization live-migration over the default network currently requires a compromise (in the form of NAT) which makes the solution unacceptable for any use case that involves network connectivity".
Is that wrong?
… IP address Signed-off-by: Miguel Duarte Barroso <[email protected]>
Jaime asked to be removed as a reviewer. Signed-off-by: Miguel Duarte Barroso <[email protected]>
This other alternative has the advantage of not requiring a KubeVirt API update
but adds a lot more work in OVN-Kubernetes, and requires updating the de-facto
standard.
Another option:
Another option is to manage IP assignments on `UserDefinedNetwork`, where IPs from the excluded subnet could be assigned to a VM or a Pod by matching its namespace and name.
The benefits of this are most obvious when looking at `ClusterUserDefinedNetwork`. Keeping all the mappings in a single place would make it easier to get a full picture of free and assigned IPs. Keeping the mapping on the UDN may also be considered more secure - the network belongs to the person defining the subnet, in the case of `CUDN` to the cluster-admin. With this approach it will always be this owner giving away static IPs. This mimics the workflow in traditional networks (reserving IPs in the DHCP server configuration), and also the API used by VMware.
The implementation could look like this:
* We keep IPAMClaims as they are
* UDN now has a new section where the cluster/project admin can assign IPs from the excluded subnet to Pods/VMs
* When a Pod is created, OVN checks if it has a reserved IP on the UDN; if it does, it uses it
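For illustration of the suggested alternative, a rough sketch of what such a section could look like; the `reservedIPs` stanza below is entirely hypothetical and does not exist in the UDN API:

```yaml
# Hypothetical sketch of the suggested alternative - the reservedIPs section
# is invented for illustration and is not part of the UDN API.
apiVersion: k8s.ovn.org/v1
kind: UserDefinedNetwork
metadata:
  name: tenant-blue
  namespace: blue-ns
spec:
  topology: Layer2
  layer2:
    role: Primary
    subnets:
      - "10.100.0.0/16"
    reservedIPs:                   # hypothetical: admin-managed static assignments
      - ip: "10.100.200.5"
        target:
          kind: VirtualMachine
          name: vm-a
```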
This is interesting; it approaches the problem in a different way: I had assumed we would want the VM owner to be able to specify the IP; this approach moves that responsibility / ability to the admin.
Without getting into the technical limitations we would face for it (now), it is quite clear we need to clarify which persona we want to be able to indicate the VM's IPs - the admin, or the VM owner.
I've reached out to some stakeholders / field people. I'll clarify once I have heard from them.
Until we know exactly who the "user" is for that, there's no point.
I think the suggested alternative is not limited to a cluster admin; it is just a matter of how it is eventually implemented. IMO the concept itself is the most important part: managing IP pools and ranges in a centralized location and treating individual specific VM control as an exception (e.g. a pool with a single IP in it, but there are probably many other ways to approach the need).
I am splitting the review between the overview & motivation part and the solution proposal.
I am posting my feedback on the first part now.
@@ -0,0 +1,638 @@
---
title: primary-user-defined-networks-for-virtualization-workloads
Mentioning UDN in the title, summary and motivation mixes a solution proposal with the need. It embeds the solution in advance, which is limiting and blocks alternatives (even though that may not be practical at this stage).
I think it will be beneficial to make sure the need does not mention any specific solution, and that the proposal adds the other options that could have been explored, even though it may be too late.
This enhancement is about how to integrate with the existing UDN solution.
While a document such as the one you're suggesting makes sense, studying and identifying what could have been is not our goal.
Virtualization users are commonly used to having layer2 networks, and to having
their tenant networks isolated from other tenants. This is something that
opposes the Kubernetes networking model, in which all pods can reach other pods,
and security is provided by implementing Network Policy.
This part assumes that Virt users require access to the pod/primary network for their needs.
If there is such a need, can you please add it before this paragraph?
In general, the tenancy, isolation and L2 expectations are a need that can be solved through secondary networks. Therefore, the suggestion to use the primary pod network is a solution, not the motivation.
But there may be specific needs to use the primary pod network for the Virt users, so let's list them.
Being specific, AFAICT, you won't be able to use secondary networks in cloud environments (AWS / Azure / ...) for anything other than east / west.
Hence, this type of user (requiring cluster egress on cloud environments) must use the primary network.
@phoracek can you point out some more usages of the pod network?
Done.
### User Stories

- As the owner of an application running inside my VM, I want to have east/west
communication without NAT to other VMs and/or pods within the same network.
From the later user stories we can understand why a single subnet is needed (e.g. for migration).
Is there some other reason NAT or different subnets are rejected by such users?
This is literally what the users are requesting. @phoracek maybe you can provide more insight into their motives?
It is to meet the expectations of the VM workload that users want to migrate to OpenShift.
Users exploring masquerade on the Pod network often raise one of these two issues, preventing them from using it:
- Their existing workload uses the IP seen in the guest as the ID of the system. With masquerade, this IP is always the same, breaking the system.
- Their existing workload would be a distributed application, where each member runs on its own VM. It takes the IP seen inside the guest and reports it to a central system saying "find me here". Of course, this IP is not reachable by others (or goes to loopback).
In short, an application inside a VM expects that its IP is A) unique and B) addressable by its neighbors.
I cannot name the exact applications, but I remember distributed databases and Windows suffering from these problems.
- As a VM owner, I want to be able to specify the IP address for the interface
of my VM.
Can we please examine whether this is a real story from the field, or whether the need makes sense?
Managing a pool of VMs at medium or large scale makes this story less likely to exist in this form.
E.g. if the need is to import VMs from a different platform, the story is different from the need to define IPs per VM individually.
I would expect operators to control the subnet used for a specific group of VMs, or that the sticky IPs which VMs are assigned are retained when converted to OCP-Virt. These stories may end up with a different implementation.
Maybe. As indicated in #1673 (comment), we need to clarify this user story, especially since we don't know who the user is in the story.
- Provide a configurable way for the user to define the IP addresses on a VM's
interface.
I think this is something we need to discuss further.
We need strong reasoning to aim for this per-VM solution vs a more general pool-based management.
I would expect users to be allowed to set the subnet in terms of an IP range and then be assigned sticky IPs from that pool. For scenarios where VMs are imported from other platforms and the IP must be preserved (do we have a reference for such a request?), I would recommend a more centralized general solution. The goal would be to allow such an import, with the implementation listed later.
ditto. This needs clarifications from PMs; waiting on it.
KubeVirt uses a concept called bind mechanisms to extend networking to the VM;
we plan on using the
[passt](https://passt.top/passt/about/#passt-plug-a-simple-socket-transport)
Following the recent talks about splitting the effort into:
- 4.18: using tap/macvtap
- 4.19: using passt
Can you please update the plan?
IMO it is enough to summarize here only that a dedicated binding is chosen, without providing too many details. This should allow us to change the details without the need to update this proposal.
But it is up to the author how to manage this.
I'll just link the binding enhancement. Simpler.
While this would be a native way to integrate with the rest of the persistent
IPs feature, it would be a clumsy flow for primary UDNs, since the user would
be expected to use network selection elements to configure the primary UDN.

Furthermore, we currently do not allow primary UDNs to be attached via network
selection elements. We would need to break this convention.

This route could be interesting in the future though, since we expect this to
become the way to request SR-IOV resources for primary UDNs.
I think this section is only highlighting the negatives of this alternative, but not the positives (or the juxtaposed negatives of the main proposal).
It might look weird and confusing that there are two different ways to use IPAM claims: NSE for secondary networks but a different annotation for primary networks, when we could just go with what we are already doing.
It is highly probable that we will need NSE to customize the primary UDN network in the future, just as the default network can be customized today. When that day comes, everything will look even weirder, because then you would be passing an NSE and you will be asking yourself on which of the two annotations the IPAM claim reference has to go. Best case, we support and are forced to maintain both options.
Moreover, I think there is really no specific reason we don't allow NSE for primary UDNs at this point.
> I think this section is only highlighting the negatives of this alternative, but not the positives (or the juxtaposed negatives of the main proposal).

I'll try to list the benefits.

> It might look weird and confusing that there are two different ways to use IPAM claims: NSE for secondary networks but a different annotation for primary networks, when we could just go with what we are already doing.

I'm not so sure it fits, or that it wouldn't look weirder than the current proposal. Please keep reading.

> It is highly probable that we will need NSE to customize the primary UDN network in the future, just as the default network can be customized today.

I agree it might be needed to customize the primary UDN network.
I also agree there is currently a way to customize the cluster default network - via the multus annotation.
Maybe I just don't understand how this would look - I'm assuming the NSE would have to request the attachment of the default network. For clarity, could you paste how you envision the NSE for - let's say - a primary UDN requesting an IPAMClaim and SR-IOV device?
Assuming I've understood this correctly (and the NSE we need to configure will point to the default network attachment), I think it is weird to put an NSE customizing the default network and have it really apply to / affect the UDN attachment (the IPAMClaim reference, SR-IOV device). After all, you'll need to pass the data corresponding to the default network, but pass the parameters you want to customize for the primary UDN. That, IMHO, doesn't make for a good fit for the API in this particular case. I can argue it's more confusing.
Still, we're evaluating it, and PoCing how it would really look.

> When that day comes, everything will look even weirder, because then you would be passing an NSE and you will be asking yourself on which of the two annotations the IPAM claim reference has to go. Best case, we support and are forced to maintain both options.

Hopefully it won't come to that; when we release this feature in 4.18 we'll have picked one option, and IIUC we would support just that one. I.e. whatever option we end up choosing will be the way to specify this feature.

> Moreover, I think there is really no specific reason we don't allow NSE for primary UDNs at this point.

I think we simply wanted to make it impossible for users to specify a primary UDN as a secondary network.
I guess this could be relaxed to allow it when said NSE is selected as the default via the multus annotation.
> Maybe I just don't understand how this would look - I'm assuming the NSE would have to request the attachment of the default network. For clarity, could you paste how you envision the NSE for - let's say - a primary UDN requesting an IPAMClaim and SR-IOV device?
>
> Assuming I've understood this correctly (and the NSE we need to configure will point to the default network attachment), I think it is weird to put an NSE customizing the default network and have it really apply to / affect the UDN attachment (the IPAMClaim reference, SR-IOV device). After all, you'll need to pass the data corresponding to the default network, but pass the parameters you want to customize for the primary UDN. That, IMHO, doesn't make for a good fit for the API in this particular case. I can argue it's more confusing.

It doesn't look weirder than using the default network name to refer to the primary network status in the network status annotation, as we plan to do, from ovn-kubernetes/ovn-kubernetes#4671:

[
  {
    "name": "ovn-kubernetes",
    "interface": "eth0",
    "ips": [
      "10.244.1.6",
      "fd00:10:244:2::6"
    ],
    "mac": "0a:58:0a:f4:01:06",
    "default": false,
    "dns": {}
  },
  {
    "name": "ovn-kubernetes",
    "interface": "ovn-udn1",
    "ips": [
      "10.128.0.3"
    ],
    "mac": "0a:58:0a:80:00:03",
    "default": true,
    "dns": {}
  }
]

So at this point we have to come to terms with the fact that, from the perspective of this API, when addressing the default network name, we might actually be addressing the default network for the namespace, whatever that might be.
I am not really sure there is a valid argument to say it's fine to do it in `network-status` but not in the NSE, in terms of API (or UX).
> It doesn't look weirder than using the default network name to refer to the primary network status in the network status annotation, as we plan to do, from ovn-org/ovn-kubernetes#4671:
>
> [ { "name": "ovn-kubernetes", "interface": "eth0", "ips": [ "10.244.1.6", "fd00:10:244:2::6" ], "mac": "0a:58:0a:f4:01:06", "default": false, "dns": {} }, { "name": "ovn-kubernetes", "interface": "ovn-udn1", "ips": [ "10.128.0.3" ], "mac": "0a:58:0a:80:00:03", "default": true, "dns": {} } ]

I disagree; to me, it totally does - they are in different elements in the array. The only thing confusing there is the network name; and you can still differentiate using the interface name.
IMHO that's totally different from using an NSE like this one:

[
  {
    "name": "default",            # relates to a NAD with the default cluster network
    "namespace": "blue-ns",
    "ipam-claim-reference": "as-we-have.now"   # customizes the primary UDN attachment
  }
]

In case I am making an error with the NSE contents, please let me know; I've asked for it in #1673 (comment), but you probably didn't see it.

> So at this point we have to come to terms with the fact that, from the perspective of this API, when addressing the default network name, we might actually be addressing the default network for the namespace, whatever that might be.

I will give it another go at the interface reporting PR to see if I can actually reply with the network name of the NAD for the primary UDN. If I manage to, I'd say your comment wouldn't apply any more.
EDIT: I don't think the above is doable.

> I am not really sure there is a valid argument to say it's fine to do it in `network-status` but not in the NSE, in terms of API (or UX).
> I disagree; to me, it totally does - they are in different elements in the array. The only thing confusing there is the network name; and you can still differentiate using the interface name.

Then we agree to disagree.
That's not the only confusing thing. The fact that `"default": true` is set for the network that is not the default network is also confusing.
The fact that the array has two elements and therefore it's fine is tunnel vision specific to your use case; it doesn't mean it is fine API-wise.
Using the interface name to distinguish relies on the unspecified convention that `eth0` is the main interface; that wouldn't be good either.
We agreed we had to redefine what `default` means, and maybe the NSE on the default network affecting the primary network fits under that new definition.
However, the NSE might as well be defined addressing the primary UDN network name; nothing would prevent us from handling it. As long as OVNK is handling the CNI call at the end, it doesn't really matter if the CNI call is for the primary or the default network; we can be aware of it and handle it.
Then the network status would be the opposite, and would report the primary network twice. I guess that is fine as well, if it is fine the other way around.
I'm not saying you couldn't; I'm pointing out that we'd need to reverse the flow of operations. At least in that we can agree.
Replying in this thread to some of your comments.

> That's not the only confusing thing. The fact that `"default": true` is set for the network that is not the default network is also confusing.

This aligns with your next comment - there is a need to redefine what "default" means.

> We agreed we had to redefine what `default` means, and maybe the NSE on the default network affecting the primary network fits under that new definition.

The work-group discussed that "default" would be up to the implementation to define; that's the reason why it is now proposed that the interface order matters in that array - you want a different default? Put it first in the list.
@dougbtv could you keep me honest on this last comment?
Maybe to keep this discussion short, I think it is feasible to use the NSE and reasonable to highlight the positives of that option in the alternative. That would be my whole ask.
> Maybe to keep this discussion short, I think it is feasible to use the NSE and reasonable to highlight the positives of that option in the alternative. That would be my whole ask.

I agree that it is feasible, and have committed to experimenting with it already. We're currently PoCing it.
I've also said I would highlight the positives of it in the alternatives section.
Done.
That's fine, but if you don't see the confusion in saying that the default is the one that comes first in the list while it has ... I hope I am making my point across here. It's like, yeah, it's fine to do this here but not there, because we can't apply the same interpretation that makes it fine in the former to make it fine in the latter.
Not sure if I understand what you mean. The way I am thinking about it is that once we recognize that we got a request for a primary UDN, we could just do the exact same thing we do now.
Wait - no. There's some confusion here. The default network would be the first interface reported in the CNI result. Not in the
Then (maybe it's just me ...) I find it confusing to define in the NSE information for the primary UDN interface but point to the default cluster network (via the name attribute). Yes - I find that confusing: it's within the same element in an array. Its very structure makes me think it points to the same object. And I find it not as bad (which is not saying I find it absolutely fine) to return multiple interfaces with the same name in the status, because those "live" in different elements in the array and can be identified by the name of the interface. In this scenario, the structure of the data makes me interpret it as belonging to different things. Maybe I'm silly for thinking this is not as bad.
I think I understand your point. You see reporting multiple interfaces with the same name as exactly the same as customizing the primary UDN while pointing to the default cluster network in the NSE.
Yes.
is responsible for configuring networking up to the pod interface), we still
need to extend connectivity from the pod interface into the VM.

During live-migration - once it is is scheduled - and the destination
Typo: double "is".
the *destination* pod). Once this pod is ready, the *source* pod transfers the
state of the live VM to the *destination* pod via a connection proxied by the
virt-handlers (the KubeVirt agents running on each node) to the *destination*
pod.
I would recommend adding here some points about how this works, as it has implications on the solution:
- The destination pod is pretty much identical to the source pod, and both pods are expected to be in a "running" state for the period of the migration (which may take tens of seconds).
- During the migration, only the source VM is active and running; the destination one is being prepared. At the very end, the source VM is paused and the destination started.
- The source pod may still be around for some time before finally being removed after a successful migration. I.e. the VM is already stopped on the source, even removed, but the pod may still be in a teardown process, waiting for other dependencies to be cleaned up.
To allow the user to specify the IP address for the workload we plan on adding
an attribute to the KubeVirt API
[interface](https://kubevirt.io/api-reference/main/definitions.html#_v1_interface)
named `ipAddress`. This approach suits the product, and meets the user's
expectations because the user can configure the MAC address for an interface in
that same `Interface` resource.

Regarding the implementation, OpenShift Virtualization would extend what it
does for the MAC address: define it in the k8snetworkplumbingwg de-facto
standard as an IP request, and the OVN-Kubernetes CNI would attempt to honor
it.
Kubevirt is agnostic to IPAM in terms of definition. It only reflects the status of the interfaces' IP details.
Unlike the MAC address, which controls the emulated VM vNIC, the IP is an external service.
The current solution with the `IPAMClaim` is also an external add-on to Kubevirt, and I would prefer this to be continued.
There is also the topic that was recently raised about the necessity and priority of this requirement. If we can add this as an optional future enhancement, it would be great.
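For illustration, a sketch of how the proposed `ipAddress` attribute could sit next to the existing `macAddress` field on a VM interface; the `ipAddress` field is only a proposal in this enhancement and does not exist in the KubeVirt API today, and the binding and network names are illustrative:

```yaml
# Sketch only - `ipAddress` is the proposed (non-existent) attribute.
spec:
  template:
    spec:
      domain:
        devices:
          interfaces:
            - name: podnet
              binding:
                name: passt                     # illustrative binding name
              macAddress: "02:01:02:03:04:05"   # existing KubeVirt API field
              ipAddress: "10.100.200.5"         # proposed new attribute
      networks:
        - name: podnet
          pod: {}
```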
If the tests show that the traffic disruption takes too long, we should
prioritize the
[seamless live-migration epic](https://issues.redhat.com/browse/CNV-27147)
which makes use of the multi requested-chassis OVN feature, in which a single
logical switch port can be bound to multiple chassis (and have traffic
forwarded to them).
From my perspective (representing Kubevirt), this is a must for a seamless migration.
The amount of packets that may be dropped depends on the time it takes for the source pod to be removed, something that depends on many factors in the Kubevirt context.
If you are interested in testing, I suggest examining this with setups that are expected to cause delays in the pod teardown.
The Kubevirt design proposal that suggests the binding to be used mentions this one as a pre-requirement.
Seamless migration is not a goal - live migration is.
I cannot prioritize that epic; PMs must.
FWIW: I think we can tackle this feature in the 4.18 time-frame if we have SDN team help.
It would be especially interesting to know how the "feature" would be configured, since the exact same concept can be used for port-mirroring.
### Dev Preview -> Tech Preview

There will be no dev or tech preview for this feature.
This is a very big risk. How about we mention this in the risk section?
I can, but I don't think this is debatable.
And I feel stupid listing something as a risk without providing any mitigation for it.
## Alternatives

### Binding mechanism
This section will need to be synced with the upstream Kubevirt proposal.
Hopefully that proposal will be posted to the Kubevirt community in a few days, with a plan that fits the technical aspects, maintainability, and the timeline limitations.
integration. This would leave service mesh / ACS / metric integration as the
drawback of this option.

### VM interface IP address configuration
I would suggest that this, and its need, be reconsidered or mentioned as a future optional item.
If this need is still targeted for 4.18 or a future version, I think there are other options to manage the IP pools which are closer to an add-on service, rather than using the IPAMClaim to solve per-VM-vNIC manual IP allocation.
Thanks for the reviews so far.
Virtualization users are commonly used to having layer2 networks, and to having
their tenant networks isolated from other tenants. This is something that
opposes the Kubernetes networking model, in which all pods can reach other pods,
and security is provided by implementing Network Policy.
Done.
While this would be a native way to integrate with the rest of the persistent
IPs feature, it would be a clumsy flow for primary UDNs, since the user would
be expected to use network selection elements to configure the primary UDN.

Furthermore, we currently do not allow primary UDNs to be attached via network
selection elements. We would need to break this convention.

This route could be interesting in the future though, since we expect this to
become the way to request SR-IOV resources for primary UDNs.
Done.
Provisioning the VM in the system will be "watched" by two components; KubeVirt
`virt-controller` and KubeVirt `ipam-extensions`. These components will then:
- virt-controller: template the pod where the VM `vm-a` will run
api-server -->>- kubevirt-ipam-extensions: OK
end
rect rgb(0, 255, 255)
Note over kubevirt-ipam-extensions: VM / VMI controller
These are not the KubeVirt VM / VMI controllers.
We have a controller in the ipam-extensions project that tracks the VM lifecycle (it creates IPAMClaims on behalf of the user, for instance).
I'm just trying to point out that this does not run in a webhook, unlike the section just above this one.
- cloudInitNoCloud:
    networkData: |
      version: 2
      ethernets:
        eth0:
          dhcp4: true
Thanks; good suggestions from both of you.
…v6 / RAs Signed-off-by: Miguel Duarte Barroso <[email protected]>
We realized we do not understand the use case for allowing the user to specify the IP address for their VMs. Thus, we have postponed that goal to a future release, while re-scoping the goal for this one to be about persisting the IP address of the VM migrating from another virtualization platform into OpenShift Virtualization. Signed-off-by: Miguel Duarte Barroso <[email protected]>
…ction Signed-off-by: Miguel Duarte Barroso <[email protected]>
@maiqueb: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
This enhancement discussed 3 items:
The first item was assigned to another OpenShift virt team, and is present in kubevirt/community#325. The second item was chosen, and implemented. It is already available in the product. The third item fell out of 4.18 scope. Hence, closing this enhancement proposal.
This PR provides an OpenShift enhancement describing the virtualization-specific integrations required for the primary UDN feature, which was designed in #1623.
It builds on top of enhancement #1456, in which we design how to have persistent IPs for VM workloads' secondary interfaces of layer2 or localnet type.
The overall goal is to provide OpenShift Virtualization workloads with the ability to live-migrate on UDN networks; UDN networks are interesting for virt users since they are managed (have IPAM) and are closer to what a virt user expects - isolated networks - instead of following the Kubernetes networking model, in which all pods can communicate between themselves and connectivity is limited via Network Policy.