KEP-4381: Define standard device attributes #5316
Conversation
---
/lgtm
The motivation is clear. I was worried that it would be hard to define the exact meaning of a hardware topology description, but punting that problem to the Linux kernel makes it someone else's problem, and no worse than what exists already...
> capacities for standardization by the Kubernetes project. This reservation
> allows us to define common attributes that can describe hardware characteristics
> across resources from different vendors. Currently, we are defining two such
> standard attributes: `k8s.sio/numaNode` and `k8s.io/pcieRoot`. Details on their
Suggested change:
- standard attributes: `k8s.sio/numaNode` and `k8s.io/pcieRoot`. Details on their
+ standard attributes: `k8s.io/numaNode` and `k8s.io/pcieRoot`. Details on their
---
Thanks. Corrected :)
---
Don't rely on `numa_node` in the sysfs hierarchy. It is broken on modern hardware.
> // Access (NUMA) node within the system's NUMA topology. This attribute can
> // be used to describe which NUMA node a device is physically associated
> // with. DRA drivers MAY discover this value for PCI devices via
> // sysfs, for example, by reading `/sys/bus/pci/devices/<PCI_ADDRESS>/numa_node`.
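For concreteness, here is a minimal Go sketch of that sysfs discovery path (a hypothetical helper, not code from the KEP): it reads the kernel-reported `numa_node` file for a given PCI address. The kernel reports `-1` when no NUMA affinity is known, and, as the next comments point out, the value can be misleading with SNC/NPS enabled.

```go
// Hypothetical sketch of the sysfs discovery described above; not part of
// the KEP. Reads /sys/bus/pci/devices/<PCI_ADDRESS>/numa_node, which the
// kernel sets to -1 when no NUMA affinity is known.
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

func pciNUMANode(pciAddress string) (int64, error) {
	path := fmt.Sprintf("/sys/bus/pci/devices/%s/numa_node", pciAddress)
	raw, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	// The file contains a single decimal integer followed by a newline.
	return strconv.ParseInt(strings.TrimSpace(string(raw)), 10, 64)
}

func main() {
	node, err := pciNUMANode("0000:3b:00.0") // example PCI address
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("numa_node:", node)
}
```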
---
The NUMA node reported by the kernel in that particular place in sysfs does not represent the real hardware topology when SNC (Intel) or NPS (AMD) is active. Please don't rely on it.
I'm OK with standardizing the PCIe root attribute, but not NUMA.
Another reminder: NUMA represents only the memory zone/mode of operation of the memory controller, and it has nothing to do with PCIe bandwidth or CPU-core-to-device alignment.
---
I agree. Let's focus just on `pcieRoot` for now. The GA criteria only require that we standardize on one, so let's start with that. We can revisit NUMA and/or other fields later.
---
This is great feedback, thanks for pointing that out @kad. I certainly wasn't familiar with this nuance. Given this understanding, I agree that "NUMA represents only the memory zone/mode of operation of the memory controller" and that NIC alignment through the NUMA node may not really be useful.
The two main kinds of alignment that I'm trying to find a solution for are:
- In a multi-socket system, I want devices that belong to the same socket (like a CPU and its associated NIC).
- In the case of multiple PCIe domains (accounting for the fact that a modern single CPU socket can have multiple PCIe domains), I want devices from the same PCIe hierarchy (like a GPU and a NIC).
I think that tackling these two alignment challenges right from the start should cover the vast majority of use cases. @klueska I absolutely agree with you that the GA criteria just ask us to standardize one attribute, but if you and other reviewers here feel that `k8s.io/cpuSocketNumber` is also clear enough for most vendors to use, it would be fantastic if we could get both of these important things addressed from the get-go.
---
Again, even within a socket, PCIe roots might be at different distances from a particular core, depending on the vendor and/or generation of the hardware. There is no standard here. Some of this is explained here: https://sched.co/1i7ke
If we are talking about devices, the key part is the PCIe bus. All other components in the system should declare which bus they are related to, rather than assuming that "within a socket we are all OK".
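To make that concrete, here is a hedged Go sketch (hypothetical, not from the KEP or any existing driver) of how a driver could derive a PCIe-root identifier for a device: entries under `/sys/bus/pci/devices` are symlinks into `/sys/devices/pci<domain>:<bus>/...`, and that leading component names the root complex the device hangs off.

```go
// Hypothetical sketch: derive a PCIe-root identifier for a device from
// sysfs by resolving its symlink and taking the leading "pci<domain>:<bus>"
// path component, which names the root complex.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func pcieRoot(pciAddress string) (string, error) {
	resolved, err := filepath.EvalSymlinks(filepath.Join("/sys/bus/pci/devices", pciAddress))
	if err != nil {
		return "", err
	}
	for _, part := range strings.Split(resolved, string(filepath.Separator)) {
		// Root-complex directories look like "pci0000:3a".
		if strings.HasPrefix(part, "pci") && strings.Contains(part, ":") {
			return part, nil
		}
	}
	return "", fmt.Errorf("no PCI root component in %q", resolved)
}

func main() {
	root, err := pcieRoot("0000:3b:00.0") // example PCI address
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("pcieRoot:", root)
}
```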
---
Great presentation, thanks for sharing.
I kinda agree that for "normal" devices (anything that is part of the PCIe hierarchy), it's far better if they specify the PCIe root, so that IF the end user wishes, they can request devices within the same PCIe root.
But a user could also request a pairing of a "CPU request together with a NIC", which becomes especially important in multi-CPU systems. As I understand it, the community may eventually work towards a CPU DRA driver, wherein CPU/memory resources can also be made available as some form of quantized DRA devices.
For such a scenario, the user should be able to express something like "give me a CPU and a NIC within that CPU" (I've updated the use-case example in this KEP accordingly). Since the CPU is not within the PCIe hierarchy (but rather a parent of the PCIe hierarchy), we need something beyond PCIe that allows us to group things that are all part of, or belong to, a single CPU. Such a use case lets users AVOID allocating a CPU from one socket and a NIC attached to the other CPU, which would force traffic through the QPI links, and as I understand it that can be a bottleneck.
I'd agree that "PCIe roots might have different distance to particular core", but for a pairing like a CPU and its NIC, being able to say "within socket we are all ok" is still better than having communication that crosses the QPI (or inter-CPU) interconnects.
Thoughts?
---
I totally agree that we should not use the NUMA index as an attribute.
However, I think we still need a standard device attribute for the physical socket.
In the architectures above, I need a `k8s.io/cpuSocketNumber` to select four {X}PUs under the same CPU socket. Here, `k8s.io/pcieRoot` doesn't work:
- For the dual-root configuration, every two {X}PUs are connected to separate PCIe roots.
- For the direct-attached configuration, each {X}PU is connected to its own root port.
For more context, please see this document.
---
The thing is that the CPU hierarchy, I/O hierarchy, and memory hierarchy are different entities. Overall, there is no reliable way to standardize them across vendors or generations. If we start trying to define some common denominator, we will end up with the same problem we have with device plugins that use the NUMA number as a "common" index.
Practically, if someone has really specific alignment needs, it will come down to vendor-specific attributes sooner rather than later, because those will be very generation/model specific. For basic things, the PCIe root is good enough to align devices. If the DRA driver is more aware of some PCIe switches in between, or of specifics like in BG's picture above where 4 roots are "equal" in a sense, then it can expose a vendor-specific attribute that can be used for selection instead of the exact PCIe root number.
For CPUs, if someone really wants to expose them via DRA, groups of CPUs can declare locality to one or more PCIe roots. However, I believe DRA for CPUs is not the best idea: based on practical experiments with different allocation algorithms, it is a lot cheaper to migrate workloads based on actual needs and the current state of the system than to try to predict the future state of the system (taking native resources into account).
There is also another aspect: both CPUs and memory are not really "flat" resources; they are hierarchical. Statements like "it will be OK within the socket" are exactly traversal up a tree-like hierarchy: if you can't find a perfect solution, the next layer up will still be OK... but not always. Inter-socket links nowadays have huge bandwidth, so for the majority of workloads they are not going to be bottlenecks.
---
Conversations like this are precisely why I think we should start with just the only one we can all agree on: `pcieRoot`. The GA criteria clearly state that we only need one standardized attribute as a prerequisite for graduating DRA to GA, and that's the most important thing right now.
I'd suggest modifying this PR to only add the single attribute of `k8s.io/pcieRoot` (so we can merge it now) and opening a separate PR to continue the conversation about additional attributes (which could still be merged in this release cycle if we can come to an agreement, but won't block this one).
---
I see; I'm okay with adding just `pcieRoot` as the standard attribute for faster GA promotion. Let's discuss the `cpuSocket` attribute separately in another issue or PR.
For reference, in the DRA driver I'm working on, the CPU socket index is exposed using an attribute called `{privateDomainPrefix}/cpuDomain`.
---
@klueska Yes definitely, in the broader interest of unblocking the GA requirements, let's move forward with `pcieRoot` for now. I've updated the PR to reflect that.
I'll send a separate PR for `cpuSocketNumber` and reference this discussion. I still believe we do need `cpuSocketNumber` as one of the initial standard attributes and will try to get it in before the KEP freeze.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: gauravkghildiyal. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
(Force-pushed from 15077fb to 10cdfb3.)
/cc @bg-chun
(Force-pushed from aaed901 to 7dda2e0.)
@@ -568,7 +568,7 @@ spec:
   version: 11.1.42
   powerSavingSupported:
     bool: true
-  dra.k8s.io/pciRoot: # a fictional standardized attribute, not actually part of this KEP
+  k8s.io/pcieRoot:
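For context on how such an attribute would be consumed, here is a minimal sketch assuming the `resource.k8s.io/v1beta1` Go types, placeholder device class names, and the attribute name as written in the diff above (none of this is from the KEP text): a claim requesting a GPU and a NIC constrained to the same PCIe root via `matchAttribute`.

```go
// Minimal sketch of consuming the attribute, assuming the
// resource.k8s.io/v1beta1 Go types; device class names are placeholders.
package main

import (
	"fmt"

	resourceapi "k8s.io/api/resource/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	pcieRoot := resourceapi.FullyQualifiedName("k8s.io/pcieRoot")
	claim := &resourceapi.ResourceClaim{
		ObjectMeta: metav1.ObjectMeta{Name: "gpu-nic-aligned"},
		Spec: resourceapi.ResourceClaimSpec{
			Devices: resourceapi.DeviceClaim{
				Requests: []resourceapi.DeviceRequest{
					{Name: "gpu", DeviceClassName: "gpu.example.com"},
					{Name: "nic", DeviceClassName: "nic.example.com"},
				},
				// Require both allocated devices to report the same value
				// for the standardized PCIe-root attribute.
				Constraints: []resourceapi.DeviceConstraint{{
					Requests:       []string{"gpu", "nic"},
					MatchAttribute: &pcieRoot,
				}},
			},
		},
	}
	fmt.Printf("%+v\n", claim)
}
```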
---
Before we commit to the label, do we use `k8s.io` or `kubernetes.io`? The latter is the one used in most of the well-known labels: https://kubernetes.io/docs/reference/labels-annotations-taints/
---
Let's use `kubernetes.io`. "kubernetes.io" is the preferred form for labels and annotations; "k8s.io" should not be used for new map keys.
---
Thanks for pointing that out. The guidelines seem pretty clear. Changed.
/lgtm
Based on the existing guidelines #5316 (comment) and the impressive technical discussions #5316 (comment), this attribute seems to be non-contentious and useful for all the different community stakeholders.
`kubernetes.io/pcieRoot`. Standard attributes will allow users to request hardware topology alignment for their workloads, even when devices are managed by different DRA drivers.
/wg device-management
/assign @johnbelamaric
/assign @klueska
/assign @pohly