KEP-4381: Define standard device attributes #5316

Open. Wants to merge 4 commits into master.

Conversation

@gauravkghildiyal (Member) commented May 18, 2025

  • One-line PR description: This PR introduces one standard device attribute: kubernetes.io/pcieRoot. Standard attributes will allow users to request hardware topology alignment for their workloads, even when devices are managed by different DRA drivers.
  • Other comments: None
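For context, this is the kind of cross-driver alignment request a standard attribute enables; a hypothetical sketch using DRA's `matchAttribute` constraint (the device class names are illustrative, not part of this PR):

```yaml
# Hypothetical sketch: a ResourceClaim asking for a GPU and a NIC from
# (possibly) different DRA drivers, constrained to share the same PCIe root.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: gpu-nic-aligned
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: gpu.example.com   # illustrative class name
    - name: nic
      deviceClassName: nic.example.com   # illustrative class name
    constraints:
    - requests: ["gpu", "nic"]
      matchAttribute: kubernetes.io/pcieRoot
```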

/wg device-management
/assign @johnbelamaric
/assign @klueska
/assign @pohly

@k8s-ci-robot k8s-ci-robot added the wg/device-management Categorizes an issue or PR as relevant to WG Device Management. label May 18, 2025
@k8s-ci-robot k8s-ci-robot requested a review from dchen1107 May 18, 2025 23:39
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 18, 2025
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 18, 2025
@pohly (Contributor) left a comment


/lgtm

The motivation is clear. I was worried that it would be hard to define the exact meaning of a hardware topology description, but punting that problem to the Linux kernel makes that a problem for someone else and not worse than what exists already...

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 19, 2025
@pohly pohly moved this from 🆕 New to 👀 In review in Dynamic Resource Allocation May 19, 2025
capacities for standardization by the Kubernetes project. This reservation
allows us to define common attributes that can describe hardware characteristics
across resources from different vendors. Currently, we are defining two such
standard attributes: `k8s.sio/numaNode` and `k8s.io/pcieRoot`. Details on their
Contributor

Suggested change
standard attributes: `k8s.sio/numaNode` and `k8s.io/pcieRoot`. Details on their
standard attributes: `k8s.io/numaNode` and `k8s.io/pcieRoot`. Details on their

Member Author

Thanks. Corrected :)

@kad (Member) left a comment

Don't rely on numa_node in the sysfs hierarchy. It is broken on modern hardware.

// Access (NUMA) node within the system's NUMA topology. This attribute can
// be used to describe which NUMA node a device is physically associated
// with. DRA drivers MAY discover this value for PCI devices via
// sysfs, for example, by reading `/sys/bus/pci/devices/<PCI_ADDRESS>/numa_node`.
Member

The NUMA value reported by the kernel in that particular place in sysfs does not represent the real hardware topology when SNC (Intel) or NPS (AMD) is active. Please don't rely on it.
I'm OK with standardizing the PCIe root attribute, but not NUMA.

Another reminder: NUMA represents only the memory zone/mode of operation of the memory controller, and it has nothing to do with PCIe bandwidth or CPU-core-to-device alignment.

Contributor

I agree. Let's focus just on pcieRoot for now. The GA criteria only require that we standardize on one, so let's start with that. We can revisit NUMA and/or other fields later.

Member Author

This is great feedback, thanks for pointing that out @kad. I certainly wasn't familiar with this nuance. Given this understanding, I agree with "NUMA represent only memory zone/mode of operation of Memory controller" and that NIC alignment through NUMA node may not really be useful.

The two main kinds of alignment that I'm trying to find a solution for are:

  • In a multi-socket system, I want devices that belong to the same socket (like a CPU and its associated NIC)
  • In case of multiple PCIe domains (accounting for the fact that maybe a modern single CPU socket can have multiple PCIe domains), I want devices from the same PCIe hierarchy (like a GPU and a NIC)

I think that tackling these two alignment challenges right from the start should cover the vast majority of use cases. @klueska I absolutely agree that the GA criteria just ask us to standardize one attribute, but if you and the other reviewers here feel that k8s.io/cpuSocketNumber is also clear enough for most vendors to use, it would be fantastic if we could get both of these important things addressed from the get-go.

@kad (Member) commented May 20, 2025

Again, even within socket, PCIe roots might have different distance to particular core depending on the vendor and/or generation of the hardware. There is no standard here. Some of such explained here: https://sched.co/1i7ke

If we are talking about devices, the key part is the PCIe bus. All other components in the system should declare which bus they are related to, not make the assumption that "within a socket we are all ok".

Member Author

Great presentation, thanks for sharing.

I kinda agree with the idea that for "normal" devices (anything part of the PCIe hierarchy), sure it's far better if they specify the PCIe root, so that IF the end user wishes, they can request devices within the same PCIe root.

But a user could also request a pairing of a "CPU request together with a NIC", which becomes especially important in multi-CPU systems. As I understand it, the community may potentially work towards a CPU DRA driver, wherein CPU/memory resources can also be made available as some form of quantized DRA devices.

For such a scenario, the user should be able to express something like "give me a CPU and a NIC within that CPU" (I've updated the Use Case example in this KEP for the same). Since the CPU is not within the PCIe hierarchy (but rather a parent of the PCIe hierarchy), we need something beyond PCIe that allows us to group things that are all part of, or belong to, a single CPU. Such a use case lets users AVOID allocating a CPU from one socket together with a NIC attached to the other CPU, which would force traffic through the QPI links, which as I understand can be a bottleneck.

I'd agree that "PCIe roots might have different distance to particular core", but for a pairing like a CPU and its NIC, being able to say "within socket we are all ok" is still better than having communication that crosses the QPI (or inter-CPU) interconnects.

Thoughts?


Member

I totally agree that we should not use the NUMA index as an attribute.
However, I think we still need a standard device attribute for the physical socket.
In the architectures above, I need a k8s.io/cpuSocketNumber to select four {X}PUs under the same CPU socket. Here, k8s.io/pcieRoot doesn't work.
For the dual-root configuration, every two {X}PUs are connected to separate PCIe roots.
For the direct-attached configuration, each {X}PU is connected to its own root port.
For more context, please see this document.

[Screenshots: topology diagrams for the dual-root and direct-attached configurations described above]

Member

Thing is, CPU hierarchy, I/O hierarchy, and memory hierarchy are different entities. Overall, there is no reliable way to standardize them across vendors or generations. If we start trying to define some common denominator, we will end up with the same problem we have with device plugins that use the NUMA number as a "common" index.

Practically, if someone has really specific needs on alignment, it will come down to vendor-specific attributes sooner rather than later, because those will be really generation/model specific. For basic things, the PCIe root is good enough to align devices. If the DRA driver is aware of some PCIe switches in between, or of specifics like in BG's picture above where 4 roots are "equal" in a sense, then it can expose a vendor-specific attribute that can be used for selection instead of the exact PCIe root number. For CPUs, if someone really wants to expose them via DRA, groups of CPUs can declare locality to one or more PCIe roots. However, I believe DRA for CPUs is not the best idea: based on practical experiments with different allocation algorithms, it is a lot cheaper to migrate workloads based on actual needs and the current state of the system than to try to predict the future state of the system (taking native resources into account). There is also another aspect: both CPUs and memory are not really "flat" resources, they are hierarchical. The sentence "it will be ok within socket" is exactly a traversal of a tree-like hierarchy: if you can't find a perfect solution, the next layer up will still be OK... but not always. Inter-socket links nowadays have huge bandwidth, so for the majority of workloads they are not going to be bottlenecks.

Contributor

Conversations like this are precisely why I think we should start with just the only one we can all agree on -- pcieRoot. The GA criteria clearly state that we only need one standardized attribute as a prerequisite for graduating DRA to GA and that's the most important thing right now.

I'd suggest modifying this PR to only add the single attribute of k8s.io/pcieRoot (so we can merge it now) and opening a separate PR to continue the conversation about additional attributes (which could still be merged in this release cycle if we come to an agreement, but won't block this one).

Member

I see, I'm okay with adding just pcieRoot as the standard attribute for faster GA promotion.
Let's discuss the cpuSocket attribute separately in another issue or PR.
For reference, in the DRA driver I'm working on, the CPU socket index is exposed using an attribute called {privateDomainPrefix}/cpuDomain.

Member Author

@klueska Yes definitely, in the broader interest of unblocking the GA requirements, let's move forward with pcieRoot for now. I've updated the PR to reflect that.

I'll send a separate PR for cpuSocketNumber and reference this discussion. I still believe we do need cpuSocketNumber as one of the initial standard attributes, and I'll try to get it in before the KEP freeze.

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: gauravkghildiyal
Once this PR has been reviewed and has the lgtm label, please ask for approval from johnbelamaric. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 20, 2025
@gauravkghildiyal gauravkghildiyal force-pushed the kep-4381-dra-standard-attributes branch from 15077fb to 10cdfb3 Compare May 20, 2025 03:53
@johnbelamaric (Member)

/cc @bg-chun

@johnbelamaric (Member) commented May 21, 2025

cc @pravk03 @aojea

If we don't standardize a CPU socket attribute, we may need to have a way for the DRANET and CPU drivers to be configured to publish one under a private name (e.g., gke.google.com)

@gauravkghildiyal gauravkghildiyal force-pushed the kep-4381-dra-standard-attributes branch from aaed901 to 7dda2e0 Compare May 27, 2025 01:52
@@ -568,7 +568,7 @@ spec:
version: 11.1.42
powerSavingSupported:
bool: true
dra.k8s.io/pciRoot: # a fictional standardized attribute, not actually part of this KEP
k8s.io/pcieRoot:
Member

Before we commit to the label, do we use k8s.io or kubernetes.io?

The latter is the one used in most of the well-known labels: https://kubernetes.io/docs/reference/labels-annotations-taints/

@johnbelamaric @klueska @gauravkghildiyal @pohly

Contributor

Let's use kubernetes.io.

https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#label-selector-and-annotation-conventions:

"kubernetes.io" is the preferred form for labels and annotations, "k8s.io" should not be used for new map keys.

Member Author

Thanks for pointing that out. The guidelines seem pretty clear. Changed.

@aojea (Member) commented May 27, 2025

/lgtm

based on the existing guidelines #5316 (comment) and the impressive technical discussions #5316 (comment) this attribute seems to be non-contentious and useful for all the different community stakeholders

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 27, 2025
Labels
cncf-cla: yes · kind/kep · lgtm · sig/node · size/M · wg/device-management
Projects
Status: 👀 In review
9 participants