KEP-5328: Node Capabilities #5347

Open: pravk03 wants to merge 1 commit into base: master

@pravk03 pravk03 commented May 28, 2025

  • One-line PR description: Add the initial KEP for KEP 5328: Node Capabilities
  • Other comments:

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels May 28, 2025
@k8s-ci-robot
Contributor

Welcome @pravk03!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 28, 2025
@k8s-ci-robot
Contributor

Hi @pravk03. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label May 28, 2025
@pravk03 pravk03 marked this pull request as draft May 28, 2025 00:47
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 28, 2025
@pravk03 pravk03 force-pushed the node-capabilities branch 2 times, most recently from 59e7e54 to 4719180 Compare May 28, 2025 00:59
Member

@wojtek-t wojtek-t left a comment

@pravk03 pravk03 force-pushed the node-capabilities branch 3 times, most recently from 4c11e06 to 9254f9b Compare May 28, 2025 23:11
@pravk03 pravk03 changed the title KEP-5328: Node Capability Aware Scheduling KEP-5328: Node Capabilities May 28, 2025
@pravk03 pravk03 marked this pull request as ready for review May 28, 2025 23:14
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 28, 2025
@k8s-ci-robot k8s-ci-robot requested a review from mrunalp May 28, 2025 23:14
@pravk03
Author

pravk03 commented May 29, 2025

/cc @tallclair @yujuhong

@pravk03 pravk03 force-pushed the node-capabilities branch from 9254f9b to f8291a4 Compare May 29, 2025 01:06
@sanposhiho
Member

/sig scheduling

@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label May 29, 2025
@pravk03 pravk03 force-pushed the node-capabilities branch from 31f7ade to b90c0d0 Compare June 17, 2025 06:57

* Validate that the kube-scheduler plugin filters nodes based on `node.status.capabilities` when the feature is enabled, and ignores the field when the feature is disabled.
* Validate that `node.status.capabilities` is correctly populated when the feature is enabled, and the field is cleared from the `Node` object when the feature is disabled.
* Validate that the Admission Controller correctly fetches and validates requests against capabilities when the feature is enabled, and does not block requests if the feature is disabled.

Member

Those are all good tests, but these are feature tests - not enablement/disablement.

An enablement/disablement test (as stated in the comment in the template above) is one that switches the feature gate in the middle of the test.
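
As an illustration of that distinction, here is a minimal Go sketch of an enablement/disablement check that flips the gate in the middle of the run and verifies that the field is populated, cleared, and populated again. The `Node`, `Kubelet`, and `Sync` names and the `example.io/some-capability` key are hypothetical stand-ins, not types or keys from the KEP or from k/k; a real test would toggle the actual feature gate against a running component.

```go
package main

import "fmt"

// Node is a hypothetical, minimal stand-in for the Node object.
type Node struct {
	// Capabilities stands in for node.status.capabilities.
	Capabilities map[string]string
}

// Kubelet is a hypothetical stand-in for the component that populates the field.
type Kubelet struct {
	// GateEnabled stands in for the feature gate under test.
	GateEnabled bool
}

// Sync populates the capabilities when the gate is on and clears them when it is off.
func (k *Kubelet) Sync(n *Node) {
	if !k.GateEnabled {
		n.Capabilities = nil
		return
	}
	n.Capabilities = map[string]string{"example.io/some-capability": "true"}
}

func main() {
	node := &Node{}
	kubelet := &Kubelet{GateEnabled: true}

	// Phase 1: feature enabled, field should be populated.
	kubelet.Sync(node)
	fmt.Println("enabled, capabilities:", len(node.Capabilities))

	// Phase 2: switch the gate off mid-test, field should be cleared.
	kubelet.GateEnabled = false
	kubelet.Sync(node)
	fmt.Println("disabled, capabilities:", len(node.Capabilities))

	// Phase 3: switch the gate back on, field should be populated again.
	kubelet.GateEnabled = true
	kubelet.Sync(node)
	fmt.Println("re-enabled, capabilities:", len(node.Capabilities))
}
```

The point of the three phases is that a pure feature test only ever exercises one of them; the enable, disable, re-enable sequence is what catches stale state left behind after a rollback.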

Author

Updated. Please take a look.


Yes. The size of the Node object is expected to increase as more capabilities are introduced. The number of capabilities exported will be limited by strategies such as:
1. Automatically handling feature graduation, which includes ceasing to export a capability once it matures or is no longer needed.
2. Exporting only configurations that are relevant to the control plane.

Member

friendly ping

@pravk03 pravk03 force-pushed the node-capabilities branch from b90c0d0 to 4beba06 Compare June 17, 2025 18:50
@SergeyKanzhelev
Member

We discussed this KEP yesterday on a call. Some notes from that call:

Many examples in this KEP of past enhancements that might have used or may use capabilities are not accurate.

  1. For example, LMSConfig is unlikely to use capabilities as a way to understand which LMSConfig is enabled, because AppArmor will also want to know that the specific profile is installed. So some sort of LMSConfig object would be a better design. Also, many DaemonSets today use both profiles and kubelet just picks whatever is applicable. This will be much harder if LMSConfig is introduced as a capability: each DaemonSet will need to be declared in 3 shapes, one with SELinux, one with AppArmor, and one without either.

  2. Swap is not a good fit for capabilities as a way to discover that the Node has swap configured. Even though it may be an easy way to "avoid API review and introduce a capability", a capability is very limited in its functionality. For swap discoverability, swap-specific node status (allocatables?) is a better option.

  3. Runtime handlers as a list of handlers is also not a good fit. The default handler runc is not specified in the pod spec, so it will not be used by the scheduler and by definition must not be added to capabilities. Non-default handlers may need more details on what they are, and the list of names may not fit into the value length limits. A special object representing the runtime is a better choice here.

Examples where capabilities are useful are:

  1. Feature gates with a specific field in the Pod Spec, like Sidecar, PodLevelResources, etc. Basically, discoverability of whether the new field will work on a given kubelet.
  2. A capability like feature.kubernetes.io/guaranteedQOSPodCPUResize, representing the fact that the Feature Gate, while in alpha or beta, had a certain limitation before and now this limitation has been lifted. Often it is lifted with a new FG, but not always.
  3. Container runtime missing APIs, like for user namespaces support. This is also related to the UserNamespace capability. And the k8s expectation is that soon ALL nodes will support UserNamespaces. So the capability has a lifetime bound to the FG.

We discussed that the examples above can often be solved for individual vendors (which control the list of enabled FGs per node version) by introducing a semver-based node selector. But capabilities for sure provide a way better API for this.

I would suggest in this KEP:

  1. Remove any mention of non-FG related capabilities from the readme, unless there is a good example that can be articulated, with an explanation of why a capability is a good fit there.
  2. Add a note that a capability is part of the API and requires API review. In the k/k codebase we will need to protect the list of capabilities with the api-approvers OWNERS file.
  3. Unless we find good examples, let's state that capabilities have a lifetime and we do not expect any long-lived capability.
  4. If we are limiting capabilities to feature gate related features, maybe we should rename capabilities to featureGates to avoid reusing it for long-term capabilities.

We also discussed that capabilities must be applied to DaemonSets with no exceptions.

I also want to see something explaining how capabilities and Cluster Autoscaler will work together.

@pravk03
Author

pravk03 commented Jun 17, 2025

Thanks a lot @SergeyKanzhelev for the discussion and the feedback.

I am okay with most of the above suggestions and I will address them in the KEP. I have some thoughts regarding naming.

If we are limiting capabilities to feature gate related features, maybe we should rename capabilities to featureGates to avoid reusing it for long-term capabilities.

  • While our initial examples used to demonstrate the functionality are tied to feature gates, renaming it to featureGates would be too restrictive for future use cases.
  • A capability should represent a logical use case, which could be enabled by a single feature gate, but it could also be a combination of multiple feature gates plus specific configurations. featureGates wouldn't accurately represent such capabilities.

I am definitely open to naming suggestions, but I believe the name should be broad enough to accommodate future use-cases without requiring a new API field down the road.

@pravk03 pravk03 force-pushed the node-capabilities branch 3 times, most recently from ead37d7 to 419d78a Compare June 18, 2025 02:59
@ajaysundark

  1. Swap is not a good fit for capabilities as a way to discover that the Node has swap configured. Even though it may be an easy way to "avoid API review and introduce a capability", a capability is very limited in its functionality. For swap discoverability, swap-specific node status (allocatables?) is a better option.

Referring to my earlier reply on this discussion comment:

For swap, a node capability is much needed for 'placement control', to ensure that a latency-sensitive pod is never scheduled on a swap-enabled node.
Scheduler control for swap needs answers to two questions:

  1. whether workload needs swap (new api for swap preference from pod-spec)
  2. whether node is swap configured

A swap-capability will provide the signal for (2), allowing for simple and clear scheduling rules.
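
To make the rule concrete, here is a minimal Go sketch of how the two answers could combine into a placement decision. Everything in it is an assumption for illustration: the `example.io/swap` key, the `AllowsSwap` pod preference, and the `Fits` helper are hypothetical and are not defined by the KEP or by any existing Kubernetes API.

```go
package main

import "fmt"

// SwapCapability is a hypothetical capability key; the KEP does not define it.
const SwapCapability = "example.io/swap"

// Node is a hypothetical, minimal stand-in for a node and its reported capabilities.
type Node struct {
	Name         string
	Capabilities map[string]string
}

// Pod is a hypothetical, minimal stand-in for a pod.
type Pod struct {
	Name string
	// AllowsSwap is an assumed pod-level swap preference (question 1).
	AllowsSwap bool
}

// nodeHasSwap answers question 2 from the node's reported capabilities.
// Reading a missing key (or a nil map) yields "", so unreported means no swap.
func nodeHasSwap(n Node) bool {
	return n.Capabilities[SwapCapability] == "true"
}

// Fits reports whether the pod may be placed on the node: pods that do not
// allow swap are kept off swap-enabled nodes, all other pairings are accepted.
func Fits(p Pod, n Node) bool {
	return p.AllowsSwap || !nodeHasSwap(n)
}

func main() {
	nodes := []Node{
		{Name: "node-a", Capabilities: map[string]string{SwapCapability: "true"}},
		{Name: "node-b"},
	}
	latencySensitive := Pod{Name: "latency-sensitive", AllowsSwap: false}
	for _, n := range nodes {
		fmt.Printf("%s on %s: fits=%v\n", latencySensitive.Name, n.Name, Fits(latencySensitive, n))
	}
}
```

This only covers keeping swap-averse pods off swap-enabled nodes; whether a pod that wants swap should be restricted to swap-enabled nodes is a separate policy choice the sketch does not make.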

Alternatives like 'NFD' exist for detecting swap on a node, but NFD is out-of-tree and not aware of the Kubelet's specific swap configuration.

@SergeyKanzhelev
Member

I am definitely open to naming suggestions, but I believe the name should be broad enough to accommodate future use-cases without requiring a new API field down the road.

Can we have some examples listed that would justify this? Right now the KEP suggests using it for FG-related capabilities, while not giving good examples where it would be non-FG related.

@pravk03 pravk03 force-pushed the node-capabilities branch from 419d78a to 5fb093d Compare June 18, 2025 17:45
@macsko
Member

macsko commented Jun 18, 2025

The scheduling part looks good for alpha
/approve as SIG Scheduling

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: macsko, pravk03
Once this PR has been reviewed and has the lgtm label, please ask for approval from wojtek-t and additionally assign dchen1107 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@pravk03
Author

pravk03 commented Jun 18, 2025

Can we have some examples listed that would justify this? Right now the KEP suggests using it for FG-related capabilities, while not giving good examples where it would be non-FG related.

The guaranteedQOSPodCPUResize example used in the KEP isn't purely a feature gate; it's a logical capability derived from a combination of feature gates and the Kubelet's cpuManagerPolicy configuration.
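
As a rough illustration of what "derived from a combination" could look like, here is a minimal Go sketch in which a capability is reported only when every required gate is enabled and a configuration predicate holds. The `KubeletConfig`, `CapabilityRule`, and `DeriveCapabilities` names, the two gate names, and the choice of the `static` policy as the condition are all assumptions made for the example; the actual derivation rule is not specified in this thread.

```go
package main

import "fmt"

// KubeletConfig is a hypothetical, trimmed-down view of kubelet configuration.
type KubeletConfig struct {
	FeatureGates     map[string]bool
	CPUManagerPolicy string
}

// CapabilityRule is a hypothetical description of how one logical capability
// is derived from feature gates plus configuration.
type CapabilityRule struct {
	Name          string
	RequiredGates []string
	// ConfigOK is an optional extra condition on the configuration.
	ConfigOK func(KubeletConfig) bool
}

// Supported reports whether all required gates are on and the config condition holds.
func (r CapabilityRule) Supported(c KubeletConfig) bool {
	for _, g := range r.RequiredGates {
		if !c.FeatureGates[g] {
			return false
		}
	}
	return r.ConfigOK == nil || r.ConfigOK(c)
}

// DeriveCapabilities returns the name/value pairs a node would report.
func DeriveCapabilities(c KubeletConfig, rules []CapabilityRule) map[string]string {
	out := map[string]string{}
	for _, r := range rules {
		if r.Supported(c) {
			out[r.Name] = "true"
		}
	}
	return out
}

// exampleRule uses the capability name from the discussion; the gate names
// and the policy condition are placeholders, not the real rule.
var exampleRule = CapabilityRule{
	Name:          "feature.kubernetes.io/guaranteedQOSPodCPUResize",
	RequiredGates: []string{"HypotheticalGateA", "HypotheticalGateB"},
	ConfigOK:      func(c KubeletConfig) bool { return c.CPUManagerPolicy == "static" },
}

func main() {
	cfg := KubeletConfig{
		FeatureGates:     map[string]bool{"HypotheticalGateA": true, "HypotheticalGateB": true},
		CPUManagerPolicy: "static",
	}
	fmt.Println(DeriveCapabilities(cfg, []CapabilityRule{exampleRule}))
}
```

With this shape, a single name/value pair hides the gate-plus-config combination from the scheduler, which is the sense in which the capability is "logical" rather than a mirror of one feature gate.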

While this is still in its early stages, this recent discussion about making the pod requirement for exclusive resources more explicit also indicates a need for non-FG capabilities. The API field itself should be forward-facing enough to support such potential use-cases?

@SergeyKanzhelev
Member

Can we have some examples listed that would justify this? Right now the KEP suggests using it for FG-related capabilities, while not giving good examples where it would be non-FG related.

The guaranteedQOSPodCPUResize example used in the KEP isn't purely a feature gate; it's a logical capability derived from a combination of feature gates and the Kubelet's cpuManagerPolicy configuration.

While this is still in its early stages, this recent discussion about making the pod requirement for exclusive resources more explicit also indicates a need for non-FG capabilities. The API field itself should be forward-facing enough to support such potential use-cases?

Those are all examples of FG-related capabilities, not generic long-term capabilities.

@pravk03 pravk03 force-pushed the node-capabilities branch from 5fb093d to a3e1436 Compare June 18, 2025 20:36
@tallclair
Member

It seems like most of the concerns with this are around the specific capabilities being added, but this KEP doesn't actually propose adding any capabilities. The examples given are hypothetical examples based on features currently in development, but no new features will be able to depend on capabilities until it goes to beta. This creates a bit of a chicken-and-egg situation, where it's hard to point to exactly how capabilities will be used until we have users lined up, but we can't line up users yet.

@SergeyKanzhelev
Member

SergeyKanzhelev commented Jun 18, 2025

It seems like most of the concerns with this are around the specific capabilities being added, but this KEP doesn't actually propose adding any capabilities. The examples given are hypothetical examples based on features currently in development, but no new features will be able to depend on capabilities until it goes to beta. This creates a bit of a chicken-and-egg situation, where it's hard to point to exactly how capabilities will be used until we have users lined up, but we can't line up users yet.

We kind of need to know what the expected use cases will be. Maybe past examples or hypothetical examples thought thru end-to-end. Right now this KEP is limited to just a set of name/value pairs and a scenario of FG discoverability. But we are already thinking there MAY be a need to support capabilities for node selection, the ability to declare tolerations for capabilities, and the ability to have node-restricted capabilities. Knowing the scope would help us understand whether the proposed API is needed (among alternatives, if the set of use cases is limited) and, if it is needed, what shape it should have.

@pravk03 pravk03 force-pushed the node-capabilities branch from a3e1436 to f069f62 Compare June 18, 2025 23:55
@pravk03 pravk03 force-pushed the node-capabilities branch 2 times, most recently from a3dd053 to 8d6230d Compare June 19, 2025 07:35
@pravk03 pravk03 force-pushed the node-capabilities branch from 8d6230d to cd6d67e Compare June 19, 2025 17:18
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 19, 2025
@pravk03
Author

pravk03 commented Jun 19, 2025

Maybe past examples or hypothetical examples thought thru end-to-end

RuntimeClass was intended as a past example to illustrate non-FG related runtime capabilities in the earlier version of the proposal. I agree that it had some missing details, and thanks for highlighting them in your comment.

  1. Runtime handlers as a list of handlers is also not a good fit. The default handler runc is not specified in the pod spec, so it will not be used by the scheduler and by definition must not be added to capabilities. Non-default handlers may need more details on what they are, and the list of names may not fit into the value length limits. A special object representing the runtime is a better choice here.

I have tried to address these in the Case Study section.

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lead-opted-in Denotes that an issue has been opted in to a release ok-to-test Indicates a non-member PR verified by an org member that is safe to test. sig/cli Categorizes an issue or PR as relevant to SIG CLI. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
Status: Needs Triage