[KEP-5282] Add KEP for Implicit Tolerations #5389
Conversation
cici37 commented Jun 9, 2025
- One-line PR description:
- Issue link: Implicit tolerations #5282
- Other comments:
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: cici37

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.

/assign @dom4ha @johnbelamaric @sanposhiho

/wg device-management
We will take a look, but very likely we cannot make it by the KEP freeze because our bandwidth is limited and there are other ongoing KEPs.
> ## Proposal
>
> Introduce a new PreEnqueue scheduler plugin (or extend the existing TaintToleration plugin) that:
Isn't PreEnqueue too early? In the case of PrioritizedAlternatives, we may have a tainted GPU listed as only one of the options. If we apply the toleration too early, we could (very theoretically) schedule a pod on a tainted node without using the resource for which the node was tainted.

Alternatively, as mentioned before, tolerations could also be fully implicit (based on a policy configuration similar to DeviceTaintRule) and "applied" invisibly by the TaintToleration plugin during the Filtering phase. However, during the Filtering phase the TaintToleration plugin may not know which devices were allocated by the DRA plugin, so it may be hard to make a decision based on that.

This is why I think that tainting nodes is not the best way to prevent scheduling pods on nodes with scarce resources. Ideally, picking or discarding nodes with scarce unused resources should instead be based on scoring mechanisms, or on filtering that can block some placements based on custom policies. So if we set the taint-and-toleration mechanism aside, such logic could be implemented inside the DRA plugin.
I agree with @sanposhiho, but I also think there will be a ton of code changes in this cycle, which will be very challenging to handle. I'm not convinced that the use case really justifies the effort, as it's not clear to me whether taints and tolerations should be the solution we recommend.
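For concreteness, here is a minimal, hypothetical Go sketch of the Filter-phase alternative described above, written against the kube-scheduler plugin framework. The `implicitTolerationsFor` helper is an assumption, not part of the KEP; how tolerations would actually be derived from a pod's ResourceClaims and DeviceClasses is exactly the open design question in this thread.

```go
package implicittoleration

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	corev1helpers "k8s.io/component-helpers/scheduling/corev1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// ImplicitToleration sketches the "applied invisibly during Filtering"
// alternative: implicit tolerations are merged in memory at Filter time,
// so the pod object is never mutated.
type ImplicitToleration struct{}

var _ framework.FilterPlugin = &ImplicitToleration{}

func (pl *ImplicitToleration) Name() string { return "ImplicitToleration" }

// implicitTolerationsFor is a hypothetical helper: it would derive
// tolerations from the pod's ResourceClaims/DeviceClasses, e.g. via a
// policy similar to DeviceTaintRule. Returning nil keeps the sketch inert.
func implicitTolerationsFor(pod *v1.Pod) []v1.Toleration {
	return nil
}

// Filter rejects nodes whose NoSchedule/NoExecute taints remain untolerated
// even after the implicit tolerations are added to the pod's explicit ones.
func (pl *ImplicitToleration) Filter(ctx context.Context, _ *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	all := append(append([]v1.Toleration{}, pod.Spec.Tolerations...), implicitTolerationsFor(pod)...)
	taint, untolerated := corev1helpers.FindMatchingUntoleratedTaint(nodeInfo.Node().Spec.Taints, all, func(t *v1.Taint) bool {
		return t.Effect == v1.TaintEffectNoSchedule || t.Effect == v1.TaintEffectNoExecute
	})
	if !untolerated {
		return nil
	}
	return framework.NewStatus(framework.UnschedulableAndUnresolvable,
		fmt.Sprintf("node had untolerated taint %s", taint.ToString()))
}
```

Because the implicit tolerations only exist in memory during Filter, this variant side-steps the "PreEnqueue is too early" concern, but it inherits the problem noted above: at Filter time the plugin may not yet know which device the DRA plugin will actually allocate.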
> Introduce a new PreEnqueue scheduler plugin (or extend the existing TaintToleration plugin) that:
>
> - Depends on the DRA PreEnqueue plugin to ensure all ResourceClaims and DeviceClasses are resolved.
What if there are multiple possible resolutions? Will we remove all tolerations so the scheduler can decide later where to schedule?
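As a sketch of what the PreEnqueue dependency in the quoted line could look like, the hypothetical gate below keeps a pod out of the active queue until its device references resolve. Everything here (the `DeviceClaimGate` name, the `claimsResolved` hook) is illustrative rather than from the KEP, and it deliberately does not answer the multiple-resolutions question above: it only defers the pod, leaving any choice among resolutions to later scheduling phases.

```go
package implicittoleration

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// DeviceClaimGate is a hypothetical PreEnqueue plugin that blocks pods
// whose ResourceClaims or DeviceClasses are not yet resolved.
type DeviceClaimGate struct {
	// claimsResolved is a hypothetical hook standing in for the DRA
	// PreEnqueue dependency; it reports whether every ResourceClaim and
	// DeviceClass referenced by the pod exists and is resolved.
	claimsResolved func(pod *v1.Pod) (ok bool, reason string)
}

var _ framework.PreEnqueuePlugin = &DeviceClaimGate{}

func (pl *DeviceClaimGate) Name() string { return "DeviceClaimGate" }

// PreEnqueue keeps the pod out of the active scheduling queue until its
// device references are resolved; it never picks among possible resolutions.
func (pl *DeviceClaimGate) PreEnqueue(ctx context.Context, p *v1.Pod) *framework.Status {
	if ok, reason := pl.claimsResolved(p); !ok {
		return framework.NewStatus(framework.UnschedulableAndUnresolvable, reason)
	}
	return nil
}
```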
The KEP was discussed in the latest SIG Scheduling meeting: #5282 (comment). We'll hold the merge until we reach agreement on the use-case support.