[WIP] Extending the semantics of nominated node name #5329


Draft: wojtek-t wants to merge 2 commits into master from nominationplusplus

Conversation

wojtek-t (Member, Author):

  • One-line PR description:
  • Issue link:
  • Other comments:

@k8s-ci-robot added the do-not-merge/work-in-progress (Indicates that a PR should not merge because it is a work in progress), cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA), kind/kep (Categorizes KEP tracking issues and PRs modifying the KEP directory), and sig/scheduling (Categorizes an issue or PR as relevant to SIG Scheduling) labels on May 22, 2025.
@k8s-ci-robot requested review from macsko and palnabarun on May 22, 2025 09:09.
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: wojtek-t
Once this PR has been reviewed and has the lgtm label, please assign sanposhiho for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-project-automation (bot) moved this to Needs Triage in SIG Scheduling on May 22, 2025.
@wojtek-t marked this pull request as draft on May 22, 2025 09:09.
@k8s-ci-robot added the size/XXL label (Denotes a PR that changes 1000+ lines, ignoring generated files) on May 22, 2025.
@@ -205,21 +205,22 @@ based on the expected pod placement.

### External components want to specify a preferred pod placement

The cluster autoscaler or Kueue internally calculates the pod placement,
and create new nodes or un-gate pods based on the calculation result.
The ClusterAutoscaler or Karpenter internally calculate the pod placement,
wojtek-t (Member, Author):

@sanposhiho - there are multiple changes including:

  • small updates to motivation and user stories
  • rewrite of risks section based on the end state we propose
  • discussion of new risk
  • addressing some comments discussed on the original PR

I think that you should take adjustments from my commit that you think make sense and just incorporate them to your PR and I will close this one after that.

be treated as a hint to scheduler and not as an official scheduler's plan of record
- if `NominatedNodeName` was set by an external component (based on ManagedFields), this
component should clear or update it (using PUT or APPLY to ensure it won't conflict with
potential update from the scheduler) to reflect the new hint
wojtek-t (Member, Author):

@dom4ha - in the context of the different phases of reservation/allocation of resources that we were talking about, I think that what I'm proposing above paves the way towards an additive extension in the future.

With such an approach, we're describing how the semantics should work, and eventually we can add a stage field that would reflect that more explicitly.
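For illustration, here is a minimal sketch (assuming client-go's generated apply configurations; the field-manager name "external-nominator" is hypothetical) of how an external component could publish or refresh its hint on the status subresource with server-side apply, so a concurrent update from the scheduler surfaces as a conflict instead of being silently overwritten:

```go
package nominate

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	corev1apply "k8s.io/client-go/applyconfigurations/core/v1"
	"k8s.io/client-go/kubernetes"
)

// setNominatedNodeName applies only status.nominatedNodeName, owned by this
// component's field manager. Re-applying with a different node updates the hint;
// applying without the field clears it (server-side apply removes fields the
// manager previously owned but no longer sends).
func setNominatedNodeName(ctx context.Context, cs kubernetes.Interface, namespace, podName, nodeName string) error {
	pod := corev1apply.Pod(podName, namespace).
		WithStatus(corev1apply.PodStatus().
			WithNominatedNodeName(nodeName))
	_, err := cs.CoreV1().Pods(namespace).ApplyStatus(ctx, pod, metav1.ApplyOptions{
		FieldManager: "external-nominator", // hypothetical manager name
		Force:        false,                // surface conflicts instead of stealing ownership
	})
	return err
}
```

With Force set to false, a conflicting write from the scheduler's field manager fails the apply, which is the "won't conflict with a potential update from the scheduler" property mentioned above.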

Comment on lines 365 to 366
- [to be confirmed during API review] we fail the validation when someone attempts to set
`NNN` for an already bound pod
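Purely as an illustration of what the rule above could look like if it survived API review (a hypothetical helper, not an agreed-upon change; the thread below debates whether to adopt it at all):

```go
package validation

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/validation/field"
)

// validateNNNForBoundPod sketches the proposed check: once a pod is bound
// (spec.nodeName is set), reject attempts to set or change NominatedNodeName.
func validateNNNForBoundPod(oldPod, newPod *corev1.Pod) field.ErrorList {
	var errs field.ErrorList
	bound := oldPod.Spec.NodeName != ""
	changed := newPod.Status.NominatedNodeName != oldPod.Status.NominatedNodeName
	if bound && changed && newPod.Status.NominatedNodeName != "" {
		errs = append(errs, field.Forbidden(
			field.NewPath("status", "nominatedNodeName"),
			"may not be set or changed once the pod is bound to a node"))
	}
	return errs
}
```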
@sanposhiho (Member), May 22, 2025:

Do we actually need this? From our perspective it does not make sense to set NNN for bound pods, but I'm afraid this validation would just break some external components unnecessarily: ones that already use NNN and have a (weird?) behavior of setting NNN for bound pods for some purpose.
I'm not sure we should take such a weird case into account when we're not even sure it exists; at the same time, I'm not sure we should explicitly try to ban those API operations on purpose from now on.

@sanposhiho (Member), May 22, 2025:

Probably, what I wanted to say is: do we gain any advantage from doing this that is worth the risk of breaking existing components that might be out there? If someone does this, then it might make sense to them for their specific use case. And it's just their fault that, as a result, NNN vs NodeName could be different and might be confusing. But they should already know that, and still do it on purpose for some reason. I'm not sure we should explicitly forbid it from now on.

wojtek-t (Member, Author):

My motivation was that it's misleading when NodeName and NominatedNodeName are different, no matter when they were set. This second part was addressing the case when NNN is set later.

I'm fine with saying that it's such a niche thing that we don't do that for the reasons you mentioned.

@sanposhiho (Member), May 22, 2025:

My point is that if NNN is set later, that might be because of some use case that we haven't considered. Of course, we can say NNN for such unknown use cases isn't supported, and just ban such requests for bound pods. But I guess we don't have a strong reason to do that deliberately. So I think we can hold off on the second one for now.

OTOH, doing the first one sounds OK, in order to reduce the side effect caused by this KEP. Because, after this proposal, we believe more components will start using NNN, and, as a result, more pods could eventually be bound to nodes different from NNN, which is the "side effect" I meant.

wojtek-t (Member, Author):

I went ahead and already adjusted it accordingly.

should try to modify it
- if `NominatedNodeName` was set by an external component (based on ManagedFields), it should
be treated as a hint to scheduler and not as an official scheduler's plan of record
- if `NominatedNodeName` was set by an external component (based on ManagedFields), this
sanposhiho (Member):

+1
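For reference, a rough sketch (illustrative only, not a mechanism mandated by the KEP) of how a component could consult pod.ObjectMeta.ManagedFields to see which field manager currently owns status.nominatedNodeName before deciding whether it may touch it:

```go
package nominate

import (
	"encoding/json"

	corev1 "k8s.io/api/core/v1"
)

// nnnOwner returns the field manager that currently owns status.nominatedNodeName,
// or "" if nobody owns it yet. If several managers co-own the field, the first one
// found is returned; a real implementation would need to decide how to handle that.
func nnnOwner(pod *corev1.Pod) (string, error) {
	for _, mf := range pod.ManagedFields {
		if mf.FieldsV1 == nil {
			continue
		}
		var fields map[string]json.RawMessage
		if err := json.Unmarshal(mf.FieldsV1.Raw, &fields); err != nil {
			return "", err
		}
		statusRaw, ok := fields["f:status"]
		if !ok {
			continue
		}
		var status map[string]json.RawMessage
		if err := json.Unmarshal(statusRaw, &status); err != nil {
			return "", err
		}
		if _, owned := status["f:nominatedNodeName"]; owned {
			return mf.Manager, nil
		}
	}
	return "", nil
}
```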

- introducing a new field (e.g. `NodeNameHint` or sth like that for that purpose)
- using `pod.ObjectMeta.ManagedFields` to detect who set the field

Given that, even as of now arbitrary component or actor can already set `NNN`, we suggest to
@sanposhiho (Member), May 22, 2025:

I keep arguing in GH threads and Slack DMs, but I still don't see why we need to distinguish "who" set NNN right now. I don't see any use case or problem that requires us to be conscious of who set NNN (and I've asked what use case/problem you both are seeing). And, unless we find one, I'd prefer to simply treat NNN equally, regardless of who set it.

First point: you said "NominatedNodeName was set by scheduler, no external component should try to modify it". Why does the scheduler have to be special? IMO, no external component should try to modify NNN set by others, regardless of who set it. Otherwise, two external components could keep overwriting each other's NNN.

Second point: "it should be treated as a hint to scheduler and not as an official scheduler's plan of record". What does it specifically mean? "Should be treated" by whom? External components?
Does it mean, for example, that CA is allowed to delete a node that pending pods have in NNN if NNN was set by an external component, because NNN is just a hint in that case? (If yes,) what if the external component is trying to do something similar to preemption, that is, taking some time to make the node available for the pod with NNN?

My alternative rules here are:

  • Regardless of who set NNN for what purpose, external components shouldn't update/clear NNN set by others.
    • Only the scheduler is allowed to overwrite any NNN, in case of preemption or at the beginning of binding cycles. This is based on the assumption that the reliability of the scheduler's preemption decision is higher than other components' hints because the scheduler always simulates the scheduling perfectly. But, if we don't agree on this assumption, then I'm even OK not to execute the preemptions when a pod has NNN, so that even the preemption isn't allowed to overwrite NNN from other components.
  • Anyone (including the scheduler) who set NNN on the pod should be responsible for clearing NNN or updating it with a new hint.
  • Regardless of who set NNN for what purpose, NNN readers (e.g., the cluster autoscaler when it's trying to scale down the nodes) have to always take NNN into consideration.

wojtek-t (Member, Author):

First point, you said "NominatedNodeName was set by scheduler, no external component should try to modify it". Why the scheduler has to be special; IMO, no external component should try to modify NNN, regardless of who set it. Otherwise, two external components could keep trying to update NNN over the other's NNN.

I should have said: "no external component should try to modify NNN if it was set by someone else".
I agree with that.
But any component can modify it if it was previously set by them.

Second point, it should be treated as a hint to scheduler and not as an official scheduler's plan of record. What does it specifically mean? "Should be treated" by who? External components?

By everyone (including the scheduler itself).
In the end it's the scheduler that makes the final decision, and until then the NNN field can contain arbitrary garbage.

Does it mean, for example, CA is allowed to delete a node that pending pods have on NNN if NNN is set by the external component because NNN is just a hint in that case? (If Yes,) What if the external component is trying to do something similar to the preemption, that is, taking some time to make the node available for the pod with NNN?

We're entering the playground of multiple schedulers with this (if other components can do preemption).
I wanted to avoid that.

My alternative rules here are: [...]

I think we're both saying a very similar thing:

  1. You're saying that external components can't update/clear NNN.
    I'm saying there is an exception: if it's you who set it, then you can change it later.
    But you're saying that in your second point - so we're saying the same thing.

  2. You're saying that the scheduler can change it anytime - this was implicit in my point.
    So again, we're saying the same thing.

  3. You're saying that regardless of who sets it, it should be treated the same.

This seems to be the only real difference. If it was set by the scheduler, we know that the scheduler believes it can reach that state. If it was set by some other component, we may not even have a path to get there.

So it boils down to the question of use cases. I think the future use case is "external schedulers".
We want to allow external schedulers to make scheduling decisions, but at the same time, to make it really work, we need to provide a coordination mechanism.

If I have scheduler A putting pod X on node N and at the same time scheduler B putting pod Y on the same node N, then assuming those two can't fit together, one of them will need to be rejected.
Now let's assume that there is a lower-priority pod already running on that node. Someone would need to preempt it. To ensure some coordination, that would be the scheduler.
So, depending on who put the nomination, the difference is:

  • if it's an external component - it's a proposal, and I want the scheduler to preempt the other pod for me
  • if it's the scheduler - it's effectively my decision for now, and I'm already preempting the pods

However, as I wrote in 192a9a1#r2102065375, I think that by introducing these rules (and I think we're effectively saying the same thing), we're making it possible to extend this in the future in an additive way.

I like your wording though. I will try to somehow emphasize the difference and switch to your wording today/tomorrow.

wojtek-t (Member, Author):

So I went ahead and tried to update it - PTAL

@sanposhiho (Member), May 22, 2025:

I think the future use case is "external schedulers".
...
If I have scheduler A putting pod X on node N and at the same time scheduler B putting pod Y on the same node N, then assuming those two can't fit together, one of them will need to be rejected.

This "external schedulers" use case is actually close to Kueue use case, isn't it?

  1. If they have to force the main scheduler to pick up their decision, they have to use a required node selector with NNN, and then un-gate pods.
  2. If they're ok with the main scheduler changing the decision if NNN isn't valid, then they can just set NNN only.

And, regardless of whether NNN is set by the scheduler (which performed the preemption for the pod) or by an external scheduler, the scheduling cycle simply ignores/honors those NNNs based on the pod's priority.

If there's no space on the NNN node, the scheduler proceeds to the preemption. As mentioned above, if an "external scheduler" needs the scheduler to pick up their NNN as the final result, then they must put a node selector so that the preemption tries to make space on the nominated/selected node. In this way, it satisfies your example scenario: if it's an external component - it's a proposal, and I want the scheduler to preempt the other pod for me.

So, TL;DR, I don't see "who sets NNN" as necessary for your "external scheduler" use case either. It's all about the pod priority; the scheduling cycle doesn't (have to) know who NNN is from.

wojtek-t (Member, Author):

This "external schedulers" use case is actually close to Kueue use case, isn't it?

It's a semantic difference - let's say that I'm fine with the scheduler changing my decision. But the difference is:

  1. if an external scheduler set NNN - it's a hint and nothing has happened yet
  2. if it was set by the scheduler - the preemption was already triggered to make room for it

So if I care to see in what place of the state machine I am, I'm not able to do that.

As I said before - I think it's fine for now and it's possible to fix that in an additive way.
But that difference may be important (as a scheduler - should I still do a preemption, or am I only waiting for things).

sanposhiho (Member):

(The KEP change on this part now looks great. So, my comment here is just continuing the discussion to be on the same page with you for the long-term goal, not trying to ask you to change things more on KEP.)

So if I care to see in what place of the state machine I am, I'm not able to do that.

Right, you couldn't, but I still don't understand why the scheduling cycle has to behave differently for those two cases, based on who set NNN.

should I still do a preemption, or am I only waiting for things

For what use cases does the scheduler have to wait for things, skipping the preemption? IMO, regardless of (1) or (2), the scheduler should always try a preemption if the scheduling cycle fails. What possibility are you seeing there? Why would the scheduler want to skip the preemption?


Let me explain what I'm imagining about the external scheduler use case, so that you can see the difference between my idea and yours:

  1. Users want to have custom scheduling. Here, let's say TAS. And they have a sub-scheduler that computes TAS.
  2. When new pods are created, they would probably have to gate those pods first to prevent the main scheduler from scheduling them before the sub-scheduler puts NNN. The sub-scheduler puts NNN based on TAS and un-gates the pods.
    1. The alternative idea is to use SchedulerName on PodSpec. New pods are first created with SchedulerName: sub-scheduler, the sub-scheduler puts NNN based on TAS, and it changes it to SchedulerName: default-scheduler. (Not sure the SchedulerName field is mutable now, though.)
  3. The main scheduler starts to schedule pods with NNN, and pods may or may not be scheduled onto the NNN node. Let's say the pods are unschedulable, to explore the unhappy path.
  4. Here, there are different use cases:
    1. [If the sub-scheduler is OK with the main scheduler eventually picking a node different from NNN] NNN is best effort, like we discussed on this KEP. So, in this case, the sub-scheduler doesn't have to care about the pods after un-gating them. Pods with NNN may or may not go to the NNN nodes.
    2. [If the sub-scheduler doesn't want the main scheduler to pick nodes different from NNN] In this case, first of all, the sub-scheduler has to put a required NodeSelector in addition to NNN. Then, the main scheduler would keep trying to schedule the pods on the NNN node, maybe trying the preemption on the node as well. The sub-scheduler may want to recompute TAS and change NNN/NodeSelector once it notices the main scheduler has failed to schedule the pods. Here, we need some improvement, because NodeSelector is immutable after un-gating pods. We need to either somehow allow pods to be re-gated after being un-gated, or somehow allow NodeSelector to be mutable.

In this flow, I don't see the need for "who set NNN". The main scheduler would always try to schedule pods based on the scheduling constraints and NNN, and would also try to preempt pods. The sub-scheduler would always try to put NNN (sometimes with NodeSelector), and may want to keep updating them if the main scheduler fails.
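A sketch of steps 2 and 4-ii above under assumed names (the scheduling gate "example.com/tas" and the client wiring are hypothetical): the sub-scheduler tightens placement with a required node selector while the pod is still gated, removes its gate so the main scheduler picks the pod up, and records the hint in status.nominatedNodeName:

```go
package subscheduler

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// nominate pins the sub-scheduler's placement decision for a gated pod and
// hands the pod over to the main scheduler.
func nominate(ctx context.Context, cs kubernetes.Interface, pod *corev1.Pod, node string) error {
	p := pod.DeepCopy()

	// While the pod is still gated, the node selector may be tightened.
	if p.Spec.NodeSelector == nil {
		p.Spec.NodeSelector = map[string]string{}
	}
	p.Spec.NodeSelector["kubernetes.io/hostname"] = node
	p, err := cs.CoreV1().Pods(p.Namespace).Update(ctx, p, metav1.UpdateOptions{})
	if err != nil {
		return err
	}

	// Remove our (hypothetical) scheduling gate so the main scheduler starts working on the pod.
	var gates []corev1.PodSchedulingGate
	for _, g := range p.Spec.SchedulingGates {
		if g.Name != "example.com/tas" {
			gates = append(gates, g)
		}
	}
	p.Spec.SchedulingGates = gates
	p, err = cs.CoreV1().Pods(p.Namespace).Update(ctx, p, metav1.UpdateOptions{})
	if err != nil {
		return err
	}

	// NominatedNodeName lives in status, so publish the hint via the status
	// subresource (this requires update permission on pods/status).
	p.Status.NominatedNodeName = node
	_, err = cs.CoreV1().Pods(p.Namespace).UpdateStatus(ctx, p, metav1.UpdateOptions{})
	return err
}
```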

wojtek-t (Member, Author):

(The KEP change on this part now looks great. So, my comment here is just continuing the discussion to be on the same page with you for the long-term goal, not trying to ask you to change things more on KEP.)

Great to hear we're aligned here :)
Would it make sense for you to take the commit and apply it to your PR?

Let me explain what I'm imagining about the external scheduler use case, so that you can see the difference between my idea and yours:

In the example that you describe, it makes sense.
The example that I'm thinking about is to allow a more interactive approach:

  • the external scheduler sets NNN, but more in the form of "would this work?" (a new state in the state machine)
  • if it works, kube-scheduler proceeds with it; otherwise it rejects it (instead of finding a new placement itself) and lets the external scheduler try to find a new placement again

We can use NNN for it, but we need another bit for the "state machine" there in the API. And this is what would matter. So I agree that it kind of doesn't matter who set it - it's the state in the state machine that matters.
And if at any point we decide to proceed with an idea like this, we would just expose this state machine as a new field.
So the more I think about it, the more I agree that it doesn't matter who set it. What may matter is some additional state machine - but we can think about that as a separate feature.

sanposhiho (Member):

if it works, kube-scheduler proceeds with it; otherwise it rejects it (instead of finding a new placement itself) and lets the external scheduler try to find a new placement again

In my flow, it's rejected because of the required node selector (described at 4-ii above); the external scheduler should then be able to notice the scheduling failure simply via the pod's PodScheduled: false condition, and can put a new suggestion on NNN and the node selector.
So, in your words, I'm wondering if the existing PodScheduled: false condition is just enough for the state machine. The external scheduler can, at least, tell whether the suggestion it added to NNN was rejected by the scheduling cycle or not. If there's an advanced use case for which the PodScheduled: false condition is not expressive enough, then yes, we'd need another field to express a more detailed state.
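In code, that signal could look like the following hypothetical helper: the external scheduler treats a PodScheduled=False condition on a still-unbound pod as "my suggestion was not (yet) satisfied" and recomputes its hint:

```go
package subscheduler

import corev1 "k8s.io/api/core/v1"

// nominationRejected reports whether the main scheduler has (so far) failed to
// schedule the pod: it is still unbound and carries a PodScheduled=False condition.
func nominationRejected(pod *corev1.Pod) bool {
	if pod.Spec.NodeName != "" {
		return false // already bound; nothing to recompute
	}
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.PodScheduled && cond.Status == corev1.ConditionFalse {
			return true
		}
	}
	return false
}
```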

wojtek-t (Member, Author):

The PodScheduled condition is an interesting path to explore. I would need to spend more time on it, but it might be enough. That being said - I think that even if it's not, we both agree that it should be possible to extend this in an additive way - which is all we need to ensure at this point.

@sanposhiho (Member) left a review comment:

Thanks for working on the improvement/clarification of the KEP. Overall I agree, except for a few points where I left comments.

@@ -402,21 +444,34 @@ Higher-priority pods can ignore it, but pods with equal or lower priority don't
This allows us to prioritize nominated pods when the nomination was done by external components.
We just need to ensure that, in the case when NominatedNodeName was assigned by an external component, this nomination gets reflected in the scheduler's memory.

TODO: We need to ensure that this works for non-existing nodes too, and that if those nodes never appear, it won't leak memory.
sanposhiho (Member):

How does that scenario cause memory growth or a leak in the scheduler?

wojtek-t (Member, Author):

It's purely an implementation comment.
If a pod has NNN set and that node doesn't exist, we need to ensure that the scheduler cache will be appropriately cleared when NNN changes (i.e. the trigger for deleting the cached information about the node is no longer only node deletion).

sanposhiho (Member):

If a pod has NNN set and the NNN node doesn't exist, the scheduler should do nothing with the cache, i.e., there is no memory increase in the scheduler cache from NNN pointing to non-existing nodes.

wojtek-t (Member, Author):

My point is that we have to cache that information.

Because as soon as the node comes up (the cluster-autoscaler case), we need to know from the beginning that some resources on it are already "booked" via pods nominated to it. Or do you want to search through all pods at that point? That's certainly an option too; it's less performant, but maybe it's still good enough.
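A toy sketch (not kube-scheduler's actual nominator code) of the bookkeeping being discussed here: pods are indexed by their NominatedNodeName even when the node object doesn't exist yet, and the entry is dropped when the nomination changes, so the cleanup trigger is no longer only node deletion:

```go
package nominator

import (
	"sync"

	corev1 "k8s.io/api/core/v1"
)

// nominations indexes pending pods by NominatedNodeName. A key may name a node
// that doesn't exist yet (e.g. one the cluster autoscaler is still provisioning),
// so its resources count as "booked" the moment the node appears.
type nominations struct {
	mu     sync.Mutex
	byNode map[string]map[string]*corev1.Pod // nodeName -> pod UID -> pod
}

func (n *nominations) update(oldPod, newPod *corev1.Pod) {
	n.mu.Lock()
	defer n.mu.Unlock()
	if n.byNode == nil {
		n.byNode = map[string]map[string]*corev1.Pod{}
	}
	// Drop the stale entry when NNN changes or is cleared, even if the old NNN
	// node never existed; otherwise the entry would linger forever.
	if oldPod != nil && oldPod.Status.NominatedNodeName != "" {
		delete(n.byNode[oldPod.Status.NominatedNodeName], string(oldPod.UID))
	}
	if newPod != nil && newPod.Status.NominatedNodeName != "" {
		node := newPod.Status.NominatedNodeName
		if n.byNode[node] == nil {
			n.byNode[node] = map[string]*corev1.Pod{}
		}
		n.byNode[node][string(newPod.UID)] = newPod
	}
}
```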

sanposhiho (Member):

Hmm, okay, that is a good point. I agree with you now; we should keep indexing pods by NNN even if the NNN node doesn't exist, and properly drop that index when NNN is updated/cleared.

sanposhiho (Member):

Regarding this, I checked the implementation and found we don't have to do anything to support NNN for non-existing nodes. I added the explanation at:
97bc5ab

wojtek-t (Member, Author):

Perfect - thank you!

@wojtek-t force-pushed the nominationplusplus branch from 192a9a1 to 2553b15 on May 23, 2025 14:16.
@sanposhiho (Member):

Would it make sense for you to take the commit and apply it to your PR?

Yes, will do it later today.

@@ -239,68 +240,129 @@ If we can keep where the pod was going to go at `NominatedNodeName`, the schedul
Here are all the use cases of NominatedNodeName that we're taking into consideration:
- The scheduler puts it after the preemption (already implemented)
- The scheduler puts it at the beginning of binding cycles (only if the binding cycles involve the PreBind phase)
- The cluster autoscaler puts it after creating a new node for pending pod(s) so that the scheduler can find a place faster when the node is created.
- Kueue uses it to determine a preferred node for the pod based on their internal calculation (Topology aware scheduling etc)
sanposhiho (Member):

You removed the Kueue use case on purpose? I thought Kueue was a good example to show a different use case, though.

wojtek-t (Member, Author):

I removed it on purpose - because that's the use case we don't want to handle with this feature. Keeping it here is misleading, imho.

We may want to point to it somewhere, but we should make it clear that it's not something we're addressing.

@sanposhiho (Member), May 27, 2025:

This is surely the use case we can handle, right?
Both required and preferred TAS can utilize it to increase the likelihood of pods getting to the node that Kueue computes for TAS, because at the point Kueue puts NNN, the scheduler reserves the place and only higher-priority pods can steal it. On the other hand, with the current scheduler (w/o NNN), even if they put a required node selector for a required TAS, the place won't be reserved until the pod is actually handled by the scheduling cycle, and some lower/equal-priority pods can steal the place of the TAS pods in the meantime.

sanposhiho (Member):

Like we discussed at #5329 (comment), maybe ideally we should let Kueue re-compute the place when pods with a required TAS become unschedulable (i.e., Kueue would behave like an external scheduler, which we discussed there).
But I believe the external scheduler idea is a feature that follows on from this one, and specifically from this Kueue use case.

wojtek-t (Member, Author):

Both required and preferred TAS can utilize it to increase the likelihood of pods getting to the node that Kueue computes for TAS, because at the point Kueue puts NNN, the scheduler reserves the place and only higher-priority pods can steal it.

I think this is a communication problem - Kueue needs to communicate its decisions to kube-scheduler, so when the scheduler learns about NNN it can learn about the node selector at the same time and would make the same decision.
But I guess where you may be heading is that once we have workload-aware scheduling, the moment when we learn about the pod may not be the moment when we actually are ready to process it - in which case having NNN set may indeed be beneficial.

So I generally agree with you (especially on the external scheduler idea being an extension of this feature). I guess what I meant is more that, in order to be fully useful for the Kueue use case, we would need a bit of extension - and this KEP on its own doesn't fully address the Kueue use case. But it opens the path towards it.

sanposhiho (Member):

the moment when we learn about the pod may not be the moment when we actually are ready to process it

Yes, this is what I meant, but not only about workload scheduling. Even today, if there are many other pods in activeQ, pods that Kueue set NNN on might not be immediately popped into the scheduling cycles.

wojtek-t (Member, Author):

OK - so I think we're on the same page here

sanposhiho (Member):

And do you still prefer not mentioning it in this KEP? I still think it's worth mentioning, along with the discussion we've had.

Labels: cncf-cla: yes · do-not-merge/work-in-progress · kind/kep · sig/scheduling · size/XXL
Projects
Status: Needs Triage
3 participants