[WIP] Extending the semantics of nominated node name #5329


Draft: wojtek-t wants to merge 2 commits into master from nominationplusplus

Conversation

wojtek-t (Member, Author):

  • One-line PR description:
  • Issue link:
  • Other comments:

@k8s-ci-robot added the do-not-merge/work-in-progress (Indicates that a PR should not merge because it is a work in progress), cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA), kind/kep (Categorizes KEP tracking issues and PRs modifying the KEP directory), and sig/scheduling (Categorizes an issue or PR as relevant to SIG Scheduling) labels on May 22, 2025.
@k8s-ci-robot requested review from macsko and palnabarun on May 22, 2025 09:09.
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: wojtek-t
Once this PR has been reviewed and has the lgtm label, please assign sanposhiho for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-project-automation (bot) moved this to Needs Triage in SIG Scheduling on May 22, 2025.
@wojtek-t marked this pull request as draft on May 22, 2025 09:09.
@k8s-ci-robot added the size/XXL label (Denotes a PR that changes 1000+ lines, ignoring generated files) on May 22, 2025.
@@ -205,21 +205,22 @@ based on the expected pod placement.

### External components want to specify a preferred pod placement

The cluster autoscaler or Kueue internally calculates the pod placement,
and create new nodes or un-gate pods based on the calculation result.
The ClusterAutoscaler or Karpenter internally calculate the pod placement,
wojtek-t (Member, Author):

@sanposhiho - there are multiple changes including:

  • small updates to motivation and user stories
  • rewrite of risks section based on the end state we propose
  • discussion of new risk
  • addressing some comments discussed on the original PR

I think that you should take adjustments from my commit that you think make sense and just incorporate them to your PR and I will close this one after that.

be treated as a hint to scheduler and not as an official scheduler's plan of record
- if `NominatedNodeName` was set by an external component (based on ManagedFields), this
component should clear or update it (using PUT or APPLY to ensure it won't conflict with
potential update from the scheduler) to reflect the new hint
wojtek-t (Member, Author):

@dom4ha - in the context of the different phases of reservation/allocation of resources that we were talking about, I think that what I'm proposing above paves the way towards an additive extension in the future.

With such an approach, we're describing how the semantics should work, and eventually we can add a stage field that would reflect that more explicitly.
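For illustration, here is a minimal sketch (assuming client-go's generated apply configurations; the field-manager name "external-nominator" is hypothetical) of how an external component could publish or refresh its hint on the status subresource with server-side apply, so a concurrent update from the scheduler surfaces as a conflict instead of being silently overwritten:

```go
package nominate

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	corev1apply "k8s.io/client-go/applyconfigurations/core/v1"
	"k8s.io/client-go/kubernetes"
)

// setNominatedNodeName applies only status.nominatedNodeName, owned by this
// component's field manager. Re-applying with a different node updates the hint;
// applying without the field clears it (server-side apply removes fields the
// manager previously owned but no longer sends).
func setNominatedNodeName(ctx context.Context, cs kubernetes.Interface, namespace, podName, nodeName string) error {
	pod := corev1apply.Pod(podName, namespace).
		WithStatus(corev1apply.PodStatus().
			WithNominatedNodeName(nodeName))
	_, err := cs.CoreV1().Pods(namespace).ApplyStatus(ctx, pod, metav1.ApplyOptions{
		FieldManager: "external-nominator", // hypothetical manager name
		Force:        false,                // surface conflicts instead of stealing ownership
	})
	return err
}
```

With Force set to false, a conflicting write from the scheduler's field manager fails the apply, which is the "won't conflict with a potential update from the scheduler" property mentioned above.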

Comment on lines 365 to 366
- [to be confirmed during API review] we fail the validation when someone attempts to set
`NNN` for an already bound pod
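Purely as an illustration of what the rule above could look like if it survived API review (a hypothetical helper, not an agreed-upon change; the thread below debates whether to adopt it at all):

```go
package validation

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/validation/field"
)

// validateNNNForBoundPod sketches the proposed check: once a pod is bound
// (spec.nodeName is set), reject attempts to set or change NominatedNodeName.
func validateNNNForBoundPod(oldPod, newPod *corev1.Pod) field.ErrorList {
	var errs field.ErrorList
	bound := oldPod.Spec.NodeName != ""
	changed := newPod.Status.NominatedNodeName != oldPod.Status.NominatedNodeName
	if bound && changed && newPod.Status.NominatedNodeName != "" {
		errs = append(errs, field.Forbidden(
			field.NewPath("status", "nominatedNodeName"),
			"may not be set or changed once the pod is bound to a node"))
	}
	return errs
}
```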
@sanposhiho (Member), May 22, 2025:

Do we actually need this? From our perspective it does not make sense to set NNN for bound pods, but I'm afraid this validation would just break some external components unnecessarily: ones that already use NNN and have a (weird?) behavior of setting NNN for bound pods for some purpose.
I'm not sure we should take such a weird case into account when we're not even sure it exists; at the same time, I'm not sure we should explicitly try to ban those API operations on purpose from now on.

@sanposhiho (Member), May 22, 2025:

Probably, what I wanted to say is: do we gain any advantage from doing this that is worth the risk of breaking existing components that might be out there? If someone does this, then it might make sense to them for their specific use case. And it's just their fault that, as a result, NNN vs NodeName could be different and might be confusing. But they should already know that, and still do it on purpose for some reason. I'm not sure we should explicitly forbid it from now on.

wojtek-t (Member, Author):

My motivation was that it's misleading when NodeName and NominatedNodeName are different, no matter when they were set. This second part was addressing the case when NNN is set later.

I'm fine with saying that it's such a niche thing that we don't do that for the reasons you mentioned.

@sanposhiho (Member), May 22, 2025:

My point is that if NNN is set later, that might be because of some use case that we haven't considered. Of course, we can say NNN for such unknown use cases isn't supported, and just ban such requests for bound pods. But I guess we don't have a strong reason to do that deliberately. So I think we can hold off on the second one for now.

OTOH, doing the first one sounds OK, in order to reduce the side effect caused by this KEP. Because, after this proposal, we believe more components will start using NNN, and, as a result, more pods could eventually be bound to nodes different from NNN, which is the "side effect" I meant.

wojtek-t (Member, Author):

I went ahead and already adjusted it accordingly.

should try to modify it
- if `NominatedNodeName` was set by an external component (based on ManagedFields), it should
be treated as a hint to scheduler and not as an official scheduler's plan of record
- if `NominatedNodeName` was set by an external component (based on ManagedFields), this
sanposhiho (Member):

+1
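For reference, a rough sketch (illustrative only, not a mechanism mandated by the KEP) of how a component could consult pod.ObjectMeta.ManagedFields to see which field manager currently owns status.nominatedNodeName before deciding whether it may touch it:

```go
package nominate

import (
	"encoding/json"

	corev1 "k8s.io/api/core/v1"
)

// nnnOwner returns the field manager that currently owns status.nominatedNodeName,
// or "" if nobody owns it yet. If several managers co-own the field, the first one
// found is returned; a real implementation would need to decide how to handle that.
func nnnOwner(pod *corev1.Pod) (string, error) {
	for _, mf := range pod.ManagedFields {
		if mf.FieldsV1 == nil {
			continue
		}
		var fields map[string]json.RawMessage
		if err := json.Unmarshal(mf.FieldsV1.Raw, &fields); err != nil {
			return "", err
		}
		statusRaw, ok := fields["f:status"]
		if !ok {
			continue
		}
		var status map[string]json.RawMessage
		if err := json.Unmarshal(statusRaw, &status); err != nil {
			return "", err
		}
		if _, owned := status["f:nominatedNodeName"]; owned {
			return mf.Manager, nil
		}
	}
	return "", nil
}
```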

- introducing a new field (e.g. `NodeNameHint` or sth like that for that purpose)
- using `pod.ObjectMeta.ManagedFields` to detect who set the field

Given that, even as of now arbitrary component or actor can already set `NNN`, we suggest to
@sanposhiho (Member), May 22, 2025:

I keep arguing in GH threads and Slack DMs, but I still don't see why we need to distinguish "who" set NNN right now. I don't see any use case or problem that requires us to be conscious of who set NNN (and I've asked what use case/problem you both are seeing). And, unless we find one, I'd prefer to simply treat NNN equally, regardless of who set it.

First point: you said "NominatedNodeName was set by scheduler, no external component should try to modify it". Why does the scheduler have to be special? IMO, no external component should try to modify NNN set by others, regardless of who set it. Otherwise, two external components could keep overwriting each other's NNN.

Second point: "it should be treated as a hint to scheduler and not as an official scheduler's plan of record". What does it specifically mean? "Should be treated" by whom? External components?
Does it mean, for example, that CA is allowed to delete a node that pending pods have in NNN if NNN was set by an external component, because NNN is just a hint in that case? (If yes,) what if the external component is trying to do something similar to preemption, that is, taking some time to make the node available for the pod with NNN?

My alternative rules here are:

  • Regardless of who set NNN for what purpose, external components shouldn't update/clear NNN set by others.
    • Only the scheduler is allowed to overwrite any NNN, in case of preemption or at the beginning of binding cycles. This is based on the assumption that the reliability of the scheduler's preemption decision is higher than other components' hints because the scheduler always simulates the scheduling perfectly. But, if we don't agree on this assumption, then I'm even OK not to execute the preemptions when a pod has NNN, so that even the preemption isn't allowed to overwrite NNN from other components.
  • Anyone (including the scheduler) who set NNN on the pod should be responsible for clearing NNN or updating it with a new hint.
  • Regardless of who set NNN for what purpose, NNN readers (e.g., the cluster autoscaler when it's trying to scale down the nodes) have to always take NNN into consideration.

wojtek-t (Member, Author):

First point, you said "NominatedNodeName was set by scheduler, no external component should try to modify it". Why the scheduler has to be special; IMO, no external component should try to modify NNN, regardless of who set it. Otherwise, two external components could keep trying to update NNN over the other's NNN.

I should have said: "no external component should try to modify NNN if it was set by someone else".
I agree with that.
But any component can modify it if it was previously set by them.

Second point, it should be treated as a hint to scheduler and not as an official scheduler's plan of record. What does it specifically mean? "Should be treated" by who? External components?

By everyone (including the scheduler itself).
In the end it's the scheduler that makes the final decision, and until then the NNN field can contain arbitrary garbage.

Does it mean, for example, CA is allowed to delete a node that pending pods have on NNN if NNN is set by the external component because NNN is just a hint in that case? (If Yes,) What if the external component is trying to do something similar to the preemption, that is, taking some time to make the node available for the pod with NNN?

We're entering the playground of multiple schedulers with this (if other components can do preemption).
I wanted to avoid that.

My alternative rules here are: [...]

I think we're both saying a very similar thing:

  1. You're saying that external components can't update/clear NNN.
    I'm saying there is an exception: if it's you who set it, then you can change it later.
    But you're saying that in your second point - so we're saying the same thing.

  2. You're saying that the scheduler can change it anytime - this was implicit in my point.
    So again, we're saying the same thing.

  3. You're saying that regardless of who sets it, it should be treated the same.

This seems to be the only real difference. If it was set by the scheduler, we know that the scheduler believes it can reach that state. If it was set by some other component, we may not even have a path to get there.

So it boils down to the question of use cases. I think the future use case is "external schedulers".
We want to allow external schedulers to make scheduling decisions, but at the same time, to make it really work, we need to provide a coordination mechanism.

If I have scheduler A putting pod X on node N and at the same time scheduler B putting pod Y on the same node N, then assuming those two can't fit together, one of them will need to be rejected.
Now let's assume that there is a lower-priority pod already running on that node. Someone would need to preempt it. To ensure some coordination, that would be the scheduler.
So, depending on who put the nomination, the difference is:

  • if it's an external component - it's a proposal, and I want the scheduler to preempt the other pod for me
  • if it's the scheduler - it's effectively my decision for now, and I'm already preempting the pods

However, as I wrote in 192a9a1#r2102065375, I think that by introducing these rules (and I think we're effectively saying the same thing), we're making it possible to extend this in the future in an additive way.

I like your wording though. I will try to somehow emphasize the difference and switch to your wording today/tomorrow.

wojtek-t (Member, Author):

So I went ahead and tried to update it - PTAL

@sanposhiho (Member), May 22, 2025:

I think the future use case is "external schedulers".
...
If I have scheduler A putting pod X on node N and at the same time scheduler B putting pod Y on the same node N, then assuming those two can't fit together, one of them will need to be rejected.

This "external schedulers" use case is actually close to Kueue use case, isn't it?

  1. If they have to force the main scheduler to pick up their decision, they have to use a required node selector with NNN, and then un-gate pods.
  2. If they're ok with the main scheduler changing the decision if NNN isn't valid, then they can just set NNN only.

And, regardless of whether NNN is set by the scheduler (which performed the preemption for the pod) or by an external scheduler, the scheduling cycle simply ignores/honors those NNNs based on the pod's priority.

If there's no space on the NNN node, the scheduler proceeds to the preemption. As mentioned above, if an "external scheduler" needs the scheduler to pick up their NNN as the final result, then they must put a node selector so that the preemption tries to make space on the nominated/selected node. In this way, it satisfies your example scenario: if it's an external component - it's a proposal, and I want the scheduler to preempt the other pod for me.

So, TL;DR, I don't see "who sets NNN" as necessary for your "external scheduler" use case either. It's all about the pod priority; the scheduling cycle doesn't (have to) know who NNN is from.

wojtek-t (Member, Author):

This "external schedulers" use case is actually close to Kueue use case, isn't it?

It's a semantic difference - let's say that I'm fine with the scheduler changing my decision. But the difference is:

  1. if an external scheduler set NNN - it's a hint and nothing has happened yet
  2. if it was set by the scheduler - the preemption was already triggered to make room for it

So if I care to see in what place of the state machine I am, I'm not able to do that.

As I said before - I think it's fine for now and it's possible to fix that in an additive way.
But that difference may be important (as a scheduler - should I still do a preemption, or am I only waiting for things).

sanposhiho (Member):

(The KEP change on this part now looks great. So, my comment here is just continuing the discussion to be on the same page with you for the long-term goal, not trying to ask you to change things more on KEP.)

So if I care to see in what place of the state machine I am, I'm not able to do that.

Right, you couldn't, but I still don't understand why the scheduling cycle has to behave differently for those two cases, based on who set NNN.

should I still do a preemption, or am I only waiting for things

For what use cases does the scheduler have to wait for things, skipping the preemption? IMO, regardless of (1) or (2), the scheduler should always try a preemption if the scheduling cycle fails. What possibility are you seeing there? Why would the scheduler want to skip the preemption?


Let me explain what I'm imagining about the external scheduler use case, so that you can see the difference between my idea and yours:

  1. Users want to have custom scheduling. Here, let's say TAS. And they have a sub-scheduler that computes TAS.
  2. When new pods are created, they would probably have to gate those pods first to prevent the main scheduler from scheduling them before the sub-scheduler puts NNN. The sub-scheduler puts NNN based on TAS and un-gates the pods.
    1. The alternative idea is to use SchedulerName on PodSpec. New pods are first created with SchedulerName: sub-scheduler, the sub-scheduler puts NNN based on TAS, and it changes it to SchedulerName: default-scheduler. (Not sure the SchedulerName field is mutable now, though.)
  3. The main scheduler starts to schedule pods with NNN, and pods may or may not be scheduled onto the NNN node. Let's say the pods are unschedulable, to explore the unhappy path.
  4. Here, there are different use cases:
    1. [If the sub-scheduler is OK with the main scheduler eventually picking a node different from NNN] NNN is best effort, like we discussed on this KEP. So, in this case, the sub-scheduler doesn't have to care about the pods after un-gating them. Pods with NNN may or may not go to the NNN nodes.
    2. [If the sub-scheduler doesn't want the main scheduler to pick nodes different from NNN] In this case, first of all, the sub-scheduler has to put a required NodeSelector in addition to NNN. Then, the main scheduler would keep trying to schedule the pods on the NNN node, maybe trying the preemption on the node as well. The sub-scheduler may want to recompute TAS and change NNN/NodeSelector once it notices the main scheduler has failed to schedule the pods. Here, we need some improvement, because NodeSelector is immutable after un-gating pods. We need to either somehow allow pods to be re-gated after being un-gated, or somehow allow NodeSelector to be mutable.

In this flow, I don't see the need for "who set NNN". The main scheduler would always try to schedule pods based on the scheduling constraints and NNN, and would also try to preempt pods. The sub-scheduler would always try to put NNN (sometimes with NodeSelector), and may want to keep updating them if the main scheduler fails.
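A sketch of steps 2 and 4-ii above under assumed names (the scheduling gate "example.com/tas" and the client wiring are hypothetical): the sub-scheduler tightens placement with a required node selector while the pod is still gated, removes its gate so the main scheduler picks the pod up, and records the hint in status.nominatedNodeName:

```go
package subscheduler

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// nominate pins the sub-scheduler's placement decision for a gated pod and
// hands the pod over to the main scheduler.
func nominate(ctx context.Context, cs kubernetes.Interface, pod *corev1.Pod, node string) error {
	p := pod.DeepCopy()

	// While the pod is still gated, the node selector may be tightened.
	if p.Spec.NodeSelector == nil {
		p.Spec.NodeSelector = map[string]string{}
	}
	p.Spec.NodeSelector["kubernetes.io/hostname"] = node
	p, err := cs.CoreV1().Pods(p.Namespace).Update(ctx, p, metav1.UpdateOptions{})
	if err != nil {
		return err
	}

	// Remove our (hypothetical) scheduling gate so the main scheduler starts working on the pod.
	var gates []corev1.PodSchedulingGate
	for _, g := range p.Spec.SchedulingGates {
		if g.Name != "example.com/tas" {
			gates = append(gates, g)
		}
	}
	p.Spec.SchedulingGates = gates
	p, err = cs.CoreV1().Pods(p.Namespace).Update(ctx, p, metav1.UpdateOptions{})
	if err != nil {
		return err
	}

	// NominatedNodeName lives in status, so publish the hint via the status
	// subresource (this requires update permission on pods/status).
	p.Status.NominatedNodeName = node
	_, err = cs.CoreV1().Pods(p.Namespace).UpdateStatus(ctx, p, metav1.UpdateOptions{})
	return err
}
```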

wojtek-t (Member, Author):

(The KEP change on this part now looks great. So, my comment here is just continuing the discussion to be on the same page with you for the long-term goal, not trying to ask you to change things more on KEP.)

Great to hear we're aligned here :)
Would it make sense for you to take the commit and apply it to your PR?

Let me explain what I'm imagining about the external scheduler use case, so that you can see the difference between my idea and yours:

In the example that you describe, it makes sense.
The example that I'm thinking about is to allow a more interactive approach:

  • the external scheduler sets NNN, but more in the form of "would this work?" (a new state in the state machine)
  • if it works, kube-scheduler proceeds with it; otherwise it rejects it (instead of finding a new placement itself) and lets the external scheduler try to find a new placement again

We can use NNN for it, but we need another bit for the "state machine" there in the API. And this is what would matter. So I agree that it kind of doesn't matter who set it - it's the state in the state machine that matters.
And if at any point we decide to proceed with an idea like this, we would just expose this state machine as a new field.
So the more I think about it, the more I agree that it doesn't matter who set it. What may matter is some additional state machine - but we can think about that as a separate feature.

sanposhiho (Member):

if it works, kube-scheduler proceeds with it; otherwise it rejects it (instead of finding a new placement itself) and lets the external scheduler try to find a new placement again

In my flow, it's rejected because of the required node selector (described at 4-ii above); the external scheduler should then be able to notice the scheduling failure simply via the pod's PodScheduled: false condition, and can put a new suggestion on NNN and the node selector.
So, in your words, I'm wondering if the existing PodScheduled: false condition is just enough for the state machine. The external scheduler can, at least, tell whether the suggestion it added to NNN was rejected by the scheduling cycle or not. If there's an advanced use case for which the PodScheduled: false condition is not expressive enough, then yes, we'd need another field to express a more detailed state.
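In code, that signal could look like the following hypothetical helper: the external scheduler treats a PodScheduled=False condition on a still-unbound pod as "my suggestion was not (yet) satisfied" and recomputes its hint:

```go
package subscheduler

import corev1 "k8s.io/api/core/v1"

// nominationRejected reports whether the main scheduler has (so far) failed to
// schedule the pod: it is still unbound and carries a PodScheduled=False condition.
func nominationRejected(pod *corev1.Pod) bool {
	if pod.Spec.NodeName != "" {
		return false // already bound; nothing to recompute
	}
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.PodScheduled && cond.Status == corev1.ConditionFalse {
			return true
		}
	}
	return false
}
```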

wojtek-t (Member, Author):

The PodScheduled condition is an interesting path to explore. I would need to spend more time on it, but it might be enough. That being said - I think that even if it's not, we both agree that it should be possible to extend this in an additive way - which is all we need to ensure at this point.

@sanposhiho (Member) left a review comment:

Thanks for working on the improvement/clarification of the KEP. Overall I agree, except for a few points where I left comments.

@@ -402,21 +444,34 @@ Higher-priority pods can ignore it, but pods with equal or lower priority don't
This allows us to prioritize nominated pods when the nomination was done by external components.
We just need to ensure that, in the case when NominatedNodeName was assigned by an external component, this nomination gets reflected in the scheduler's memory.

TODO: We need to ensure that this works for non-existing nodes too, and that if those nodes never appear, it won't leak memory.
sanposhiho (Member):

How does that scenario cause memory growth or a leak in the scheduler?

wojtek-t (Member, Author):

It's purely an implementation comment.
If a pod has NNN set and that node doesn't exist, we need to ensure that the scheduler cache will be appropriately cleared when NNN changes (i.e. the trigger for deleting the cached information about the node is no longer only node deletion).

sanposhiho (Member):

If a pod has NNN set and the NNN node doesn't exist, the scheduler should do nothing with the cache, i.e., there is no memory increase in the scheduler cache from NNN pointing to non-existing nodes.

wojtek-t (Member, Author):

My point is that we have to cache that information.

Because as soon as the node comes up (the cluster-autoscaler case), we need to know from the beginning that some resources on it are already "booked" via pods nominated to it. Or do you want to search through all pods at that point? That's certainly an option too; it's less performant, but maybe it's still good enough.
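A toy sketch (not kube-scheduler's actual nominator code) of the bookkeeping being discussed here: pods are indexed by their NominatedNodeName even when the node object doesn't exist yet, and the entry is dropped when the nomination changes, so the cleanup trigger is no longer only node deletion:

```go
package nominator

import (
	"sync"

	corev1 "k8s.io/api/core/v1"
)

// nominations indexes pending pods by NominatedNodeName. A key may name a node
// that doesn't exist yet (e.g. one the cluster autoscaler is still provisioning),
// so its resources count as "booked" the moment the node appears.
type nominations struct {
	mu     sync.Mutex
	byNode map[string]map[string]*corev1.Pod // nodeName -> pod UID -> pod
}

func (n *nominations) update(oldPod, newPod *corev1.Pod) {
	n.mu.Lock()
	defer n.mu.Unlock()
	if n.byNode == nil {
		n.byNode = map[string]map[string]*corev1.Pod{}
	}
	// Drop the stale entry when NNN changes or is cleared, even if the old NNN
	// node never existed; otherwise the entry would linger forever.
	if oldPod != nil && oldPod.Status.NominatedNodeName != "" {
		delete(n.byNode[oldPod.Status.NominatedNodeName], string(oldPod.UID))
	}
	if newPod != nil && newPod.Status.NominatedNodeName != "" {
		node := newPod.Status.NominatedNodeName
		if n.byNode[node] == nil {
			n.byNode[node] = map[string]*corev1.Pod{}
		}
		n.byNode[node][string(newPod.UID)] = newPod
	}
}
```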

sanposhiho (Member):

Hmm, okay, that is a good point. I agree with you now; we should keep indexing pods by NNN even if the NNN node doesn't exist, and properly drop that index when NNN is updated/cleared.

sanposhiho (Member):

Regarding this, I checked the implementation and found we don't have to do anything to support NNN for non-existing nodes. I added the explanation at:
97bc5ab

wojtek-t (Member, Author):

Perfect - thank you!

@wojtek-t force-pushed the nominationplusplus branch from 192a9a1 to 2553b15 on May 23, 2025 14:16.
@sanposhiho (Member):

Would it make sense for you to take the commit and apply it to your PR?

Yes, will do it later today.

@@ -239,68 +240,129 @@ If we can keep where the pod was going to go at `NominatedNodeName`, the schedul
Here are all the use cases of NominatedNodeName that we're taking into consideration:
- The scheduler puts it after the preemption (already implemented)
- The scheduler puts it at the beginning of binding cycles (only if the binding cycles involve the PreBind phase)
- The cluster autoscaler puts it after creating a new node for pending pod(s) so that the scheduler can find a place faster when the node is created.
- Kueue uses it to determine a preferred node for the pod based on their internal calculation (Topology aware scheduling etc)
sanposhiho (Member):

You removed the Kueue use case on purpose? I thought Kueue was a good example to show a different use case, though.

wojtek-t (Member, Author):

I removed it on purpose - because that's the use case we don't want to handle with this feature. Keeping it here is misleading, imho.

We may want to point to it somewhere, but we should make it clear that it's not something we're addressing.

@sanposhiho (Member), May 27, 2025:

This is surely the use case we can handle, right?
Both required and preferred TAS can utilize it to increase the likelihood of pods getting to the node that Kueue computes for TAS, because at the point Kueue puts NNN, the scheduler reserves the place and only higher-priority pods can steal it. On the other hand, with the current scheduler (w/o NNN), even if they put a required node selector for a required TAS, the place won't be reserved until the pod is actually handled by the scheduling cycle, and some lower/equal-priority pods can steal the place of the TAS pods in the meantime.

sanposhiho (Member):

Like we discussed at #5329 (comment), maybe ideally we should let Kueue re-compute the place when pods with a required TAS become unschedulable (i.e., Kueue would behave like an external scheduler, which we discussed there).
But I believe the external scheduler idea is a feature that follows on from this one, and specifically from this Kueue use case.

wojtek-t (Member, Author):

Both required and preferred TAS can utilize it to increase the likelihood of pods getting to the node that Kueue computes for TAS, because at the point Kueue puts NNN, the scheduler reserves the place and only higher-priority pods can steal it.

I think this is a communication problem - Kueue needs to communicate its decisions to kube-scheduler, so when the scheduler learns about NNN it can learn about the node selector at the same time and would make the same decision.
But I guess where you may be heading is that once we have workload-aware scheduling, the moment when we learn about the pod may not be the moment when we actually are ready to process it - in which case having NNN set may indeed be beneficial.

So I generally agree with you (especially on the external scheduler idea being an extension of this feature). I guess what I meant is more that, in order to be fully useful for the Kueue use case, we would need a bit of extension - and this KEP on its own doesn't fully address the Kueue use case. But it opens the path towards it.

sanposhiho (Member):

the moment when we learn about the pod may not be the moment when we actually are ready to process it

Yes, this is what I meant, but not only about workload scheduling. Even today, if there are many other pods in activeQ, pods that Kueue set NNN on might not be immediately popped into the scheduling cycles.

wojtek-t (Member, Author):

OK - so I think we're on the same page here

sanposhiho (Member):

And do you still prefer not mentioning it in this KEP? I still think it's worth mentioning, along with the discussion we've had.

Labels: cncf-cla: yes · do-not-merge/work-in-progress · kind/kep · sig/scheduling · size/XXL
Projects
Status: Needs Triage
3 participants