Skip to content

Conversation

@norbertcyran
Copy link
Contributor

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR adds a proposal for support of granular resource limits in node autoscalers.

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-area needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 28, 2025
@k8s-ci-robot
Copy link
Contributor

Hi @norbertcyran. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 28, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: norbertcyran
Once this PR has been reviewed and has the lgtm label, please assign x13n for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@norbertcyran
Copy link
Contributor Author

This proposal was initially discussed within autoscaling SIG in this doc: https://docs.google.com/document/d/1ORj3oW2ZaciROAbTmqBG1agCmP_8B4BqmCNnQAmqmyc/edit?usp=sharing

@ellistarn
Copy link
Contributor

FYI -- there is a karpenter specific proposal for this feature: https://github.com/kubernetes-sigs/karpenter/pull/2525/files#diff-5eac97882a24e1c56d7ac0dc9cd56c6c5d7ca182f5e1344bfe644eee898a5132R23. One of the key challenges mentioned is the "launch before terminate" challenge when doing upgrades and gracefully rolling capacity. This can cause you to get stuck if you're at your limits.

Thoughts on expanding this proposal to reason about this case?

@norbertcyran
Copy link
Contributor Author

FYI -- there is a karpenter specific proposal for this feature: https://github.com/kubernetes-sigs/karpenter/pull/2525/files#diff-5eac97882a24e1c56d7ac0dc9cd56c6c5d7ca182f5e1344bfe644eee898a5132R23. One of the key challenges mentioned is the "launch before terminate" challenge when doing upgrades and gracefully rolling capacity. This can cause you to get stuck if you're at your limits.

Thoughts on expanding this proposal to reason about this case?

From the perspective of the current state of CAS, differentiation between hard and soft limits does not make much sense, as there's no equivalent of Karpenter's node replacement. Scale down only consolidates the node if all pods running on it can be scheduled on other nodes already existing in the cluster. Therefore, functionality-wise, there would be no difference between hard and soft limits. However, such a solution would be more future-proof.

FWIW, in GKE we have surge upgrades: https://docs.cloud.google.com/kubernetes-engine/docs/concepts/node-pool-upgrade-strategies#surge. As they are not handled by CAS, they bypass resource limits, but additionally we don't want the surge nodes to be counted towards the limits at all (i.e. if there exist both the old node and the surge node, we count them as one), so they don't block scale ups. To handle that, in the CAS resource limits implementation we plan to have a NodeFilter interface allowing to filter out certain nodes from the usage calculation: https://github.com/kubernetes/autoscaler/pull/8662/files#diff-03d8f6b8cba668e6137329c4df0e8a979be244ea7c7465a4d8f10ca08849eb5aR12-R14, https://github.com/kubernetes/autoscaler/pull/8662/files#diff-03d8f6b8cba668e6137329c4df0e8a979be244ea7c7465a4d8f10ca08849eb5aR35-R37

Alternatively to the soft and hard limits, I was thinking that something like NodeFilter could be used to bypass the limits during the graceful consolidation, but it doesn't seem perfect either:

  • some users might expect that the autoscaler never exceeds the limits, even temporarily
  • some users might be constrained by external factors, such as IPs, licensing (as mentioned in the doc you linked), so exceeding the limits might be simply not possible
  • user can't control by how much the limits are exceeded. User might want to specify something like "I want to allow exceeding the limit temporarily during the consolidation, but only by x nodes". x could be defined as a difference between soft and hard limits

Having that said, I can see the potential usefulness of having both soft and hard limits, even if it makes the API slightly more complicated.

Would you suggest having an API like that?

...
limits:
  soft:
    resources:
      nodes: 8
  hard:
    resources:
      nodes: 12

@towca @x13n do you have any thoughts about that?

@ellistarn
Copy link
Contributor

  • some users might expect that the autoscaler never exceeds the limits, even temporarily
  • some users might be constrained by external factors, such as IPs, licensing (as mentioned in the doc you linked), so exceeding the limits might be simply not possible
  • user can't control by how much the limits are exceeded. User might want to specify something like "I want to allow exceeding the limit temporarily during the consolidation, but only by x nodes". x could be defined as a difference between soft and hard limits

Agree with these. Personally, I am not sure I am convinced by the soft/hard proposal. As you mention, there are other factors (like IPs) that cause limits as well. I suspect we will be forced to "terminate-before-launch" when at those limits (within PDBs).

I lean towards a solution that treats limits as best-effort, where there are cases (i.e. surge updates) where the limits can temporarily be exceeded. I'm not sure the state of the art today, but when we first released limits in Karpenter, it was best effort due to launch parallelism.

Mostly, I wanted to highlight this problem and get you in touch with the karpenter folks who are thinking about this (@maxcao13, @jmdeal, @jonathan-innis). I want to avoid Karpenter's limit semantics diverging from SIG Autoscaling API standards unless absolutely necessary.

@maxcao13
Copy link
Member

Thanks for the ping.

Yeah, I have a similar proposal for Karpenter itself, since it does not support node limits for non-static capacity: kubernetes-sigs/karpenter#2525

I'm proposing soft/hard limits in a similar way. The API fields are slightly different because of backwards compatibility concerns, but generally the semantics I am agreeing with where soft limits can temporarily exceed the limit, but hard limits definitively constrain nodes from ever going over.

If we can agree on one semantic across both proposals that would be ideal. For now I don't see a reason why it would be necessary to differ.

@norbertcyran
Copy link
Contributor Author

If we can agree on one semantic across both proposals that would be ideal. For now I don't see a reason why it would be necessary to differ.

Yeah, that would be ideal. I think the only difference is that we don't need the distinction between hard and soft limits yet in CAS, but we might need in the future, and I believe it's not a great cost to add it.

Would you agree on the API suggested in #8702 (comment)? I'd prefer that API over limits and hardLimits, as there are no backwards compatibility concerns.

Do you think you'll implement this API at some point in the future on the Karpenter side?

@maxcao13
Copy link
Member

maxcao13 commented Nov 4, 2025

Would you agree on the API suggested in #8702 (comment)? I'd prefer that API over limits and hardLimits, as there are no backwards compatibility concerns.

Well for Karpenter, limits already exist in the spec of NodePools, so it would probably look like:

spec:
  limits:
    soft:
      nodes: 10
    hard:
      nodes: 12
    # Soft limits can still be specified at the top-level for backwards compatibility, 
    # but this approach would no longer be documented.
    cpu: 10

like what was proposed in kubernetes-sigs/karpenter#2525 (comment).. But it doesn't matter too much whether the API looks the same or not, just that intended semantic behaviour is consistent.

This is a question just about the proposal in this PR, Is there a reason why there needs to be a resources field in spec.limits.resources.<resource>?

Do you think you'll implement this API at some point in the future on the Karpenter side?

Would need to defer to the Karpenter reviewers/maintainers on this one :-)
But it's been noted in the PR review that hard limits for Karpenter are valuable for the future in cases for other resources, e.g. GPUs, and current Karpenter limit semantic only means "soft" at the current state.

@ellistarn
Copy link
Contributor

ellistarn commented Nov 4, 2025

Yeah, that would be ideal. I think the only difference is that we don't need the distinction between hard and soft limits yet in CAS, but we might need in the future, and I believe it's not a great cost to add it.

To be clear, hard/soft is @maxcao13's proposal for Karpenter. I am not convinced that making this distinction is the right direction. I am curious to explore https://kubernetes.io/docs/concepts/policy/resource-quotas/#quota-scopes as an answer to the launch before terminate problem. I could imagine customers including or excluding disruption reasons (i.e. drift/underutilized, cc: @jmdeal for commentary) as a way to achieve "soft" quotas.

Do you think you'll implement this API at some point in the future on the Karpenter side?

If the shape of the API works for the Karpenter product experience, it's definitely preferable to support a standard -- I like how it's shaping up :). If we go this route, I could see these AutoscalingResourceQuotas being the preferred mechanism for specifying limits instead of nodepool.spec.limits. Specifically, because this more flexible approach is a good answer to a 3 year old request kubernetes-sigs/karpenter#745. We will, of course, need to maintain support for compatibility, and I imagine some customers may prefer the simplicity of the nodepool scoped API.


```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: AutoscalingResourceQuota
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to keep brainstorming this name. AutoscalingResourceQuota applies specifically to nodes, but Autoscaling is a much more broad domain (i.e. Pod Autoscaling, Volume Autoscaling, etc).

If I remember correctly, we shot down my preferred NodeResourceQuota in a previous doc, but I forget why. We had a similar naming debate on https://github.com/kubernetes/autoscaler/blob/9e22656b32975da37aad7c937031cc8ab1796b91/cluster-autoscaler/proposals/buffers.md (cc: @jbtk). Maybe there's a similar naming solution in CapacityResourceQuota, as we're more or less referring to node.status.capacity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we shot it down, apparently the thread just got lost: https://docs.google.com/document/d/1ORj3oW2ZaciROAbTmqBG1agCmP_8B4BqmCNnQAmqmyc/edit?disco=AAABsMJ0d48

Let's continue the discussion here. Tl;dr from the thread, why I'm leaning more towards AutoscalingResourceQuota:

  • It's explicit that its scope is limited only to autoscaling - the limits will be respected only by node autoscalers, users will still be able to register the nodes manually (side note: I guess it's up to you to decide how it's going to work with Karpenter static node pools)
  • This name is more future-proof if we were to extend this resource to limit DRA devices. With DRA support, NodeResourceQuota would be a bit inaccurate, since devices != nodes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the limits will be respected only by node autoscalers

Do you think that CAPI might ever support this API? The ResourceQuota is used more broadly than Pod Autoscaling, it's general purpose for anything applying pods (and other namespaced resources) to the cluster.

This name is more future-proof if we were to extend this resource to limit DRA devices. With DRA support, NodeResourceQuota would be a bit inaccurate, since devices != nodes

This leans me even further towards CapacityResourceQjuota.

Copy link
Member

@maxcao13 maxcao13 Nov 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My 2 cents, but I think as long as the the crd apiVersion is from autoscaling.x-k8s.io/v1beta1 or something, then something like Capacity/NodeResourceQuota makes sense as to which controllers would respect it, and less confusion on how general purpose the custom resource is. Autoscaling*Quota may be redundant in this sense?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think that CAPI might ever support this API?
I don't know much about CAPI, to be honest. Do you see any potential use cases?

Aren't Capacity and Resources pretty much interchangeable in this context? CapacityResourceQuota sounds a bit redundant to me. Alternatively, InfrastructureQuota or just CapacityQuota would also make sense here. WDYT?

@towca @x13n do you have any preferences here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for CapacityQuota. Came here to suggest it and you already had ;)

selector:
matchLabels:
example.cloud.com/machine-family: e2
limits:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may want to consider hard here to conceptually align with https://kubernetes.io/docs/concepts/policy/resource-quotas/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Especially if we agree on hard/soft limits distinction. Let's see how the discussion goes and let's get back to this one

topology.kubernetes.io/zone: us-central1-b
limits:
resources:
nodes: 64
Copy link
Contributor

@ellistarn ellistarn Nov 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you look at how ResourceQuotas do this? https://kubernetes.io/docs/concepts/policy/resource-quotas/#quota-on-object-count

spec:
  hard:
    cpu: 100
    count/nodes: 3

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I was thinking about it. count/<resource> syntax counts the objects in API server and is mainly used to protect against exhaustion of control plane storage. For some resources, ResourceQuota also supports a specialized syntax. For instance, you can set count/pods and pods, and these two can mean different things (and have different use cases). count/pods will count all the pods in the namespace, whereas pods will only count pods in non-terminal state in the namespace. Therefore, the former seems to be useful in terms of limiting the storage, and the latter handles the actual business logic, where you only care about active pods (for example due to pod IP space constraints).

Following that logic, nodes seems more suitable to me than count/nodes, as we don't simply count Node objects in the API server. In case of CAS, during scale up operations we also want to count the upcoming nodes, so the nodes that had been provisioned in previous loops, but have not been yet registered. In case of Karpenter. you might want to count NodeClaims in a similar scenario. Also in CAS we plan to allow cloud providers to have custom node filtering logic (as mentioned in #8702 (comment)), for example to filter out surge upgrade nodes in GKE.

Other than that, I think it's just more convenient as an API.

What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see the distinction vs the specific pods vs the generic pods/count. It seems that ResourceQuota evolved from gating Node resources to generic Kubernetes resources. Interesting sleight of hand :).

If we consider the API to be "node specific", which I think is fair, I could see making this field just noderesourcequota.spec.hard.count = 3. If it's not node specific, then something like capacityresourcequota.spec.hard.nodes = 3 might make more sense.

Not too fussed on the differences between those, but I agree that the semantic is different than the "storage limit" you raised. If we supported something like count/nodes, we might find ourselves operating on any global resource count/validatingadmissionpolicy, which is definitely not the intent of this API.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming it just count might be not descriptive enough, as there are limits on other resources too. However, capacityresourcequota.spec.hard.nodes makes sense to me. I guess in such case it would look like this:

spec:
  limits:
    nodes: 3
    resources:
      cpu: 16

The question is whether moving nodes outside of resources improves this API in any way. I'd probably stick with keeping it in resources to keep it simple, but I won't insist on that if you have a strong preference

memory: 256Gi
```

* `selector`: A standard Kubernetes label selector that determines which nodes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have finally realized why you called it scopeSelector, it's because it's in ResourceQuota 🤦‍♂️ .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah exactly, though @jonathan-innis noted that scopeSelector in ResourceQuota is a bit different, since it's not just a label selector, but rather a named filter that you need to reference via scopeName field. Here we just want to use a plain label selector. Therefore, indeed scopeSelector might not be an accurate name, and probably it's better to go with selector or nodeSelector (though this one might be inaccurate if we were to support DRA)

@norbertcyran
Copy link
Contributor Author

This is a question just about the proposal in this PR, Is there a reason why there needs to be a resources field in spec.limits.resources.?

@maxcao13 That might seem redundant, yes. The main reason was to account for possible future support for DRA limits. There was a discussion around this here: https://docs.google.com/document/d/1ORj3oW2ZaciROAbTmqBG1agCmP_8B4BqmCNnQAmqmyc/edit?disco=AAABsjPQRlI

In general, if we were to support DRA limits, we would probably put limits on devices. We will not be able to simply put them at the same level of nesting as other resources. Therefore, we have 2 options:

  1. We account for that and nest limits under resources. If there will be plans to add DRA limits, we can just add a new field under limits, next to resources (current proposal):
spec:
  limits:
    resources:
      cpu: 64
    draDevices: ... # possibly added in the future
  1. We ignore it now and in the future we will probably have to add a top-level field for DRA limits:
spec:
  limits:
    cpu: 64
  draLimits: ...

API-wise, option 1 looks cleaner to me, even if it leaves more boilerplate initially. I'm open to other suggestions, though.

Comment on lines +87 to +91
usage:
cpu: 32
memory: 128Gi
nodes: 50
```
Copy link
Member

@maxcao13 maxcao13 Nov 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think ResourceQuota uses status.used. Should we keep things consistent and/or is there a reason why this differs?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/cluster-autoscaler cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants