bernot-dev

What type of PR is this?

What this PR does / why we need it:

/kind feature

Enhancement proposal for VPA behavior for DaemonSets. See change for details.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. area/vertical-pod-autoscaler labels Mar 18, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: bernot-dev
Once this PR has been reviewed and has the lgtm label, please assign jbartosik for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Contributor

Hi @bernot-dev. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 18, 2025
@adrianmoisey
Member

/retitle AEP-7942 Vertical Pod Autoscaling for DaemonSets with Heterogeneous Resource Requirements

@k8s-ci-robot k8s-ci-robot changed the title add vpa enhancement proposal AEP-7942 Vertical Pod Autoscaling for DaemonSets with Heterogeneous Resource Requirements Mar 18, 2025
@bernot-dev
Author

/assign

@omerap12
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 18, 2025
@omerap12
Member

After thinking more about it, I don’t think this is a good idea. Here’s why:

  1. DaemonSets are supposed to be the same on all nodes, and this change goes against that rule.
  2. There are different types of nodes, and it would be very hard to decide how each DaemonSet pod should work on each one.
  3. Nodes in a cluster can be replaced often (like spot instances). If DaemonSets behave differently on each node, it could cause problems when nodes change.

Because of this, I think this idea might create more problems than it solves. It would make things more complicated and harder to manage.

But again, this is just my opinion. Maybe others think differently :)

@bernot-dev
Author

@omerap12 Thanks for your comments. I'll respond to each point individually.


1

According to the documentation, a DaemonSet ensures that all (or some) Nodes run a copy of a Pod. I am not aware of any promise of "sameness" beyond that.

There is a reference to using multiple DaemonSets to specify different resource requests. This proposal aims to ease the operational burden of managing multiple DaemonSets that would otherwise need to be created strictly to manage resource allocation.

In addition, this proposal aims to address situations where determining the correct resource allocations in advance is infeasible because it depends on other workloads within the cluster. As another example, consider a cluster storage daemon. The correct resource allocations have little to do with characteristics of the node. Rather, it depends how storage is used by other workloads scheduled to the node. If none of the pods scheduled to the node use storage at all, then minimal resources are needed. If other pods scheduled to the node are heavy storage users, the daemon may need a large resource allocation. Without the changes described in this proposal, that operational burden falls to humans. This proposal allows that operational burden to shift to Kubernetes.


2

The type of node is not considered under this proposal. Only the actual usage history is relevant. This is consistent with the current behavior of VPA, which also does not consider node types. I agree that considering types of nodes with VPA would add additional complexity. There may be use cases for that kind of behavior, but I consider that concern out of scope for the current proposal.


3

There are several reasons a Kubernetes user may want to incorporate spot instances into their cluster. Making their cluster easier to manage is not one of them. I think there are probably some cases where VPA over spot instances makes sense, and probably many where it doesn't. At any rate, this proposal does not close the door to any existing options for managing workloads, including over spot instances.

@omerap12
Member

@bernot-dev
Maintaining a 1:1 relationship between DaemonSet pods and nodes would be challenging, especially in dynamic environments where nodes are frequently replaced (like with spot instances). This is why I initially thought we'd be discussing node types rather than individual nodes.
I think there's an important point we should consider: if you have nodes with many pods and other nodes with few pods, the best approach might be to understand why this happens. It might be better to have separate DaemonSets for these different scenarios and let VPA track two different DaemonSets.

@alvaroaleman
Member

I think there's an important point we should consider: if you have nodes with many pods and other nodes with few pods, the best approach might be to understand why this happens. It might be better to have separate DaemonSets for these different scenarios and let VPA track two different DaemonSets.

I disagree; at that point you are basically prescribing that a k8s cluster should only have homogeneous workloads in order to keep pods per node the same for all workloads.

The problem this design aims to solve is very real. Most k8s clusters do not have a homogeneous set of nodes and/or workloads, and most agents (which is what DaemonSets are used for) scale relative to the number and activity of the pods. Think log gatherers, metrics gatherers, or security daemons that track syscalls. I do agree on the general point, though, that it will be hard or impossible to make this useful in scenarios where nodes are very short-lived. I also like your idea of using an instance-type average as a starting point rather than the DaemonSet setting.

@omerap12
Member

@alvaroaleman Thanks for your thoughts!
You're right that Kubernetes clusters usually have mixed workloads—sorry if I wasn’t clear. I was referring to extreme cases where pods are distributed very unevenly across nodes. But I agree that the problem this design aims to solve is real.

The challenge is implementing this in environments where nodes are short-lived. Even with the node type idea, it could still be tricky. For example, with Karpenter, the nodePool settings could generate many different node types, making them hard to track. I also considered using node labels, but that could lead to different node types sharing the same labels, which would be problematic as well.

@bernot-dev
Author

It seems like there may be disagreement on this non-goal:

Use data from other nodes to make an informed recommendation when a pod is added to a DaemonSet with VPA enabled after a new node is added to a cluster.

There are many ways that you could try to estimate resource requests for a new pod that is created when a new node is added to the cluster. However, among the options that I have considered, there is not a clear "winner."

Using node types might be better than nothing in some scenarios, but it comes at the additional cost of tracking usage over each node type grouping. Furthermore, there is no guarantee that the node type has any relationship whatsoever with the daemon's workload. It is a rough heuristic that may suit some use cases, but certainly is not universally applicable. Also, it raises the question of what should happen when a new node type is introduced to the cluster. I don't believe there is an obvious answer for what the user would expect in that scenario.

Another option could be applying the average over the entire DaemonSet, instead of a node type average. Again, it's possible you would get a better recommendation from the outset, but it is not guaranteed.

Yet another option would be setting a resource request proportional to the node size. So, a 32M memory request on a node with 4G of memory translates to a 128M (4x) request on a node with 16G (4x) of memory capacity. You could even do maths to determine a weighted average over different node sizes.
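The arithmetic of that proportional option can be sketched like this (a hypothetical illustration; the function name is made up and nothing like it exists in VPA today):

```go
package main

import "fmt"

// proportionalRequest scales a baseline request by the ratio of the new
// node's capacity to the capacity of the node the baseline came from.
func proportionalRequest(baseRequest, baseCapacity, newCapacity int64) int64 {
	return baseRequest * newCapacity / baseCapacity
}

func main() {
	const mi = int64(1) << 20
	const gi = int64(1) << 30
	// A 32M request observed on a 4G node becomes a 128M request on a 16G node (4x).
	fmt.Println(proportionalRequest(32*mi, 4*gi, 16*gi) / mi) // prints 128
}
```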

Each of these options assumes that knowing the size of the node tells you something about the workload of the Daemon. I reject that assumption. I understand there likely exists a weak correlation between the resource consumption of a Daemon and the size of the node it is on for typical workloads. However, I do not believe that the correlation is strong enough to justify adding complexity to VPA.

Furthermore, the lifespan of the initial resource requests is expected to be short under VPA. Making a better guess at the initial resource allocation may affect the pod for O(minutes) before it receives a recommendation founded in the desired data. The impact of the initial recommendation is minimal, except in short-lived workloads, which are probably not a good fit for VPA to begin with.

Finally, a new node may not have workloads scheduled immediately. If a node is added and a corresponding log gatherer daemon is added, that log gatherer will have very little work to do until other workloads are added that start producing logs. That could happen quickly or it could happen slowly. It could be the case that the first pod scheduled is an application under load with extremely verbose logging enabled. Or it could just be that the workloads scheduled only produce sparse logs, and the daemon never needs to scale up. Making a guess based on what is happening on other nodes does not necessarily improve the guess.

For these reasons, I affirm my position that this should remain a non-goal.

@voelzmo
Contributor

voelzmo commented Mar 25, 2025

@bernot-dev Thanks for the proposal, I really think this would be a useful addition! I've seen the very same problem that you're describing: DaemonSet resource consumption can depend heavily on what specific Pods are on a Node and what they do (lots of outgoing/incoming traffic, lots of writes to disk, lots of logs produced, etc.).
@omerap12 is right in the sense that currently VPA treats all entities under control of the targetRef to be the same, that's why all utilization samples are treated the same and all the resource recommendations are the same.

As pointed out by this proposal, this may not be true for specific DaemonSets and you still want VPA to adjust the requests according to usage, but on a per-Pod (i.e. per-Node) level. If you want/need this, you could enable this with the new Scope field. If you don't need it, because you haven't experienced this with your DaemonSets, you don't use it.
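Concretely, opting in with the proposed Scope field might look something like this (a sketch based on the API snippet in this PR; the DaemonSet name is illustrative and the field is not part of any released VPA API):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: log-agent-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: DaemonSet
    name: log-agent
  scope: node   # proposed: per-node recommendations for the DaemonSet's pods
  updatePolicy:
    updateMode: Auto
```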

@bernot-dev I'm interested a bit in the details of implementing this: while it seems a reasonable enough change on a high level, it also seems quite intrusive to the way vpa-recommender currently aggregates usage samples, makes recommendations, and stores them. Have you put together some prototype to prove this is feasible, or some rough ideas on how to make this work?

In the end, we would probably need to

  • make sure that the individual DaemonSet Pods end up having separate aggregateStateKeys for them to get individual histograms
    • Currently, the aggregateStateKey is composed of values which are the same for all Pods, including for example the Pod's labels:
      ```go
      // getLabelSetKey puts the given labelSet in the global labelSet map and returns a
      // corresponding labelSetKey.
      func (cluster *ClusterState) getLabelSetKey(labelSet labels.Set) labelSetKey {
          labelSetKey := labelSetKey(labelSet.String())
          cluster.labelSetMap[labelSetKey] = labelSet
          return labelSetKey
      }

      // MakeAggregateStateKey returns the AggregateStateKey that should be used
      // to aggregate usage samples from a container with the given name in a given pod.
      func (cluster *ClusterState) MakeAggregateStateKey(pod *PodState, containerName string) AggregateStateKey {
          return aggregateStateKey{
              namespace:     pod.ID.Namespace,
              containerName: containerName,
              labelSetKey:   pod.labelSetKey,
              labelSetMap:   &cluster.labelSetMap,
          }
      }
      ```
  • make sure that we either get additional VPA objects generated, or introduce an additional way to create multiple recommendations per VPA and store multiple recommendations in a single VPA status.
    • Currently, recommender creates a single recommendation per VPA object:
      resources := r.podResourceRecommender.GetRecommendedPodResources(GetContainerNameToAggregateStateMap(vpa))
    • Currently, there is a single recommendation per VPA in the .status which is applied to all Pods controlled by this VPA
    • Probably we would need a way to generate a VPA per Node and keep it up-to-date with Node autoscaling with something like CA or Karpenter or users modifying the Node count themselves
  • make sure that vpa-updater understands which Pods to evict. Maybe this is a no-op if we end up having multiple VPA objects and there is a clear relationship between a single VPA and the Pod that needs to be evicted.
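
A very rough sketch of the first bullet, purely as an illustration (the `nodeName` field and `makeKey` helper are hypothetical, not from the proposal or the recommender code): with a node-scoped VPA, pods of the same DaemonSet on different nodes would produce distinct keys and therefore feed separate histograms.

```go
package main

import "fmt"

// aggregateStateKey is a simplified stand-in for the recommender's real key type.
// The nodeName field is the hypothetical addition: when the VPA is node-scoped,
// pods of the same DaemonSet on different nodes map to different keys.
type aggregateStateKey struct {
	namespace     string
	containerName string
	labelSetKey   string
	nodeName      string // "" when the VPA is not node-scoped
}

// makeKey builds a key, including the node name only for node-scoped VPAs.
func makeKey(namespace, container, labels, node string, nodeScoped bool) aggregateStateKey {
	k := aggregateStateKey{namespace: namespace, containerName: container, labelSetKey: labels}
	if nodeScoped {
		k.nodeName = node
	}
	return k
}

func main() {
	a := makeKey("kube-system", "agent", "app=log-agent", "node-1", true)
	b := makeKey("kube-system", "agent", "app=log-agent", "node-2", true)
	fmt.Println(a != b) // node-scoped: separate histograms per node
}
```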

Answers to these open points will most likely have a huge effect on VPAs API, therefore I think we should clarify them a bit more before we can merge this proposal.

PS: I saw this part in the proposal

When this feature is enabled, the internal representation of the VerticalPodAutoscaler will be divided into N objects, where N is the number of Nodes with a pod scheduled. When the recommendation is produced and exposed on the VerticalPodAutoscaler, the recommendation will include the recommendation generated for each node. Aside from these differences, the mechanics of VerticalPodAutoscaling will remain fundamentally unchanged.

When using Metrics Server, the VerticalPodAutoscalerCheckpoints will also be divided into separate resources, by the node scope. When the storage is Prometheus, VerticalPodAutoscalerCheckpoints are not created.

Internally, the ClusterState will track a separate VPA for each schedulable node of the DaemonSet. Nodes with taints that are not tolerated, or that otherwise do not meet the nodeSelector criteria for the DaemonSet in the TargetRef, will be ignored.

but this is too high level for me to understand what the changes on the internals and the API would be. In a bit more detail

When the recommendation is produced and exposed on the VerticalPodAutoscaler, the recommendation will include the recommendation generated for each node.

Does this mean you are proposing to change .status.recommendation to not only contain containerRecommendations, but also introduce a .node level or similar? If this is true, we'd also need some more modifications to the updater and admission-controller, as mentioned above.

When using Metrics Server, the VerticalPodAutoscalerCheckpoints will also be divided into separate resources, by the node scope. When the storage is Prometheus, VerticalPodAutoscalerCheckpoints are not created.

Oh, right, I forgot about that in my list above. Currently, we have a VPACheckpoint object per VPA and container, now we would need one per Node and also change the initialization logic to feed this into the internal histogram state on recommender reboot!

@adrianmoisey
Member

It seems like there may be disagreement on this non-goal:

Use data from other nodes to make an informed recommendation when a pod is added to a DaemonSet with VPA enabled after a new node is added to a cluster.

There are many ways that you could try to estimate resource requests for a new pod that is created when a new node is added to the cluster. However, among the options that I have considered, there is not a clear "winner."

I do find this non-goal to be a small step backwards.

At the moment we use VPAs for DaemonSets. The results we get are mostly-ok, but certainly not perfect.
This AEP will definitely improve on that as we get recommendations per node.

However, it's now up to the user to pick the starting value for the pod, rather than allowing for the VPA to do it.
Is there a reason why new nodes (with no VPA history) don't fall back to the current behaviour?

Comment on lines 361 to 366
+ // Scope indicates at what level the recommendations should be calculated and applied.
+ // By default, the scope is all pods under the controller specified in `TargetRef`. Scope
+ // may also be set to "node" when the TargetRef is a DaemonSet, to apply recommendations
+ // independently to the pod scheduled to each node under the DaemonSet.
+ // +kubebuilder:validation:Enum=;node
+ Scope VPAScope `json:"scope,omitzero" protobuf:"bytes,4,opt,name=scope"`
Member
Nit: the spacing here doesn't match the top part of this struct

Member
I wonder if it makes sense to expand this to be more general. For example, copying topologySpreadConstraints and allowing for any topologyKey ?
I assume it doesn't change the solution much, and may allow for more options in the future?

Author
This is a really interesting idea.

kubernetes.io/hostname definitely aligns with the goals here. I can't quite picture what problem is solved by using topology.kubernetes.io/zone or topology.kubernetes.io/region with VPA. But maybe there's something that makes sense.

It seems there are some references to topologyKey being deprecated. Not sure what the story is there. Do you know if it is still supported?

Member
kubernetes.io/hostname definitely aligns with the goals here. I can't quite picture what problem is solved by using topology.kubernetes.io/zone or topology.kubernetes.io/region with VPA. But maybe there's something that makes sense.

Yeah, this is where I landed too. I really can't see any use other than node, but maybe somebody does have a use case that we just don't know about. It also feels like topologyKey is close to what we want.

I'm very much on the fence on this idea. I'm curious what others think?

It seems there are some references to topologyKey being deprecated. Not sure what the story is there. Do you know if it is still supported?

I don't know the history here. topologyKey is used in topologySpreadConstraints, and that doesn't seem to be going away

@bernot-dev
Author

Prototype from @bboreham

#7978

@bernot-dev
Author

However, it's now up to the user to pick the starting value for the pod, rather than allowing for the VPA to do it.
Is there a reason why new nodes (with no VPA history) don't fall back to the current behaviour?

The user should pick a starting value for resources in the DaemonSet in any case, especially because requests and limits remain proportional.

The current behavior is cluster-scoped. With the node-scoped behavior, we don't have to track/query the cluster-scoped usage at all. To implement a starting value based on other nodes, we would have to track the aggregated usage in addition to the individual usage to fall back on unscoped behavior. This would be ongoing work that would only be used occasionally.

Also, as I discussed previously, I question the value of the initial recommendation. There's no particular reason to believe that common DaemonSets would immediately (or ever) scale to levels similar to other nodes. A daemon that collects observability signals will need other workloads to observe first.

Compared to (for instance) adding a pod to a deployment with a load balancer in front of it, many common DaemonSets will have much less predictable workloads.

@omerap12
Member

omerap12 commented Mar 26, 2025

Thanks so much for this, both @bernot-dev and @bboreham!

If I understand correctly (and please correct me if I’m wrong - I haven’t had time to dive deep, but I did review the idea briefly), we are defining recommendations for DaemonSet pods in a 1:1 pod-to-node relationship.

From what I see in the PR, it’s not strictly limited to DaemonSets, but let’s set that aside for now. Essentially, if pod X is scheduled on node Y, the recommendation is adjusted accordingly. That means this recommendation doesn’t account for other DaemonSet pods, correct? Since each pod is scheduled on a different node, they wouldn’t influence each other.

What happens when a node deregisters from the cluster and a new node is added? Does the recommendation process restart in that case?

I think this approach makes a lot of sense for static clusters where nodes are permanent, but I’m curious how it behaves in cloud provider environments where nodes are more dynamic.

Edit: I had another idea. Rather than using the node name or label, we could apply a hash function to the CPU and RAM allocatable resources of the node. I believe this approach addresses the issues I mentioned earlier.
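That hashing idea could be sketched roughly like this (purely illustrative; the function name and key format are made up): nodes with identical allocatable resources would share one recommendation bucket even as individual nodes come and go.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// nodeShapeKey groups nodes by their allocatable CPU (millicores) and
// memory (bytes): two nodes with the same allocatable resources hash to
// the same bucket, regardless of node name or lifetime.
func nodeShapeKey(cpuMilli, memBytes int64) uint64 {
	h := fnv.New64a()
	fmt.Fprintf(h, "%d/%d", cpuMilli, memBytes)
	return h.Sum64()
}

func main() {
	a := nodeShapeKey(4000, int64(16)<<30) // 4 CPU, 16Gi node
	b := nodeShapeKey(4000, int64(16)<<30) // a different node, same shape
	c := nodeShapeKey(8000, int64(32)<<30) // 8 CPU, 32Gi node
	fmt.Println(a == b, a == c)
}
```

As bboreham notes below, this breaks down when nodes with the same shape have materially different performance.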

@adrianmoisey
Member

However, it's now up to the user to pick the starting value for the pod, rather than allowing for the VPA to do it.
Is there a reason why new nodes (with no VPA history) don't fall back to the current behaviour?

The user should pick a starting value for resources in the DaemonSet in any case. Especially because requests and limits remain proportional.

I agree that the user should pick a starting value, but at the moment there is no requirement for a starting value to be picked, which makes this change feel backwards incompatible.

For example, if someone has a cluster-scoped DaemonSet VPA configured, and the DaemonSet spec is configured without Pod resources, any new Pod will be created with resources set, thanks to the VPA.

However, if they were to change that VPA to use a node-scoped DaemonSet VPA, then new Pods will start with no resources set, until enough time has passed and the node-scoped recommendation is applied.

I worry that a user will naively enable the node-scoped feature, without knowing that this will change the behaviour of the VPA in a backwards incompatible way.

The current behavior is cluster-scoped. With the node-scoped behavior, we don't have to track/query the cluster-scoped usage at all. To implement a starting value based on other nodes, we would have to track the aggregated usage in addition to the individual usage to fall back on unscoped behavior. This would be ongoing work that would only be used occasionally.

Why can't this new feature be implemented in a way to track both cluster-scoped and node-scoped metrics? This ongoing work will be used every time a new node is created. For any cluster with autoscaling enabled, this possibly happens many times a day.

Also, as I discussed previously, I question the value of the initial recommendation. There's no particular reason to believe that common DaemonSets would immediately (or ever) scale to levels similar to other nodes. A daemon that collects observability signals will need other workloads to observe first.

From what I've observed, a DaemonSet Pod only exists because the workload needs it.
Let me explain what I mean here...
An HPA scales up, causing a new Pod for my Deployment workload to go into Pending.
This then causes a new node to be created and for that workload to get scheduled.
This new node triggers a new DaemonSet pod.

If enough Pods are created at the same time, this node could start its life already full.
I've seen situations where a DaemonSet pod can't schedule on a brand-new node, since order of scheduling isn't guaranteed, and the workload just happened to be scheduled first.

@raywainman
Member

What if we also deployed a second VPA that is cluster-scoped and in mode = INITIAL, so that new pods still had a different value applied? We could then have a clearly documented prioritization order in the admission-controller, so that any node-scoped recommendation would take precedence.

Trying to have a single VPA represent all of this sounds a bit complicated from the API perspective.

@adrianmoisey
Member

Trying to have a single VPA represent all of this sounds a bit complicated from the API perspective.

I wonder if it makes sense to take a step back and figure out what we want from this and the right API to build it on top of?

I wonder if in-place will allow other types of workloads (Deployments, StatefulSets, etc) to have per-pod recommendations, since the cost of resizing is now lower.

I don't know if that's a possibility in our future, but it may make sense to build this feature to allow for future development.

@bboreham
Contributor

we could apply a hash function to the CPU and RAM allocatable resources of the node

This would not work in the case where a node with the same number of CPUs runs at a different speed.
At work we recently went through a phase of running one cluster on AWS m5, m5a, m6g, m7g, and other nodes, and I can report that they have materially different performance.

@omerap12
Member

we could apply a hash function to the CPU and RAM allocatable resources of the node

This would not work in the case where a node with the same number of CPUs goes at a different rate. At work we recently went through a phase of running one cluster on AWS m5, m5a, m6g, m7g and other nodes, and I can report that they have materially different performance.

I understand. So do you have any ideas regarding the issues I brought up?

@bboreham
Contributor

scope= node.kubernetes.io/instance-type

This is a bit more complicated to implement than the node name, as it requires an extra API call from the admission-controller to see the labels. However, that is a minor concern.

@omerap12
Member

scope= node.kubernetes.io/instance-type

This is a bit more complicated to implement than node name, as it requires an extra api call from the admission-controller to see the labels. However that is a minor concern.

I see. I’m just wondering how this behaves in a large cluster with many frequently replaced instances.

@verejoel

verejoel commented Apr 2, 2025

Would it be too far-fetched to expand this proposal to also include StatefulSets?

The story supporting this use case comes again from Observability, this time considering the distributor/ingester setup on the write path, which is common across a number of projects (Thanos, Cortex, and Mimir, for example).

Here requests are routed to individual pods of the StatefulSet by hashing some property of the incoming data (for example, a label set). When you also introduce replication, this can lead to some quite unbalanced resource usage across the pods in the StatefulSet.

Per-pod resource recommendations would be tremendously valuable in this case.

@adrianmoisey
Member

Would it be too far-fetched to expand this proposal to also include StatefulSets?

I think it's a reasonable request, but at the moment I'm not sure if the admission-controller knows which host the pod will land on. DaemonSets have the host baked into the Pod spec, but I'm unsure about StatefulSets.

Maybe the in-place feature will allow this in the future?

@bernot-dev
Author

One problem this enhancement aims to solve is scaling daemons on substantially different node sizes.

If the initial recommendation is based on the cluster average, you could end up with unschedulable pods. For instance, suppose the cluster average for the pods in a DaemonSet is 10 GB of memory per node, and a new node only has 4 GB of memory capacity. If there's no way to specify that you want a different behavior for the initial request, this is a big problem, because the pod will presumably never be scheduled and given the opportunity to autoscale. For this reason, this would at least need to be configurable to avoid negating the benefits of this enhancement.
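The failure mode reduces to a simple capacity check (a hypothetical sketch; the function name is made up, using the 10 GB / 4 GB numbers from the example):

```go
package main

import "fmt"

// fitsOnNode reports whether a recommended request can be satisfied by a
// node's allocatable capacity. If not, the pod stays Pending and never
// produces the node-local usage samples a node-scoped VPA would need.
func fitsOnNode(requestBytes, allocatableBytes int64) bool {
	return requestBytes <= allocatableBytes
}

func main() {
	clusterAvg := int64(10) << 30 // 10 GB cluster-average recommendation
	smallNode := int64(4) << 30   // 4 GB node
	fmt.Println(fitsOnNode(clusterAvg, smallNode)) // prints false: unschedulable
}
```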

Let's consider a couple ways an API for configurable initial recommendations could be implemented:

  1. Multiple Scope options: (node, node-with-cluster-initial). This doesn't require an additional field, but it gets pretty ugly and creates a combinatorial explosion if scope options expand in the future.
  2. Additional InitialScope field: (Scope: node, InitialScope: cluster). This is cleaner, but still has some problems...

In the second option, you would want the default to be spec to maintain backwards compatibility with existing VPAs. Imagine we eventually implement most of the scopes that have been discussed/requested. Some combinations of scopes would make sense, and some are questionable.

| Scope \ InitialScope | Default/Spec | Node | Node Pool | Instance Type | Machine Family | kubernetes.io/os | Cluster |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Default/Cluster |  |  |  |  |  |  |  |
| Node |  |  |  |  |  |  |  |
| Node Pool |  |  |  |  |  |  |  |
| Instance Type |  |  |  |  |  |  |  |
| Machine Family |  |  |  |  |  |  |  |
| kubernetes.io/os |  |  |  |  |  |  |  |

I don't think it would make sense to include Node as an InitialScope option because it wouldn't be expected to create a meaningful recommendation, except perhaps in a future scenario where scope is expanded to include other controller types where multiple pods are scheduled on the same node.

Setting a more restrictive scope for the initial recommendation than the ongoing VPA recommendations is not meaningful because the recommendation at the larger scope will already exist. For instance, it's hard to imagine a reasonable scenario where you would want to use the default/cluster scope for recommendations but the initial recommendation to be from a node pool.

Setting scope to NodePool and initial scope to InstanceType seems questionable, and likely a mistake. While there are scenarios where this could result in a better initial recommendation for the first node of a given instance type in a node pool, it would not be meaningful for the 2nd node added to the node pool.

I'm not sure if it would be best to allow all of the combinations, or try to validate them.

At any rate, InitialScope could be implemented in a follow-up proposal without blocking this proposal. However, I do think spec makes more sense as an implicit default to avoid the unschedulable-pod problem until InitialScope is implemented.

@adrianmoisey
Member

At any rate, InitialScope could be implemented in a follow-up proposal without blocking this proposal. However, I do think spec makes more sense as an implicit default to avoid the unschedulable-pod problem until InitialScope is implemented.

I agree that it makes sense, I worry that it’s a big change from the existing behavior.

If someone currently has a VPA pointing at a DaemonSet and changes it to be a node-scoped VPA, the behavior of the VPA changes considerably.
Before the change, new Pods would get cluster-scoped resources; after the change, they would get what is defined in the spec until a node recommendation is available.

@bernot-dev
Author

if someone currently has a VPA pointing at a DaemonSet, and changes it to be a node-scoped VPA, the behavior of the VPA changes considerably.

A change in behavior is intended.

It's worth noting that under current VPA behavior, if there is not enough data available for a confident recommendation, the admission webhook does not apply one. This existing fallback is the behavior the current proposal emulates, rather than attempting to leverage data that the user has explicitly indicated is out of scope.

The behavior for a scope that includes multiple pods would "feel" more similar to the existing behavior. For instance, in a scenario where a NodePool scope is supported, when a 10th node is added to a node pool, the DaemonSet pod scheduled on it would immediately get a recommendation derived from the pods on the other nine nodes.

The scope is precisely intended to convey that only other pods within the same scope are relevant to vertically scaling the workload.
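The fallback behavior described above can be sketched as a scope-keyed lookup. The `requestsFor` helper, `scopeKey`, and the map shapes are illustrative stand-ins for this discussion, not the actual VPA types or admission-controller code:

```go
package main

import "fmt"

// requestsFor mirrors the fallback described above: the admission webhook
// applies a recommendation only when one exists for the pod's scope key
// (e.g. the node name for a node-scoped VPA); otherwise the requests
// written in the pod spec stand. Types are illustrative, not VPA API.
func requestsFor(specRequests map[string]string, scopeKey string,
	recommendations map[string]map[string]string) map[string]string {
	if rec, ok := recommendations[scopeKey]; ok {
		return rec
	}
	return specRequests // no confident recommendation yet: leave the spec untouched
}

func main() {
	recs := map[string]map[string]string{
		"node-a": {"memory": "200Mi"}, // recommendation exists for node-a only
	}
	spec := map[string]string{"memory": "100Mi"}
	fmt.Println(requestsFor(spec, "node-a", recs)["memory"]) // 200Mi
	fmt.Println(requestsFor(spec, "node-b", recs)["memory"]) // 100Mi (spec fallback)
}
```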

@adrianmoisey
Member

adrianmoisey commented Apr 15, 2025

if someone currently has a VPA pointing at a DaemonSet, and changes it to be a node-scoped VPA, the behavior of the VPA changes considerably.

A change in behavior is intended.

I worry about this: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api_changes.md#on-compatibility

does not change existing semantics, including:
the semantic meaning of default values and behavior

This makes me wonder whether extending the existing CR is the right thing to do, or whether it makes more sense to create a new "scoped" VPA CR?

Or I'm misinterpreting the API guidelines.

@bernot-dev
Author

The Adding a Field example describes this case.

It is generally allowed to add new fields without changing the API version

The onus is on you to define a sane default value

In the proposal, the default value of Scope maintains existing behavior. I don't see the problem.

@bernot-dev
Author

if someone currently has a VPA pointing at a DaemonSet, and changes it to be a node-scoped VPA, the behavior of the VPA changes considerably.

A change in behavior is intended.

I worry about this

To be more specific, a change in behavior is intended when you change the VPA spec. Changing a field/value in the VPA spec without causing any change in behavior would be useless.

@omerap12
Member

One problem this enhancement aims to solve is scaling daemons on substantially different node sizes.

If the initial recommendation is based on the cluster average, you could end up with unschedulable pods. For instance, the cluster average is 10GB of memory per node for the pods in a DaemonSet and a new node only has 4 GB of memory capacity. If there's not a way to specify that you want a different behavior for the initial request, this is a big problem because the pod will presumably never be scheduled and given the opportunity to autoscale. For this reason, this would at least need to be configurable to avoid negating the benefits of this enhancement.

Let's consider a couple ways an API for configurable initial recommendations could be implemented:

  1. Multiple Scope options: (node, node-with-cluster-initial). This doesn't require an additional field, but it gets pretty ugly and creates a combinatorial explosion if scope options expand in the future.
  2. Additional InitialScope field: (Scope: node, InitialScope: cluster). This is cleaner, but still has some problems...

In the second option, you would want the default to be spec to maintain backwards compatibility with existing VPAs. Imagine we eventually implement most of the scopes that have been discussed/requested. Some combinations of scopes would make sense, and some are questionable.

InitialScope -> Default/Spec Node Node Pool Instance Type Machine Family kubernetes.io/os Cluster
Scope: Default/Cluster ✅ ❌ ❌ ❌ ❌ ❌ ✅
Scope: Node ✅ ❌ ✅ ✅ ✅ ❓ ✅
Scope: Node Pool ✅ ❌ ✅ ❓ ❓ ❓ ✅
Scope: Instance Type ✅ ❌ ❌ ✅ ❓ ❓ ✅
Scope: Machine Family ✅ ❌ ❌ ❌ ✅ ❓ ✅
Scope: kubernetes.io/os ✅ ❌ ❌ ❌ ❌ ✅ ✅
I don't think it would make sense to include Node as an InitialScope option because it wouldn't be expected to create a meaningful recommendation, except perhaps in a future scenario where scope is expanded to include other controller types where multiple pods are scheduled on the same node.

Setting a more restrictive scope for the initial recommendation than the ongoing VPA recommendations is not meaningful because the recommendation at the larger scope will already exist. For instance, it's hard to imagine a reasonable scenario where you would want to use the default/cluster scope for recommendations but the initial recommendation to be from a node pool.

Setting scope to NodePool and initial scope to InstanceType seems questionable, and likely a mistake. While there are scenarios where this could produce a better initial recommendation for the first node of a given instance type in a node pool, it would not be meaningful for the second node added to the pool.

I'm not sure if it would be best to allow all of the combinations, or try to validate them.

At any rate...InitialScope could be implemented in a follow-up proposal without blocking this proposal. However, I do think spec makes more sense as an implicit default to avoid the unschedulable pod problem until InitialScope is implemented.

I agree that InitialScope makes sense as a follow-up. It keeps the first version simple, and using spec as the default helps avoid unschedulable pods for now.

@adrianmoisey
Member

The Adding a Field example describes this case.

It is generally allowed to add new fields without changing the API version

The onus is on you to define a sane default value

In the proposal, the default value of Scope maintains existing behavior. I don't see the problem.

Right, valid point.

For my use case, the proposal would be one step forward and one step back, which is why I'd want a cluster-scoped initial recommendation.

Comment on lines 405 to 406
When using Metrics Server, the VerticalPodAutoscalerCheckpoints will also be
divided into separate resources, by the node scope. When the storage is
Member

Can details be added here? How will these be named? How will garbage collection work? etc

Author

I imagine naming the checkpoints will likely involve adding a suffix.

VerticalPodAutoscalerCheckpoints are not user-facing, so I consider this an implementation detail that would best be handled when reviewing the proposed implementation, rather than in the design phase. There is no intent to change the way recommendations are calculated, data is stored, or garbage is collected, beyond sub-dividing it into smaller pieces.

If there is a design concern here that needs to be addressed, I'm happy to explore this issue further.
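For illustration only (the comment above deliberately defers naming to implementation review), one way a per-scope suffix could be derived is by hashing the scope value. The `checkpointName` helper is hypothetical, not the actual VPA implementation; hashing keeps names deterministic and short enough to stay within Kubernetes' name-length limits even for long node names:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// checkpointName appends a short hash of the scope value (e.g. a node name)
// to a per-VPA, per-container checkpoint name. Hypothetical sketch only.
func checkpointName(vpaName, container, scopeValue string) string {
	sum := sha256.Sum256([]byte(scopeValue))
	return fmt.Sprintf("%s-%s-%x", vpaName, container, sum[:4])
}

func main() {
	fmt.Println(checkpointName("my-vpa", "app", "node-1"))
	fmt.Println(checkpointName("my-vpa", "app", "node-2")) // different suffix
}
```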

Member

VerticalPodAutoscalerCheckpoints are not user-facing, so I consider this an implementation detail that would best be handled when reviewing the proposed implementation, rather than in the design phase.

That's not the intent of the KEP process. The Design Detail section states:

If there's any ambiguity about HOW your proposal will be implemented, this is the place to discuss them.

My question is asking for clarity on VerticalPodAutoscalerCheckpoints in order to remove any ambiguity.
The idea is to get these sorts of details figured out (and agreed on) before the PR. It also serves as a document stating the decisions we made along the way.

There is no intent to change the way recommendations are calculated, data is stored, or garbage is collected, beyond sub-dividing it into smaller pieces.

How will this subdivide happen? Will the VerticalPodAutoscalerCheckpoints resource change?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 19, 2025
@bernot-dev
Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 5, 2025

Labels

area/vertical-pod-autoscaler cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
