
KEP-1287: Allow memory limit decreases #5368


Open
tallclair wants to merge 1 commit into base: master

Conversation

tallclair (Member):

For the IPPR beta MVP in v1.33, we decided to punt on figuring out how to handle memory limit decreases and to prevent any memory limit decrease for the time being.

For v1.34, we want to lift this restriction, and make a best-effort attempt to prevent oom-kills.

Issue: #1287

/milestone v1.34
/assign @natasha41575 @roycaihw

@k8s-ci-robot k8s-ci-robot added this to the v1.34 milestone Jun 3, 2025
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 3, 2025
@k8s-ci-robot k8s-ci-robot requested review from dchen1107 and mrunalp June 3, 2025 21:52
k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: tallclair
Once this PR has been reviewed and has the lgtm label, please assign dchen1107 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jun 3, 2025
esotsal commented Jun 4, 2025:

/cc

@k8s-ci-robot k8s-ci-robot requested a review from esotsal June 4, 2025 03:09
If the memory resize restart policy is `NotRequired` (or unspecified), the Kubelet will make a
**best-effort** attempt to prevent oom-kills when decreasing memory limits, but doesn't provide any
guarantees. Before decreasing container memory limits, the Kubelet will read the container memory
usage (via the StatsProvider). If usage is greater than the desired limit, the resize will be
Contributor:

If this happens, would it try again in the next sync loop iteration?

Member:

I think so. @tallclair Please correct me if I'm wrong.


Memory limit decreases with `RestartRequired` are still allowed.
_Version skew note:_ Kubernetes v1.33 (and earlier) nodes only check the pod-level memory usage.
Member:

Is this suggesting that pod-level memory limit decrease was already supported in 1.33 for NotRequired policy?

tallclair (author):

The original implementation had this support with the pod-level check. In v1.33 we prevented memory limit decreases via API validation, but didn't put anything in the node to prevent it. So, if we relax the apiserver validation in v1.34, old nodes will be able to decrease the memory limit but without much in the way of protections.

Reviewer:

Would it make sense to move the more complex logic elsewhere? I think it would be better to have memory resizing handled in VPA, using some memory resizing configs (what that means is still to be agreed) and applying them to a workload in the form of options in the VPA Resource Policy.

natasha41575 (Contributor), Jun 10, 2025:

I agree that having more complex logic live in controllers such as VPA makes sense, but VPA still won't be able to decrease memory limits if we don't relax the restriction in the apiserver. We need to make sure that what we do here is reasonable for anyone who may be attempting to decrease memory limits, even outside the context of VPA or other controllers. The proposal in this KEP seems to me like the simplest thing we can possibly do.

Reviewer:

Got it! Thank you for confirming that we will use VPA for more advanced memory resizing use cases.

It would be great to resolve this limitation as part of Kubernetes 1.34, so that memory limit resizing can be used by VPA IPPR (also targeting 1.34).

Member:

> The original implementation had this support with the pod-level check. In v1.33 we prevented memory limit decreases via API validation, but didn't put anything in the node to prevent it. So, if we relax the apiserver validation in v1.34, old nodes will be able to decrease the memory limit but without much in the way of protections.

I see. 1.33 nodes will not perform a container-level memory check before reducing the memory limit, which may result in an OOM kill if the target limit is below the current usage. LGTM, given that the limit reduction operation is best-effort.
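To illustrate the container-level check described in the quoted KEP text, here is a minimal sketch, assuming a hypothetical helper and locally defined stat types rather than the real Kubelet StatsProvider API:

```go
package main

import "fmt"

// containerMemoryStats stands in for the usage sample the Kubelet would read
// from its StatsProvider; the real stats types live in the kubelet stats API.
type containerMemoryStats struct {
	usageBytes uint64
	sampled    bool // whether a usage sample was available at all
}

// canDecreaseMemoryLimit sketches the best-effort check: if current usage is
// above the desired limit, the resize is skipped for that container and the
// PodResizeInProgress condition keeps an Error reason with this message.
func canDecreaseMemoryLimit(stats containerMemoryStats, desiredLimitBytes uint64) (bool, string) {
	if !stats.sampled {
		// No usage sample available; proceed, since the check is only best-effort.
		return true, ""
	}
	if stats.usageBytes > desiredLimitBytes {
		return false, fmt.Sprintf("memory usage (%d bytes) exceeds desired limit (%d bytes)",
			stats.usageBytes, desiredLimitBytes)
	}
	return true, ""
}

func main() {
	ok, msg := canDecreaseMemoryLimit(containerMemoryStats{usageBytes: 300 << 20, sampled: true}, 256<<20)
	fmt.Println(ok, msg) // false: the resize is skipped for this container
}
```

Per the thread above, a skipped resize would be retried on a later sync loop iteration, so the check is advisory rather than a guarantee.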

In the initial beta release of in-place resize, we will **disallow** `PreferNoRestart` memory limit
decreases, enforced through API validation. The intent is for this restriction to be relaxed in the
future, but the design of how limit decreases will be approached is still undecided.
If the memory resize restart policy is `NotRequired` (or unspecified), the Kubelet will make a
esotsal, Jun 6, 2025:

Suggested change:
- If the memory resize restart policy is `NotRequired` (or unspecified), the Kubelet will make a
+ If the memory resize restart policy is specified as `NotRequired` (if not specified, it defaults to `NotRequired`), the Kubelet will make a

decreases, enforced through API validation. The intent is for this restriction to be relaxed in the
future, but the design of how limit decreases will be approached is still undecided.
If the memory resize restart policy is `NotRequired` (or unspecified), the Kubelet will make a
**best-effort** attempt to prevent oom-kills when decreasing memory limits, but doesn't provide any

Suggested change:
- **best-effort** attempt to prevent oom-kills when decreasing memory limits, but doesn't provide any
+ **best-effort** attempt to prevent oom-kills when decreasing memory limits below the container's used memory, but doesn't provide any

@@ -891,7 +894,8 @@ This will be reconsidered post-beta as a future enhancement.

### Future Enhancements

1. Allow memory limits to be decreased, and handle the case where limits are set below usage.
1. Improve memory limit decrease oom-kill prevention by leveraging other kernel mechanisms or using
esotsal, Jun 6, 2025:

Are the current Kubernetes minimum kernel version requirements sufficient for the kernel mechanisms you are referring to? Which kernel mechanisms do you plan to leverage?

tallclair (author):

Still TBD. The initial proposal was to leverage memory.high, but there might be some problems when combining that with memory.max. Further exploration is needed, which is why I left this open-ended.

cc @roycaihw

Member:

We will update the kernel version requirement public doc based on the kernel features we use.

Reviewer:

Thanks for clarifying!

/lgtm

If the memory resize restart policy is `NotRequired` (or unspecified), the Kubelet will make a
**best-effort** attempt to prevent oom-kills when decreasing memory limits, but doesn't provide any
guarantees. Before decreasing container memory limits, the Kubelet will read the container memory
usage (via the StatsProvider). If usage is greater than the desired limit, the resize will be
Contributor:

I think the runtime would be better for this, or even the OCI runtime, as it closes the TOCTOU window. There could be a specified CRI error that could be bubbled up for the kubelet to respond to.

tallclair (author):

I agree, but I'm hesitant to plumb this down into the runtime before we have a better understanding of what the long-term direction is.

Contributor:

what are the alternatives besides reverting the behavior change in the kernel?

tallclair (author):

Maybe leveraging memory.high? @roycaihw any other mechanisms under consideration?

Member:

We can use memory.high and memory.reclaim to bring down the memory usage more safely.

  • memory.reclaim allows us to proactively reclaim memory, to bring the usage down towards our target if needed.
  • memory.high sets a soft limit without OOM killing by doing proactive memory reclaim and even throttling the process, allowing us to take further actions (bringing down memory.max if the process is stable, or reverting memory.high if the process suffers from throttling)
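As a rough illustration of how those two cgroup v2 interfaces could be driven (a sketch only; the cgroup path, byte values, and sequencing are assumptions, not the proposed Kubelet or runtime implementation):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

// reclaimThenCap sketches the sequence described above for a cgroup v2
// container cgroup: proactively reclaim memory via memory.reclaim, then set a
// soft ceiling with memory.high. Lowering the hard limit (memory.max), or
// reverting memory.high if the workload is badly throttled, would come later
// and is the part still under discussion in this thread.
func reclaimThenCap(cgroupDir string, currentUsage, targetBytes uint64) error {
	// memory.reclaim takes an amount of memory (in bytes) to try to reclaim,
	// so request roughly the gap between current usage and the target.
	if currentUsage > targetBytes {
		amount := strconv.FormatUint(currentUsage-targetBytes, 10)
		if err := os.WriteFile(filepath.Join(cgroupDir, "memory.reclaim"), []byte(amount), 0o644); err != nil {
			return fmt.Errorf("memory.reclaim: %w", err)
		}
	}
	// memory.high throttles and reclaims instead of invoking the OOM killer,
	// leaving room to back off before memory.max is ever touched.
	high := strconv.FormatUint(targetBytes, 10)
	if err := os.WriteFile(filepath.Join(cgroupDir, "memory.high"), []byte(high), 0o644); err != nil {
		return fmt.Errorf("memory.high: %w", err)
	}
	return nil
}

func main() {
	// Placeholder cgroup path; on a real node this would be the container's
	// cgroup directory under the kubepods hierarchy.
	if err := reclaimThenCap("/sys/fs/cgroup/kubepods/example-pod/example-ctr", 300<<20, 256<<20); err != nil {
		fmt.Println("resize step failed:", err)
	}
}
```

Whether that sequencing lives in the Kubelet, the CRI runtime, or the OCI runtime is the open question in the rest of this thread.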

tallclair (author):

Yeah, this makes sense. It would be nice if we didn't need to worry about reading container memory usage first...

tallclair (author):

I'm guessing this would be a better fit for the CRI runtime layer than runc. @chrishenzie @samuelkarp @mikebrow can any of you chime in for containerd, and whether this approach seems viable?

Would it make sense to implement this in a shared library that cri-o & containerd could both use?

Contributor:

Really, the OCI runtime would be best, as it's closest and can be considered a shared layer. If the OCI runtime doesn't support it, the CRI could do it as a backup, using runtime features to figure it out, maybe?

Member:

Since the CRI API supports UpdateContainerResources (including the memory limit), this seems like an appropriate place for this logic. I'm not sure the OCI runtime is an appropriate place for it; is it typically responsible for performing these kinds of checks before updating containers?

tallclair (author):

Thinking about this more, I'm worried about skew between enabling memory limit decreases, and getting the protection logic into the runtime (either OCI or CRI layer). This leaves 2 paths forward:

  1. work on getting it into the runtime, and only lift the k8s restriction once there is sufficient time for the runtime protections to roll out
  2. proceed with the protections implemented in the kubelet, and in parallel work on adding the runtime-layer protections. Once the runtime mechanisms have been in place for sufficiently long to roll out, we can remove the Kubelet mechanisms.

The restriction on memory limit decreases is overly strict and constrains usage of IPPR, especially for Guaranteed pods. I'm worried that the first approach (waiting for a runtime implementation) would delay this too long, so I propose we go with the second option. I still agree that we ultimately want to implement this in the runtime, but I think we should put the basic check in the Kubelet to unblock this feature.

WDYT?

roycaihw (Member):

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 12, 2025
natasha41575 (Contributor):

/lgtm

usage (via the StatsProvider). If usage is greater than the desired limit, the resize will be
skipped for that container. The pod condition `PodResizeInProgress` will remain, with an `Error`
reason, and a message reporting the current usage & desired limit. This is considered best-effort
since it is still subject to a time-of-check-time-of-use (TOCTOU) race condition where the usage exceeds the limit after the
Member:

Can we incorporate some timeout on this? What if we declare some mechanism to push an app to free up resources (some kind of hungry balloon process inside the Pod cgroup), or some kind of signal to the Pod? Once we have such a mechanism, we will need a delay in applying the resize. So I wonder if we should assume that a mechanism like this will exist in the future and account for the timeout it will introduce.

It would also be easier if we introduce a new CRI call for this, something like PrepareForMemoryDecrease().

Member:

I guess the questions here are:

  1. Should the container runtime get a prepare-for-resize signal?
  2. Once we have a mechanism to tell an app or runtime that we need to downscale, will we be able to introduce the timeout mechanism easily?
  3. Should we investigate techniques like a gradual decrease by a few MB, combined with the checks mentioned above, so we can create pressure on an app without triggering an OOM kill in a single large decrease?

In any case, I believe that introducing some sort of timeout for the memory shrink operation would be very useful.
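For question 3, one way to picture a gradual decrease is a step-down loop that never sets the limit below observed usage; this is purely a sketch of the idea, with hypothetical interfaces rather than anything proposed in the KEP:

```go
package main

import (
	"errors"
	"fmt"
)

// cgroupMemory is a hypothetical stand-in for whatever reads usage and writes
// limits for a container's memory cgroup.
type cgroupMemory interface {
	UsageBytes() (uint64, error)
	SetLimitBytes(limit uint64) error
}

// stepDownLimit lowers the limit toward the target in fixed steps, never
// setting it below current usage, so pressure builds gradually instead of
// risking an OOM kill from a single large decrease.
func stepDownLimit(mem cgroupMemory, current, target, step uint64) error {
	for current > target {
		usage, err := mem.UsageBytes()
		if err != nil {
			return err
		}
		delta := step
		if current-target < delta {
			delta = current - target
		}
		next := current - delta
		if next < usage {
			// Usage has not come down enough; pause here and let a later sync
			// (or a timeout/revert policy) decide what to do next.
			return errors.New("paused: usage still above the next step")
		}
		if err := mem.SetLimitBytes(next); err != nil {
			return err
		}
		current = next
	}
	return nil
}

// fakeMemory is a toy implementation for demonstration only.
type fakeMemory struct{ usage, limit uint64 }

func (f *fakeMemory) UsageBytes() (uint64, error)  { return f.usage, nil }
func (f *fakeMemory) SetLimitBytes(l uint64) error { f.limit = l; return nil }

func main() {
	m := &fakeMemory{usage: 200 << 20, limit: 512 << 20}
	err := stepDownLimit(m, 512<<20, 256<<20, 64<<20)
	fmt.Println(m.limit>>20, "MiB,", err) // 256 MiB, <nil>
}
```

The step size, pause behavior, and any timeout or revert policy would all be design decisions beyond this sketch.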

tallclair (author):

Good questions, @SergeyKanzhelev.

The root of the problem here is that shrinking memory can be a longer running operation, and we need to decide whether the Kubelet will control the operation, or if the runtime will handle it (preferably asynchronously).

IMO, the long-term direction should be that the CRI resize call (UpdateContainerResources) is asynchronous. Currently the Kubelet assumes that if the call succeeds, the resize is complete. The big blocker to making the call asynchronous is that the PLEG needs a signal or indicator that the resize is complete. This is covered by kubernetes/kubernetes#131112, but it needs to work for the generic PLEG too.

Independently of memory limit decrease, setting a timeout on the UpdateContainerResources call seems prudent.
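A minimal sketch of bounding the resize call with a timeout, assuming a hypothetical client interface in place of the real CRI runtime service client:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// resizeClient is a placeholder for whatever issues the CRI
// UpdateContainerResources call on behalf of the Kubelet.
type resizeClient interface {
	UpdateContainerResources(ctx context.Context, containerID string, memoryLimitBytes int64) error
}

// updateWithTimeout bounds the resize call so a slow or stuck shrink operation
// doesn't block the sync loop indefinitely; the 30s value is illustrative.
func updateWithTimeout(c resizeClient, containerID string, memoryLimitBytes int64) error {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	err := c.UpdateContainerResources(ctx, containerID, memoryLimitBytes)
	if errors.Is(err, context.DeadlineExceeded) {
		// The resize didn't complete in time; the Kubelet would keep the
		// PodResizeInProgress condition and retry (or revert) on a later sync.
		return fmt.Errorf("memory limit resize timed out for %s: %w", containerID, err)
	}
	return err
}

// slowClient simulates a runtime that takes a while to apply the update.
type slowClient struct{ delay time.Duration }

func (s slowClient) UpdateContainerResources(ctx context.Context, id string, limit int64) error {
	select {
	case <-time.After(s.delay):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	err := updateWithTimeout(slowClient{delay: 50 * time.Millisecond}, "example-container", 256<<20)
	fmt.Println(err) // <nil>: completed within the deadline
}
```

A timeout alone doesn't make the call asynchronous; it only bounds how long the sync loop waits, which is the piece the comment above calls prudent regardless of how memory limit decreases are ultimately handled.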

Member:

The operation, even if asynchronous, must have some deadline on completion and a way for the kubelet to cancel or revert it. I think a bit more design is needed in the KEP text to explain the model we will pursue. If we are going with what is written, how will we transition to the new model (a new CRI API? changed semantics for the existing APIs?)

Member:

As for the immediate implementation: maybe the kubelet needs to query metrics a few times, to account for the metrics update interval. Or are we force-querying memory?

Labels
cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
kind/kep - Categorizes KEP tracking issues and PRs modifying the KEP directory.
lgtm - "Looks good to me", indicates that a PR is ready to be merged.
sig/node - Categorizes an issue or PR as relevant to SIG Node.
size/S - Denotes a PR that changes 10-29 lines, ignoring generated files.
9 participants