
KEP-1287: Allow memory limit decreases #5368


Open
tallclair wants to merge 1 commit into base: master

Conversation

tallclair (Member):

For the IPPR beta MVP in v1.33, we decided to punt on figuring out how to handle memory limit decreases and to prevent any memory limit decrease for the time being.

For v1.34, we want to lift this restriction, and make a best-effort attempt to prevent oom-kills.

Issue: #1287

/milestone v1.34
/assign @natasha41575 @roycaihw

@k8s-ci-robot k8s-ci-robot added this to the v1.34 milestone Jun 3, 2025
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 3, 2025
@k8s-ci-robot k8s-ci-robot requested review from dchen1107 and mrunalp June 3, 2025 21:52
k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: tallclair
Once this PR has been reviewed and has the lgtm label, please assign dchen1107 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jun 3, 2025
esotsal commented Jun 4, 2025:

/cc

@k8s-ci-robot k8s-ci-robot requested a review from esotsal June 4, 2025 03:09
If the memory resize restart policy is `NotRequired` (or unspecified), the Kubelet will make a
**best-effort** attempt to prevent oom-kills when decreasing memory limits, but doesn't provide any
guarantees. Before decreasing container memory limits, the Kubelet will read the container memory
usage (via the StatsProvider). If usage is greater than the desired limit, the resize will be
Contributor:

If this happens, would it try again in the next sync loop iteration?

Member:

I think so. @tallclair Please correct me if I'm wrong.


Memory limit decreases with `RestartRequired` are still allowed.
_Version skew note:_ Kubernetes v1.33 (and earlier) nodes only check the pod-level memory usage.
Member:

Is this suggesting that pod-level memory limit decrease was already supported in 1.33 for NotRequired policy?

tallclair (author):

The original implementation had this support with the pod-level check. In v1.33 we prevented memory limit decreases via API validation, but didn't put anything in the node to prevent it. So, if we relax the apiserver validation in v1.34, old nodes will be able to decrease the memory limit but without much in the way of protections.

Reviewer:

Would it make sense to move the more complex logic elsewhere? I think it would be better to have memory resizing handled in VPA, using some memory resizing configs (what that means is still to be agreed) and applying them to a workload in the form of options in the VPA Resource Policy.

natasha41575 (Contributor), Jun 10, 2025:

I agree that having more complex logic live in controllers such as VPA makes sense, but VPA still won't be able to decrease memory limits if we don't relax the restriction in the apiserver. We need to make sure that what we do here is reasonable for anyone who may be attempting to decrease memory limits, even outside the context of VPA or other controllers. The proposal in this KEP seems to me like the simplest thing we can possibly do.

Reviewer:

Got it! Thank you for confirming that we will use VPA for more advanced memory resizing use cases.

It would be great to resolve this limitation as part of Kubernetes 1.34, so that memory limit resizing can be used by VPA IPPR (also targeting 1.34).

Member:

> The original implementation had this support with the pod-level check. In v1.33 we prevented memory limit decreases via API validation, but didn't put anything in the node to prevent it. So, if we relax the apiserver validation in v1.34, old nodes will be able to decrease the memory limit but without much in the way of protections.

I see. 1.33 nodes will not perform a container-level memory check before reducing the memory limit, which may result in an OOM kill if the target limit is below the current usage. LGTM, given that the limit reduction operation is best-effort.
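To illustrate the container-level check described in the quoted KEP text, here is a minimal sketch, assuming a hypothetical helper and locally defined stat types rather than the real Kubelet StatsProvider API:

```go
package main

import "fmt"

// containerMemoryStats stands in for the usage sample the Kubelet would read
// from its StatsProvider; the real stats types live in the kubelet stats API.
type containerMemoryStats struct {
	usageBytes uint64
	sampled    bool // whether a usage sample was available at all
}

// canDecreaseMemoryLimit sketches the best-effort check: if current usage is
// above the desired limit, the resize is skipped for that container and the
// PodResizeInProgress condition keeps an Error reason with this message.
func canDecreaseMemoryLimit(stats containerMemoryStats, desiredLimitBytes uint64) (bool, string) {
	if !stats.sampled {
		// No usage sample available; proceed, since the check is only best-effort.
		return true, ""
	}
	if stats.usageBytes > desiredLimitBytes {
		return false, fmt.Sprintf("memory usage (%d bytes) exceeds desired limit (%d bytes)",
			stats.usageBytes, desiredLimitBytes)
	}
	return true, ""
}

func main() {
	ok, msg := canDecreaseMemoryLimit(containerMemoryStats{usageBytes: 300 << 20, sampled: true}, 256<<20)
	fmt.Println(ok, msg) // false: the resize is skipped for this container
}
```

Per the thread above, a skipped resize would be retried on a later sync loop iteration, so the check is advisory rather than a guarantee.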

In the initial beta release of in-place resize, we will **disallow** `PreferNoRestart` memory limit
decreases, enforced through API validation. The intent is for this restriction to be relaxed in the
future, but the design of how limit decreases will be approached is still undecided.
If the memory resize restart policy is `NotRequired` (or unspecified), the Kubelet will make a
esotsal, Jun 6, 2025:

Suggested change:
- If the memory resize restart policy is `NotRequired` (or unspecified), the Kubelet will make a
+ If the memory resize restart policy is specified as `NotRequired` (if not specified, it defaults to `NotRequired`), the Kubelet will make a

decreases, enforced through API validation. The intent is for this restriction to be relaxed in the
future, but the design of how limit decreases will be approached is still undecided.
If the memory resize restart policy is `NotRequired` (or unspecified), the Kubelet will make a
**best-effort** attempt to prevent oom-kills when decreasing memory limits, but doesn't provide any

Suggested change:
- **best-effort** attempt to prevent oom-kills when decreasing memory limits, but doesn't provide any
+ **best-effort** attempt to prevent oom-kills when decreasing memory limits below the container's used memory, but doesn't provide any

@@ -891,7 +894,8 @@ This will be reconsidered post-beta as a future enhancement.

### Future Enhancements

1. Allow memory limits to be decreased, and handle the case where limits are set below usage.
1. Improve memory limit decrease oom-kill prevention by leveraging other kernel mechanisms or using
esotsal, Jun 6, 2025:

Are the current Kubernetes minimum kernel version requirements sufficient for the kernel mechanisms you are referring to? Which kernel mechanisms do you plan to leverage?

tallclair (author):

Still TBD. The initial proposal was to leverage memory.high, but there might be some problems when combining that with memory.max. Further exploration is needed, which is why I left this open-ended.

cc @roycaihw

Member:

We will update the kernel version requirement public doc based on the kernel features we use.

Reviewer:

Thanks for clarifying!

/lgtm

If the memory resize restart policy is `NotRequired` (or unspecified), the Kubelet will make a
**best-effort** attempt to prevent oom-kills when decreasing memory limits, but doesn't provide any
guarantees. Before decreasing container memory limits, the Kubelet will read the container memory
usage (via the StatsProvider). If usage is greater than the desired limit, the resize will be
Contributor:

I think the runtime would be better for this, or even the OCI runtime, as it closes the TOCTOU window. There could be a specified CRI error that could be bubbled up for the kubelet to respond to.

tallclair (author):

I agree, but I'm hesitant to plumb this down into the runtime before we have a better understanding of what the long-term direction is.

Contributor:

what are the alternatives besides reverting the behavior change in the kernel?

tallclair (author):

Maybe leveraging memory.high? @roycaihw any other mechanisms under consideration?

Member:

We can use memory.high and memory.reclaim to bring down the memory usage more safely.

  • memory.reclaim allows us to proactively reclaim memory, to bring the usage down towards our target if needed.
  • memory.high sets a soft limit without OOM killing by doing proactive memory reclaim and even throttling the process, allowing us to take further actions (bringing down memory.max if the process is stable, or reverting memory.high if the process suffers from throttling)
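As a rough illustration of how those two cgroup v2 interfaces could be driven (a sketch only; the cgroup path, byte values, and sequencing are assumptions, not the proposed Kubelet or runtime implementation):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

// reclaimThenCap sketches the sequence described above for a cgroup v2
// container cgroup: proactively reclaim memory via memory.reclaim, then set a
// soft ceiling with memory.high. Lowering the hard limit (memory.max), or
// reverting memory.high if the workload is badly throttled, would come later
// and is the part still under discussion in this thread.
func reclaimThenCap(cgroupDir string, currentUsage, targetBytes uint64) error {
	// memory.reclaim takes an amount of memory (in bytes) to try to reclaim,
	// so request roughly the gap between current usage and the target.
	if currentUsage > targetBytes {
		amount := strconv.FormatUint(currentUsage-targetBytes, 10)
		if err := os.WriteFile(filepath.Join(cgroupDir, "memory.reclaim"), []byte(amount), 0o644); err != nil {
			return fmt.Errorf("memory.reclaim: %w", err)
		}
	}
	// memory.high throttles and reclaims instead of invoking the OOM killer,
	// leaving room to back off before memory.max is ever touched.
	high := strconv.FormatUint(targetBytes, 10)
	if err := os.WriteFile(filepath.Join(cgroupDir, "memory.high"), []byte(high), 0o644); err != nil {
		return fmt.Errorf("memory.high: %w", err)
	}
	return nil
}

func main() {
	// Placeholder cgroup path; on a real node this would be the container's
	// cgroup directory under the kubepods hierarchy.
	if err := reclaimThenCap("/sys/fs/cgroup/kubepods/example-pod/example-ctr", 300<<20, 256<<20); err != nil {
		fmt.Println("resize step failed:", err)
	}
}
```

Whether that sequencing lives in the Kubelet, the CRI runtime, or the OCI runtime is the open question in the rest of this thread.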

tallclair (author):

Yeah, this makes sense. It would be nice if we didn't need to worry about reading container memory usage first...

tallclair (author):

I'm guessing this would be a better fit for the CRI runtime layer than runc. @chrishenzie @samuelkarp @mikebrow can any of you chime in for containerd, and whether this approach seems viable?

Would it make sense to implement this in a shared library that cri-o & containerd could both use?

Contributor:

Really, the OCI runtime would be best, as it's closest and can be considered a shared layer. If the OCI runtime doesn't support it, the CRI could do it as a backup, using runtime features to figure it out, maybe?

Member:

Since the CRI API supports UpdateContainerResources (including the memory limit), this seems like an appropriate place for this logic. I'm not sure the OCI runtime is an appropriate place for it; is it typically responsible for performing these kinds of checks before updating containers?

tallclair (author):

Thinking about this more, I'm worried about skew between enabling memory limit decreases, and getting the protection logic into the runtime (either OCI or CRI layer). This leaves 2 paths forward:

  1. work on getting it into the runtime, and only lift the k8s restriction once there is sufficient time for the runtime protections to roll out
  2. proceed with the protections implemented in the kubelet, and in parallel work on adding the runtime-layer protections. Once the runtime mechanisms have been in place for sufficiently long to roll out, we can remove the Kubelet mechanisms.

The restriction on memory limit decreases is overly strict and constrains usage of IPPR, especially for Guaranteed pods. I'm worried that the first approach (waiting for a runtime implementation) would delay this too long, so I propose we go with the second option. I still agree that we ultimately want to implement this in the runtime, but I think we should put the basic check in the Kubelet to unblock this feature.

WDYT?

roycaihw (Member):

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 12, 2025
natasha41575 (Contributor):

/lgtm

usage (via the StatsProvider). If usage is greater than the desired limit, the resize will be
skipped for that container. The pod condition `PodResizeInProgress` will remain, with an `Error`
reason, and a message reporting the current usage & desired limit. This is considered best-effort
since it is still subject to a time-of-check-time-of-use (TOCTOU) race condition where the usage exceeds the limit after the
Member:

Can we incorporate some timeout on this? What if we declare some mechanism to push an app to free up resources (some kind of hungry balloon process inside the Pod cgroup), or some kind of signal to the Pod? Once we have such a mechanism, we will need a delay in applying the resize. So I wonder if we should assume that a mechanism like this will exist in the future and account for the timeout it will introduce.

It would also be easier if we introduce a new CRI call for this, something like PrepareForMemoryDecrease().

Member:

I guess the questions here are:

  1. Should the container runtime get a prepare-for-resize signal?
  2. Once we have a mechanism to tell an app or runtime that we need to downscale, will we be able to introduce the timeout mechanism easily?
  3. Should we investigate techniques like a gradual decrease by a few MB, combined with the checks mentioned above, so we can create pressure on an app without triggering an OOM kill in a single large decrease?

In any case, I believe that introducing some sort of timeout for the memory shrink operation would be very useful.
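For question 3, one way to picture a gradual decrease is a step-down loop that never sets the limit below observed usage; this is purely a sketch of the idea, with hypothetical interfaces rather than anything proposed in the KEP:

```go
package main

import (
	"errors"
	"fmt"
)

// cgroupMemory is a hypothetical stand-in for whatever reads usage and writes
// limits for a container's memory cgroup.
type cgroupMemory interface {
	UsageBytes() (uint64, error)
	SetLimitBytes(limit uint64) error
}

// stepDownLimit lowers the limit toward the target in fixed steps, never
// setting it below current usage, so pressure builds gradually instead of
// risking an OOM kill from a single large decrease.
func stepDownLimit(mem cgroupMemory, current, target, step uint64) error {
	for current > target {
		usage, err := mem.UsageBytes()
		if err != nil {
			return err
		}
		delta := step
		if current-target < delta {
			delta = current - target
		}
		next := current - delta
		if next < usage {
			// Usage has not come down enough; pause here and let a later sync
			// (or a timeout/revert policy) decide what to do next.
			return errors.New("paused: usage still above the next step")
		}
		if err := mem.SetLimitBytes(next); err != nil {
			return err
		}
		current = next
	}
	return nil
}

// fakeMemory is a toy implementation for demonstration only.
type fakeMemory struct{ usage, limit uint64 }

func (f *fakeMemory) UsageBytes() (uint64, error)  { return f.usage, nil }
func (f *fakeMemory) SetLimitBytes(l uint64) error { f.limit = l; return nil }

func main() {
	m := &fakeMemory{usage: 200 << 20, limit: 512 << 20}
	err := stepDownLimit(m, 512<<20, 256<<20, 64<<20)
	fmt.Println(m.limit>>20, "MiB,", err) // 256 MiB, <nil>
}
```

The step size, pause behavior, and any timeout or revert policy would all be design decisions beyond this sketch.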

tallclair (author):

Good questions, @SergeyKanzhelev.

The root of the problem here is that shrinking memory can be a longer running operation, and we need to decide whether the Kubelet will control the operation, or if the runtime will handle it (preferably asynchronously).

IMO, the long-term direction should be that the CRI resize call (UpdateContainerResources) is asynchronous. Currently the Kubelet assumes that if the call succeeds, the resize is complete. The big blocker to making the call asynchronous is that the PLEG needs a signal or indicator that the resize is complete. This is covered by kubernetes/kubernetes#131112, but it needs to work for the generic PLEG too.

Independently of memory limit decrease, setting a timeout on the UpdateContainerResources call seems prudent.
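A minimal sketch of bounding the resize call with a timeout, assuming a hypothetical client interface in place of the real CRI runtime service client:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// resizeClient is a placeholder for whatever issues the CRI
// UpdateContainerResources call on behalf of the Kubelet.
type resizeClient interface {
	UpdateContainerResources(ctx context.Context, containerID string, memoryLimitBytes int64) error
}

// updateWithTimeout bounds the resize call so a slow or stuck shrink operation
// doesn't block the sync loop indefinitely; the 30s value is illustrative.
func updateWithTimeout(c resizeClient, containerID string, memoryLimitBytes int64) error {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	err := c.UpdateContainerResources(ctx, containerID, memoryLimitBytes)
	if errors.Is(err, context.DeadlineExceeded) {
		// The resize didn't complete in time; the Kubelet would keep the
		// PodResizeInProgress condition and retry (or revert) on a later sync.
		return fmt.Errorf("memory limit resize timed out for %s: %w", containerID, err)
	}
	return err
}

// slowClient simulates a runtime that takes a while to apply the update.
type slowClient struct{ delay time.Duration }

func (s slowClient) UpdateContainerResources(ctx context.Context, id string, limit int64) error {
	select {
	case <-time.After(s.delay):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	err := updateWithTimeout(slowClient{delay: 50 * time.Millisecond}, "example-container", 256<<20)
	fmt.Println(err) // <nil>: completed within the deadline
}
```

A timeout alone doesn't make the call asynchronous; it only bounds how long the sync loop waits, which is the piece the comment above calls prudent regardless of how memory limit decreases are ultimately handled.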

Member:

The operation, even if asynchronous, must have some deadline on completion and a way for the kubelet to cancel or revert it. I think a bit more design is needed in the KEP text to explain the model we will pursue. If we are going with what is written, how will we transition to the new model (a new CRI API? changed semantics for the existing APIs?)

Member:

As for the immediate implementation: maybe the kubelet needs to query metrics a few times, to account for the metrics update interval. Or are we force-querying memory?

Labels
cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
kind/kep - Categorizes KEP tracking issues and PRs modifying the KEP directory.
lgtm - "Looks good to me", indicates that a PR is ready to be merged.
sig/node - Categorizes an issue or PR as relevant to SIG Node.
size/S - Denotes a PR that changes 10-29 lines, ignoring generated files.
9 participants