KEP-1287: Allow memory limit decreases #5368
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: tallclair. The full list of commands accepted by this bot can be found here.
If the memory resize restart policy is `NotRequired` (or unspecified), the Kubelet will make a
**best-effort** attempt to prevent oom-kills when decreasing memory limits, but doesn't provide any
guarantees. Before decreasing container memory limits, the Kubelet will read the container memory
usage (via the StatsProvider). If usage is greater than the desired limit, the resize will be
If this happens, would it try again in the next sync loop iteration?
I think so. @tallclair Please correct me if I'm wrong.
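To make the mechanism being discussed concrete, here is a minimal Go sketch of the best-effort check described in the quoted KEP text, including the retry-on-next-sync behavior mentioned above. The function and helper names are hypothetical, not the actual Kubelet implementation.

```go
// Hypothetical sketch only: not the real Kubelet code. It mirrors the KEP text
// above: read current memory usage, and skip the limit decrease if usage is
// already above the desired limit, deferring to a later sync loop iteration.
package main

import "fmt"

// memoryUsageFunc stands in for a StatsProvider lookup of current container memory usage (bytes).
type memoryUsageFunc func(containerID string) (uint64, error)

// canDecreaseMemoryLimit reports whether it is safe (best-effort) to lower the
// container's memory limit to desiredLimitBytes right now.
func canDecreaseMemoryLimit(getUsage memoryUsageFunc, containerID string, desiredLimitBytes uint64) (bool, error) {
	usage, err := getUsage(containerID)
	if err != nil {
		// Conservatively skip the resize if usage can't be read; the exact
		// policy here is a design decision the KEP text does not pin down.
		return false, fmt.Errorf("reading memory usage for %s: %w", containerID, err)
	}
	// Usage above the desired limit would likely trigger an immediate oom-kill,
	// so the resize is skipped for this container and retried on a later sync.
	return usage <= desiredLimitBytes, nil
}

func main() {
	getUsage := func(string) (uint64, error) { return 900 << 20, nil } // pretend 900Mi in use
	ok, err := canDecreaseMemoryLimit(getUsage, "ctr-1", 512<<20)      // desired limit: 512Mi
	fmt.Println(ok, err)                                               // false <nil> -> resize deferred
}
```

In this sketch a skipped resize simply returns false, leaving the desired limit pending, which is consistent with the answer above that the Kubelet retries on the next sync loop iteration.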
Memory limit decreases with `RestartRequired` are still allowed.
_Version skew note:_ Kubernetes v1.33 (and earlier) nodes only check the pod-level memory usage. |
Is this suggesting that pod-level memory limit decrease was already supported in 1.33 for the `NotRequired` policy?
The original implementation had this support with the pod-level check. In v1.33 we prevented memory limit decreases via API validation, but didn't put anything in the node to prevent it. So, if we relax the apiserver validation in v1.34, old nodes will be able to decrease the memory limit but without much in the way of protections.
Would it make sense to move more complex logic elsewhere? I think it would be better to have memory resize handling live in VPA, using some memory resizing configs (exactly what that means is still to be agreed) and applying them to a workload in the form of options in the VPA Resource Policy.
I agree that having more complex logic live in controllers such as VPA makes sense, but VPA still won't be able to decrease memory limits if we don't relax the restriction in the apiserver. We need to make sure that what we do here is reasonable for anyone who may be attempting to decrease memory limits, even outside the context of VPA or other controllers. The proposal in this KEP seems to me like the simplest thing we can possibly do.
Got it! Thank you for confirming that we will use VPA for more advanced memory resizing use cases.
It would be great to solve this limitation as part of k8s 1.34, so that memory limit resizing is available for VPA IPPR (also targeting k8s 1.34).
> The original implementation had this support with the pod-level check. In v1.33 we prevented memory limit decreases via API validation, but didn't put anything in the node to prevent it. So, if we relax the apiserver validation in v1.34, old nodes will be able to decrease the memory limit but without much in the way of protections.
I see. 1.33 nodes will not perform the container-level memory check before reducing the memory limit, which may result in an OOM kill if the target limit is below the current usage. LGTM given the limit reduction operation is best-effort.
In the initial beta release of in-place resize, we will **disallow** `PreferNoRestart` memory limit
decreases, enforced through API validation. The intent is for this restriction to be relaxed in the
future, but the design of how limit decreases will be approached is still undecided.
If the memory resize restart policy is `NotRequired` (or unspecified), the Kubelet will make a |
Suggested change:
- If the memory resize restart policy is `NotRequired` (or unspecified), the Kubelet will make a
+ If the memory resize restart policy is specified as `NotRequired` (if not specified, it defaults to `NotRequired`), the Kubelet will make a
decreases, enforced through API validation. The intent is for this restriction to be relaxed in the
future, but the design of how limit decreases will be approached is still undecided.
If the memory resize restart policy is `NotRequired` (or unspecified), the Kubelet will make a
**best-effort** attempt to prevent oom-kills when decreasing memory limits, but doesn't provide any |
Suggested change:
- **best-effort** attempt to prevent oom-kills when decreasing memory limits, but doesn't provide any
+ **best-effort** attempt to prevent oom-kills when decreasing memory limits below container used memory, but doesn't provide any
@@ -891,7 +894,8 @@ This will be reconsidered post-beta as a future enhancement.

### Future Enhancements

1. Allow memory limits to be decreased, and handle the case where limits are set below usage.
1. Improve memory limit decrease oom-kill prevention by leveraging other kernel mechanisms or using |
Are the current Kubernetes minimum kernel version requirements sufficient for the kernel mechanisms you are referring to? Which kernel mechanisms do you plan to leverage?
Still TBD. The initial proposal was to leverage `memory.high`, but there might be some problems when combining that with `memory.max`. Further exploration is needed, which is why I left this open-ended.
cc @roycaihw
We will update the kernel version requirement public doc based on the kernel features we use.
Thanks for clarifying!
/lgtm
If the memory resize restart policy is `NotRequired` (or unspecified), the Kubelet will make a
**best-effort** attempt to prevent oom-kills when decreasing memory limits, but doesn't provide any
guarantees. Before decreasing container memory limits, the Kubelet will read the container memory
usage (via the StatsProvider). If usage is greater than the desired limit, the resize will be |
I think the runtime would be better for this, or even the OCI runtime, as it closes the TOCTOU window. There could be a specified CRI error that could be bubbled up and that the kubelet could respond to.
I agree, but I'm hesitant to plumb this down into the runtime before we have a better understanding of what the long-term direction is.
what are the alternatives besides reverting the behavior change in the kernel?
Maybe leveraging `memory.high`? @roycaihw any other mechanisms under consideration?
We can use `memory.high` and `memory.reclaim` to bring down the memory usage more safely.
- `memory.reclaim` allows us to proactively reclaim memory, to bring the usage down towards our target if needed.
- `memory.high` sets a soft limit without OOM killing, by doing proactive memory reclaim and even throttling the process, allowing us to take further actions (bringing down `memory.max` if the process is stable, or reverting `memory.high` if the process suffers from throttling).
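As a rough illustration of the cgroup v2 knobs mentioned above (the cgroup path and byte values below are invented, and this is not how the Kubelet or runtime actually implements it), the sequence could look like:

```go
// Illustrative only: drives the cgroup v2 knobs discussed above by writing to
// a container cgroup's memory.reclaim, memory.high, and memory.max files.
// Requires cgroup v2 and sufficient privileges; the path is hypothetical.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func writeCgroupFile(cgroupDir, file, value string) error {
	return os.WriteFile(filepath.Join(cgroupDir, file), []byte(value), 0o644)
}

func main() {
	cgroupDir := "/sys/fs/cgroup/kubepods/pod1234/ctr-1" // hypothetical path

	// Step 1: proactively reclaim ~256Mi to nudge usage down toward the target
	// without oom-killing anything.
	if err := writeCgroupFile(cgroupDir, "memory.reclaim", fmt.Sprintf("%d", 256<<20)); err != nil {
		fmt.Println("reclaim failed:", err)
	}

	// Step 2: set a soft ceiling; exceeding memory.high causes reclaim and
	// throttling rather than an oom-kill, so the workload can be observed.
	if err := writeCgroupFile(cgroupDir, "memory.high", fmt.Sprintf("%d", 512<<20)); err != nil {
		fmt.Println("setting memory.high failed:", err)
	}

	// Step 3: only if usage stays stable below the target, lower the hard limit;
	// otherwise revert memory.high (write "max") if the workload is throttled.
	if err := writeCgroupFile(cgroupDir, "memory.max", fmt.Sprintf("%d", 512<<20)); err != nil {
		fmt.Println("setting memory.max failed:", err)
	}
}
```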
Yeah, this makes sense. It would be nice if we didn't need to worry about reading container memory usage first...
I'm guessing this would be a better fit for the CRI runtime layer than runc. @chrishenzie @samuelkarp @mikebrow can any of you chime in for containerd, and whether this approach seems viable?
Would it make sense to implement this in a shared library that cri-o & containerd could both use?
Really, the OCI runtime would be best since it's closest to the container and can be considered a shared layer. If the OCI runtime doesn't support it, the CRI runtime could do it as a backup, using runtime features to figure out support, maybe?
Since the CRI API supports `UpdateContainerResources` (including the memory limit), this seems like an appropriate place for this logic. I'm not sure if the OCI runtime is an appropriate place for this logic. Is it typically responsible for performing these kinds of checks before updating containers?
Thinking about this more, I'm worried about skew between enabling memory limit decreases, and getting the protection logic into the runtime (either OCI or CRI layer). This leaves 2 paths forward:
- work on getting it into the runtime, and only lift the k8s restriction once there is sufficient time for the runtime protections to roll out
- proceed with the protections implemented in the kubelet, and in parallel work on adding the runtime-layer protections. Once the runtime mechanisms have been in place for sufficiently long to roll out, we can remove the Kubelet mechanisms.
The restriction on memory limit decrease is an overly strict limitation that constrains usage of IPPR, especially for guaranteed pods. I'm worried that the first approach (wait for runtime implementation) will delay this too long, so I propose we go with the second option. I still agree that we ultimately want to implement this in the runtime, but I think we should put the basic check in the Kubelet to unblock this feature.
WDYT?
Force-pushed from 17cf87e to c7b9304.
/lgtm
/lgtm
usage (via the StatsProvider). If usage is greater than the desired limit, the resize will be
skipped for that container. The pod condition `PodResizeInProgress` will remain, with an `Error`
reason, and a message reporting the current usage & desired limit. This is considered best-effort
since it is still subject to a time-of-check-time-of-use (TOCTOU) race condition where the usage exceeds the limit after the |
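For illustration, a deferred decrease surfacing through the condition described above might look roughly like the following (the message text and values are invented; only the condition type and `Error` reason come from the quoted KEP text):

```go
// Invented example values; only the condition type and reason follow the KEP text.
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// What a client might observe on the pod status while the Kubelet is
	// holding off on a memory limit decrease because usage exceeds the target.
	cond := v1.PodCondition{
		Type:               v1.PodConditionType("PodResizeInProgress"),
		Status:             v1.ConditionTrue,
		Reason:             "Error",
		Message:            "memory limit decrease deferred: usage 900Mi exceeds desired limit 512Mi",
		LastTransitionTime: metav1.Now(),
	}
	fmt.Printf("%s=%s (%s): %s\n", cond.Type, cond.Status, cond.Reason, cond.Message)
}
```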
Can we incorporate some timeout on this? What if we define some mechanism to push an app to free up resources (some kind of hungry balloon process inside the Pod cgroup), or some kind of signal to the Pod? Once we have such a mechanism, we will need a delay in applying the resize. So I wonder if we should assume that a mechanism like this will exist in the future, and account now for the timeout it would introduce.
It would also be easier if we introduce a new CRI call for this, something like `PrepareForMemoryDecrease()`.
I guess the questions here are:
- Should the container runtime get a prepare-for-resize signal?
- Once we have a mechanism to tell an app or runtime that we need to downscale, will we be able to introduce the timeout mechanism easily?
- Should we investigate techniques like gradually decreasing by a few MB, with the above-mentioned checks, so we can create pressure on an app without causing an OOM-kill in a single memory decrease?
In any case, I believe that introducing some sort of timeout for the memory shrink operation will be very useful.
Good questions, @SergeyKanzhelev.
The root of the problem here is that shrinking memory can be a longer running operation, and we need to decide whether the Kubelet will control the operation, or if the runtime will handle it (preferably asynchronously).
IMO, the long-term direction of this should be that the CRI resize call (`UpdateContainerResources`) is asynchronous. Currently the Kubelet assumes that if the call succeeds, then the resize is complete. The big blocker to making the call asynchronous is that the PLEG needs a signal or indicator that the resize is complete. This is covered by kubernetes/kubernetes#131112, but needs to work for the generic PLEG too.
Independently of memory limit decrease, setting a timeout on the `UpdateContainerResources` call seems prudent.
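A minimal sketch of that idea, using a stand-in interface rather than the real CRI client (the names and durations are illustrative):

```go
// Sketch of bounding the resize call with a deadline. The resizer interface
// below is a stand-in for the real CRI client, not its actual API.
package main

import (
	"context"
	"fmt"
	"time"
)

// resizer abstracts whatever component performs the container resource update.
type resizer interface {
	UpdateContainerResources(ctx context.Context, containerID string) error
}

// updateWithTimeout bounds the resize call with a deadline so a slow or stuck
// runtime operation surfaces as an error instead of blocking the caller.
func updateWithTimeout(rs resizer, containerID string, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	if err := rs.UpdateContainerResources(ctx, containerID); err != nil {
		return fmt.Errorf("resize of %s did not complete within %s: %w", containerID, timeout, err)
	}
	return nil
}

// fakeRuntime simulates a runtime that takes 5s to apply the update.
type fakeRuntime struct{}

func (fakeRuntime) UpdateContainerResources(ctx context.Context, _ string) error {
	select {
	case <-time.After(5 * time.Second):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	err := updateWithTimeout(fakeRuntime{}, "ctr-1", 2*time.Second)
	fmt.Println(err) // fails after ~2s with a context deadline exceeded error
}
```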
The operation, even if asynchronous, must have some deadline on completion and a way for the kubelet to cancel/revert it. I would think a bit more design is needed in the KEP text to explain the model we will pursue. If we are going with what is written, how will we transition to the new model (a new CRI API? changed semantics of existing APIs?)
As for the immediate implementation: maybe the kubelet needs to query metrics a few times, to account for the metrics update interval. Or are we force-querying memory?
For the IPPR beta MVP in v1.33, we decided to punt on figuring out how to handle memory limit decreases, and to prevent any memory limit decrease.
For v1.34, we want to lift this restriction, and make a best-effort attempt to prevent oom-kills.
Issue: #1287
/milestone v1.34
/assign @natasha41575 @roycaihw