fix: accurately track allocatable resources for nodes #1420
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: BEvgeniyS. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.

Welcome @BEvgeniyS!

Hi @BEvgeniyS. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Pull Request Test Coverage Report for Build 11386834169
💛 - Coveralls
```go
// Update cached allocatables
cacheMapKey := fmt.Sprintf(
	"allocatableCache;%s;%s",
	nodeClaim.Labels[v1.NodePoolLabelKey],
```
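For context, a minimal sketch of what the write side of this cache could look like, assuming the second key component is the instance-type label and a go-cache-style store (both are assumptions; `UpdateAllocatable` is a hypothetical helper, not the PR's actual code):

```go
package sharedcache

import (
	"fmt"

	cache "github.com/patrickmn/go-cache"
	corev1 "k8s.io/api/core/v1"
)

// UpdateAllocatable records the allocatable capacity observed on a freshly
// registered node under an "allocatableCache;<nodepool>;<instance-type>" key,
// so later scheduling simulations can reuse the real value instead of the
// VM_MEMORY_OVERHEAD_PERCENT estimate.
func UpdateAllocatable(c *cache.Cache, nodePool string, node *corev1.Node) {
	key := fmt.Sprintf(
		"allocatableCache;%s;%s",
		nodePool,
		node.Labels[corev1.LabelInstanceTypeStable], // assumed second key component
	)
	c.SetDefault(key, node.Status.Allocatable)
}
```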
What happens when `nodepool.spec.template.nodeClassRef` changes?
Node hash recalculation triggers, and the cache is flushed for that nodepool.
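(For illustration, a hedged sketch of what that per-nodepool flush could look like with a go-cache-style store; the prefix scan and the `flushNodePool` helper are assumptions, not the PR's actual code:)

```go
package sharedcache

import (
	"fmt"
	"strings"

	cache "github.com/patrickmn/go-cache"
)

// flushNodePool drops every cached allocatable entry for a nodepool whose
// hash changed. go-cache has no prefix delete, so we scan Items().
func flushNodePool(c *cache.Cache, nodePool string) {
	prefix := fmt.Sprintf("allocatableCache;%s;", nodePool)
	for key := range c.Items() {
		if strings.HasPrefix(key, prefix) {
			c.Delete(key)
		}
	}
}
```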
Now that the kubelet settings have moved to the nodeclass, this requires an update.
Based on this, it seems that a change in the nodeclass spec is not supposed to trigger the hash recalculation, to avoid unnecessary drift.
In that case, the cache will be updated the first time a node of this nodepool/instance-type combination comes up with updated allocatables.
pkg/utils/sharedcache/sharedcache.go (Outdated)

```go
limitations under the License.
*/

package sharedcache
```
Why not just inject this into the controller at initialization time like we do with our other caches?
Something like this?
I tried it at first, but the amount of changes seemed like too much. What do you think is better?
I don't like the number of layers we have to pass the thing through, and all the changes to all the tests, etc., which make this change much bigger than necessary.
It's how we do things in the codebase today: we wire dependencies around explicitly rather than using global singletons or a DI framework.
Usually you can just instantiate it as part of the parent class's constructor. If it's used in multiple places, we wire it up as needed.
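A minimal sketch of that wiring style, with illustrative names (the `Controller` shape below is an assumption, not an actual karpenter type):

```go
package main

import (
	"time"

	cache "github.com/patrickmn/go-cache"
)

// Controller receives the shared allocatable cache through its constructor
// instead of reaching for a package-level singleton.
type Controller struct {
	allocatableCache *cache.Cache
}

func NewController(allocatableCache *cache.Cache) *Controller {
	return &Controller{allocatableCache: allocatableCache}
}

func main() {
	// Wire the dependency once, at initialization time, and hand the same
	// instance to every consumer that needs it.
	shared := cache.New(10*time.Minute, time.Minute)
	_ = NewController(shared)
}
```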
Understood. I definitely want to follow existing practices.
Can you review https://github.com/BEvgeniyS/karpenter/pull/2/files then? I'll just push the changes to this branch if it's correct
Updated this PR to use injection.
```diff
@@ -251,10 +252,22 @@ func filterInstanceTypesByRequirements(instanceTypes []*cloudprovider.InstanceTy
 		fitsAndOffering: false,
 	}

-	for _, it := range instanceTypes {
+	for _, it := range n.InstanceTypeOptions {
```
I wonder if we should collapse this code with https://github.com/kubernetes-sigs/karpenter/pull/1379/files, which aims to solve this type of problem
I think this might lead to referencing the instanceType provider from more packages than it really needs to be referenced from...
```diff
@@ -239,7 +240,7 @@ func (r filterResults) FailureReason() string {
 }

 //nolint:gocyclo
-func filterInstanceTypesByRequirements(instanceTypes []*cloudprovider.InstanceType, requirements scheduling.Requirements, requests v1.ResourceList) filterResults {
+func (n *NodeClaim) filterInstanceTypesByRequirements(requirements scheduling.Requirements, requests v1.ResourceList) filterResults {
```
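To make the effect of the signature change concrete, a hedged call-site sketch (variable names are illustrative, not an exact excerpt from the PR):

```go
// Before: the instance-type slice was threaded through every caller explicitly.
results := filterInstanceTypesByRequirements(n.InstanceTypeOptions, requirements, requests)

// After: the receiver owns its InstanceTypeOptions, so callers pass less state.
results = n.filterInstanceTypesByRequirements(requirements, requests)
```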
🚀
🌔
Thanks @BEvgeniyS, very cool!
Seems like the Kwok provider doesn't have this issue (and it's not obvious how one could replicate it there), so this could be a cloud-provider-specific issue.
Updated the PR with dependency injection. @ellistarn @tallaxes @jackfrancis
Force-pushed from 0c5e4f2 to 00915c6.
This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.
@ellistarn Can we remove the stale label from this PR?
Since we've been running this as our fork in prod, we have discovered that allocatable memory and ephemeral storage may fluctuate due to kernel reservations (confirmed by comparing dmesg outputs; the difference is single-digit Ki between instances). That has caused unnecessary cache updates.
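One hedged way to absorb that jitter would be to overwrite a cached entry only when the newly observed allocatable differs by more than a small tolerance. The helper below is a sketch of that idea, not the fork's actual code; a call like `shouldUpdate(cached, node.Status.Allocatable, resource.MustParse("1Mi"))` would then gate the cache write, with the threshold being a tuning choice:

```go
package sharedcache

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// shouldUpdate reports whether the observed allocatable differs from the
// cached value by more than tolerance for any resource, so single-digit-Ki
// kernel-reservation jitter does not cause needless cache churn.
func shouldUpdate(cached, observed corev1.ResourceList, tolerance resource.Quantity) bool {
	for name, obs := range observed {
		old, ok := cached[name]
		if !ok {
			return true // a resource we have never seen before
		}
		diff := obs.DeepCopy()
		diff.Sub(old)
		if diff.Sign() < 0 {
			diff.Neg() // take the absolute difference
		}
		if diff.Cmp(tolerance) > 0 {
			return true
		}
	}
	return false
}
```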
This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.
Closing in favor of aws/karpenter-provider-aws#7004. Thanks @jmdeal!
Fixes aws/karpenter-provider-aws#5161
Description
The current method of assuming allocatable memory by simply discarding a percentage of usable memory via the `VM_MEMORY_OVERHEAD_PERCENT` global variable is suboptimal: there is no single value that avoids both over- and underestimating allocatable memory. Cluster-autoscaler addresses this issue by learning the true allocatable memory from actual nodes and retaining that information. In this pull request, I'm applying the same concept.
To demonstrate the issue, set `VM_MEMORY_OVERHEAD_PERCENT` to 0.

Observed behaviors:
- Resolving resource overestimation
- Addressing resource underestimation
- Avoiding extra churn
The above improvements are implemented using a shared cache that can be accessed from the packages involved; `itFits` decisions are overridden with values from the cache, if available. I tried to avoid introducing a global-like package, but placing the cache in any of the above packages (or others) introduces more coupling between those packages. If there is a definitive place for such a cache, please let me know.
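For reference, a rough sketch of the read side under the same assumptions as the write-side sketch above (go-cache-style store, key format as in the diff; `allocatableFor` is a hypothetical helper):

```go
package scheduling

import (
	cache "github.com/patrickmn/go-cache"
	corev1 "k8s.io/api/core/v1"
)

// allocatableFor returns the allocatable capacity previously observed for a
// nodepool/instance-type key, falling back to the estimate derived from
// VM_MEMORY_OVERHEAD_PERCENT when no node of that shape has registered yet.
// itFits-style checks can then run against the cached, real value.
func allocatableFor(c *cache.Cache, key string, estimated corev1.ResourceList) corev1.ResourceList {
	if cached, ok := c.Get(key); ok {
		return cached.(corev1.ResourceList)
	}
	return estimated
}
```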
How was this change tested?

For overestimation: I ran this in one of our preprod EKS clusters with `vmMemoryOverheadPercent=0`, and it correctly stops re-launching the nodes of a given nodepool/instance-type combination after the first attempt fails. It also uses the correct allocatable memory for scheduling.

For underestimation: the test was to set a high `VM_MEMORY_OVERHEAD_PERCENT` value (like 0.2).

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.