
fix: accurately track allocatable resources for nodes #1420

Closed
wants to merge 26 commits

Conversation

BEvgeniyS

Fixes aws/karpenter-provider-aws#5161

Description
The current method of estimating allocatable memory, which simply discards a percentage of usable memory via the VM_MEMORY_OVERHEAD_PERCENT global variable, is suboptimal: no single value avoids both overestimating and underestimating allocatable memory.

Cluster-autoscaler addresses this issue by learning about the true allocatable memory from actual nodes and retaining that information. In this pull request, I'm applying the same concept.

To demonstrate the issue:

  1. Set VM_MEMORY_OVERHEAD_PERCENT to 0
  2. Create a nodepool with a single instance type:
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: approaching-allocatable-nodepool-0
spec:
  limits:
    cpu: "18"
    memory: 36Gi
  template:
    metadata:
      labels:
        approaching-allocatable: nodepool-0
    spec:
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: approaching-allocatable-nodeclass-0
      requirements:
      - key: node.kubernetes.io/instance-type
        operator: In
        values:
        - t4g.medium
      taints:
      - effect: NoExecute
        key:  approaching-allocatable
        value: "nodepool-0"
      kubelet:
        systemReserved:
          memory: "1Ki"
        kubeReserved:
          memory: "1Ki"
        evictionHard:
          memory.available: "1Ki"
  3. Create a workload with a request close to the node's allocatable:
apiVersion: v1
kind: Pod
metadata:
  name: approaching-allocatable-pod
  namespace: default
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: approaching-allocatable
            operator: In
            values:
            - nodepool-0
  containers:
  - image: public.ecr.aws/eks-distro/kubernetes/pause@sha256:c2518f6d82392ba799d551398805aaa7af70548015263d962afe9710c0eaa1b2
    name: trigger-pod
    resources:
      requests:
        cpu: 10m
        memory: 3686Mi
  tolerations:
  - effect: NoExecute
    key: approaching-allocatable
    operator: Equal
    value: nodepool-0

Observed behaviors

  1. Resolving Resource Overestimation:

    • v0.37.0 behavior: Karpenter continuously creates and consolidates nodes without realizing the impossibility of fitting the workload.
    • Patched behavior: Accurately tracks actual allocatable resources, preventing the endless loop of node creation and consolidation.
  2. Addressing Resource Underestimation:

    • v0.37.0 behavior: Karpenter leaves pods pending indefinitely or chooses a larger instance type than necessary, failing to learn from actual node allocatables even when nodes are launched for other reasons.
    • Patched behavior: Remembers the true allocatable resources once a node has been launched, enabling correct node launches for previously pending pods.
  3. Avoiding Extra Churn:

    • v0.37.0 behavior: Incorrectly predicted allocatable resources during consolidation lead to unnecessary churn.
    • Patched behavior: Scheduling simulations benefit from knowledge of the true allocatable resources.

The above improvements are implemented using a shared cache that can be accessed from:

  • lifecycle package: to populate the cache as soon as a node is registered.
  • scheduling package: to use the real allocatable resources from the cache, if available, when making itFits decisions.
  • hash package: to flush the cache for a nodepool after an update.

I tried to avoid introducing a global-like package, but placing the cache in any of the above packages (or others) introduces more coupling between those packages. If there is a definitive place for such a cache, please let me know.
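To make the idea concrete, here is a minimal sketch of such a shared cache, assuming hypothetical names (AllocatableCache, Set/Get/Flush). It illustrates the approach, not the exact code in this PR:

package sharedcache

import (
	"fmt"
	"strings"
	"sync"

	v1 "k8s.io/api/core/v1"
)

// AllocatableCache maps a nodepool/instance-type pair to the allocatable
// resources observed on a real node. Names here are illustrative.
type AllocatableCache struct {
	mu    sync.RWMutex
	items map[string]v1.ResourceList
}

func NewAllocatableCache() *AllocatableCache {
	return &AllocatableCache{items: map[string]v1.ResourceList{}}
}

func key(nodePool, instanceType string) string {
	return fmt.Sprintf("allocatableCache;%s;%s", nodePool, instanceType)
}

// Set records the allocatable observed at node registration (lifecycle package).
func (c *AllocatableCache) Set(nodePool, instanceType string, allocatable v1.ResourceList) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.items[key(nodePool, instanceType)] = allocatable
}

// Get returns the cached allocatable, if any, for itFits decisions (scheduling
// package); callers fall back to the VM_MEMORY_OVERHEAD_PERCENT estimate otherwise.
func (c *AllocatableCache) Get(nodePool, instanceType string) (v1.ResourceList, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	allocatable, ok := c.items[key(nodePool, instanceType)]
	return allocatable, ok
}

// Flush drops all entries for a nodepool after its spec changes (hash package).
func (c *AllocatableCache) Flush(nodePool string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	prefix := fmt.Sprintf("allocatableCache;%s;", nodePool)
	for k := range c.items {
		if strings.HasPrefix(k, prefix) {
			delete(c.items, k)
		}
	}
}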

How was this change tested?
For overestimation:
I ran this in one of our preprod EKS clusters with vmMemoryOverheadPercent=0, and it correctly stops re-launching nodes of a given nodepool/instance-type combination after the first attempt fails. It also uses the correct allocatable memory for scheduling.

For underestimation:
The test was to:

  1. Set a high VM_MEMORY_OVERHEAD_PERCENT value (e.g. 0.2)
  2. Run the workload that previously fit and observe that it stays pending
  3. Add another workload for the same nodepool, but with a lower request; this launches a real node
  4. Another node then launches for the pod from step 2, and new pods with the same requests now correctly cause new nodes to be launched

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: BEvgeniyS
Once this PR has been reviewed and has the lgtm label, please assign tzneal for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 15, 2024
@k8s-ci-robot
Contributor

Welcome @BEvgeniyS!

It looks like this is your first PR to kubernetes-sigs/karpenter 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/karpenter has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jul 15, 2024
@k8s-ci-robot
Contributor

Hi @BEvgeniyS. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jul 15, 2024
@coveralls

coveralls commented Jul 15, 2024

Pull Request Test Coverage Report for Build 11386834169

Details

  • 43 of 52 (82.69%) changed or added relevant lines in 8 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.01%) to 80.925%

Changes Missing Coverage                      Covered Lines  Changed/Added Lines  %
pkg/controllers/nodepool/hash/controller.go   4              8                    50.0%
pkg/controllers/controllers.go                0              5                    0.0%

Totals Coverage Status
Change from base Build 11332670114: 0.01%
Covered Lines: 8523
Relevant Lines: 10532

💛 - Coveralls

@BEvgeniyS BEvgeniyS changed the title Discover allocatable fix: accurately track allocatable resources for nodes Jul 15, 2024
// Update cached allocatables
cacheMapKey := fmt.Sprintf(
"allocatableCache;%s;%s",
nodeClaim.Labels[v1.NodePoolLabelKey],
Contributor


What happens when nodepool.spec.template.nodeClassRef changes?

Author


Now that the kubelet settings have moved to the nodeclass, this requires an update

Author


Based on this, it seems that a change in the nodeclass spec is not supposed to trigger hash recalculation, to avoid unnecessary drift.

In this case, the cache will be updated the first time a node of this nodepool/instance-type combination comes up with updated allocatables

limitations under the License.
*/

package sharedcache
Contributor


Why not just inject this into the controller at initialization time like we do with our other caches?

Author

@BEvgeniyS BEvgeniyS Jul 17, 2024


Something like this?

I tried that at first, but the amount of changes seemed excessive. Which do you think is better?

I don't like the number of layers the cache has to be passed through, and all the resulting changes to the tests, which make this change much bigger than necessary.

Contributor


It's how we do things in the codebase today: we wire dependencies around explicitly, rather than using global singletons or a DI framework.

Usually you can just instantiate it as part of the parent class's constructor. If used in multiple places, we wire it up as needed.
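As a rough illustration of that wiring style (the type and constructor names below are hypothetical and simplified, not the actual Karpenter signatures), the shared cache would be instantiated once at startup and passed in through constructors:

package controllers

import (
	v1 "k8s.io/api/core/v1"
)

// AllocatableCache is a hypothetical interface for the shared cache sketched
// earlier; controllers depend on the interface rather than a global variable.
type AllocatableCache interface {
	Set(nodePool, instanceType string, allocatable v1.ResourceList)
	Get(nodePool, instanceType string) (v1.ResourceList, bool)
	Flush(nodePool string)
}

// Lifecycle records observed allocatables when nodes register.
type Lifecycle struct {
	cache AllocatableCache
}

// NewLifecycle demonstrates constructor injection: the dependency is passed in
// explicitly and stored on the struct.
func NewLifecycle(cache AllocatableCache) *Lifecycle {
	return &Lifecycle{cache: cache}
}

// Hash flushes the cache when a nodepool's spec changes.
type Hash struct {
	cache AllocatableCache
}

func NewHash(cache AllocatableCache) *Hash {
	return &Hash{cache: cache}
}

// At startup, a single cache instance would be created and handed to every
// controller that needs it, e.g.:
//   cache := sharedcache.NewAllocatableCache()
//   lifecycle := NewLifecycle(cache)
//   hash := NewHash(cache)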

Author


Understood. I definitely want to follow existing practices.

Can you review https://github.com/BEvgeniyS/karpenter/pull/2/files then? I'll push the changes to this branch if it looks correct.

Author


Updated this PR to use injection.

@@ -251,10 +252,22 @@ func filterInstanceTypesByRequirements(instanceTypes []*cloudprovider.InstanceTy
fitsAndOffering: false,
}

for _, it := range instanceTypes {
for _, it := range n.InstanceTypeOptions {
Contributor


I wonder if we should collapse this code with https://github.com/kubernetes-sigs/karpenter/pull/1379/files, which aims to solve this type of problem

Author


I think this might lead to the instanceType provider being referenced from more packages than it really needs to be...

@@ -239,7 +240,7 @@ func (r filterResults) FailureReason() string {
}

//nolint:gocyclo
func filterInstanceTypesByRequirements(instanceTypes []*cloudprovider.InstanceType, requirements scheduling.Requirements, requests v1.ResourceList) filterResults {
func (n *NodeClaim) filterInstanceTypesByRequirements(requirements scheduling.Requirements, requests v1.ResourceList) filterResults {
Contributor


🚀

Author


🌔

@BEvgeniyS BEvgeniyS requested a review from ellistarn July 17, 2024 04:39
@escherize

Thanks @BEvgeniyS, very cool!

@BEvgeniyS
Author

Given that the kubelet section has moved to the nodeclass, would it make sense to move this logic to the cloud provider?

It seems the Kwok provider doesn't have this issue (and it's not obvious how one could replicate it there), so this could be a cloud-provider-specific issue.

@BEvgeniyS
Author

Updated PR with dependency injection

@ellistarn @tallaxes @jackfrancis
Any chance you can take a look?


This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 13, 2024
@BEvgeniyS
Author

@ellistarn Can we remove the stale label from this PR?

@github-actions github-actions bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 19, 2024
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 13, 2024
@BEvgeniyS
Author

Since we've been running this as a fork in prod, we have discovered that allocatable memory and ephemeral storage may fluctuate due to kernel reservations (confirmed by comparing dmesg outputs; the difference is in the single-digit KiB range between instances).

That caused unnecessary cache updates.
I've changed the PR so that the cache is no longer updated from every new node; instead, the cached information is reused until the cache is cleared.

c818941
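For illustration, continuing the hypothetical cache sketch from earlier in this thread, the "populate once, reuse until cleared" behavior amounts to writing a value only when the key is absent:

// SetIfAbsent records the allocatable only for keys that have no entry yet, so
// small per-node fluctuations don't cause repeated cache updates; the entry is
// kept until the nodepool's entries are flushed. (Hypothetical sketch, not the
// PR's exact code.)
func (c *AllocatableCache) SetIfAbsent(nodePool, instanceType string, allocatable v1.ResourceList) {
	c.mu.Lock()
	defer c.mu.Unlock()
	k := key(nodePool, instanceType)
	if _, ok := c.items[k]; ok {
		return
	}
	c.items[k] = allocatable
}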

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 17, 2024
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 30, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 30, 2024

This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 15, 2024
@github-actions github-actions bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 18, 2024
@BEvgeniyS
Author

Closing in favor of aws/karpenter-provider-aws#7004

Thanks @jmdeal!

@BEvgeniyS BEvgeniyS closed this Oct 21, 2024
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Discover Instance Type Capacity Memory Overhead Instead of vmMemoryOverheadPercent
5 participants