
fix: accurately track allocatable resources for nodes #1420

Closed
wants to merge 26 commits

Conversation

BEvgeniyS

Fixes aws/karpenter-provider-aws#5161

Description
The current method of estimating allocatable memory, which simply discards a percentage of usable memory via the VM_MEMORY_OVERHEAD_PERCENT global variable, is suboptimal: no single value avoids both overestimating and underestimating allocatable memory.

Cluster-autoscaler addresses this issue by learning about the true allocatable memory from actual nodes and retaining that information. In this pull request, I'm applying the same concept.

To demonstrate the issue:

  1. Set VM_MEMORY_OVERHEAD_PERCENT to 0
  2. Create a nodepool with a single instance type:
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: approaching-allocatable-nodepool-0
spec:
  limits:
    cpu: "18"
    memory: 36Gi
  template:
    metadata:
      labels:
        approaching-allocatable: nodepool-0
    spec:
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: approaching-allocatable-nodeclass-0
      requirements:
      - key: node.kubernetes.io/instance-type
        operator: In
        values:
        - t4g.medium
      taints:
      - effect: NoExecute
        key:  approaching-allocatable
        value: "nodepool-0"
      kubelet:
        systemReserved:
          memory: "1Ki"
        kubeReserved:
          memory: "1Ki"
        evictionHard:
          memory.available: "1Ki"
  3. Create a workload with a request close to the node's allocatable:
apiVersion: v1
kind: Pod
metadata:
  name: approaching-allocatable-pod
  namespace: default
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: approaching-allocatable
            operator: In
            values:
            - nodepool-0
  containers:
  - image: public.ecr.aws/eks-distro/kubernetes/pause@sha256:c2518f6d82392ba799d551398805aaa7af70548015263d962afe9710c0eaa1b2
    name: trigger-pod
    resources:
      requests:
        cpu: 10m
        memory: 3686Mi
  tolerations:
  - effect: NoExecute
    key: approaching-allocatable
    operator: Equal
    value: nodepool-0

Observed behaviors

  1. Resolving Resource Overestimation:

    • v0.37.0 behavior: Karpenter continuously creates and consolidates nodes without realizing the impossibility of fitting the workload.
    • Patched behavior: Accurately tracks actual allocatable resources, preventing the endless loop of node creation and consolidation.
  2. Addressing Resource Underestimation:

    • v0.37.0 behavior: Karpenter leaves pods pending indefinitely or chooses a larger instance type than necessary, failing to learn from actual node allocatables even when nodes are launched for other reasons.
    • Patched behavior: Remembers the true allocatable resources once a node has been launched, enabling correct node launches for previously pending pods.
  3. Avoiding Extra Churn:

    • v0.37.0 behavior: Incorrectly predicted allocatable resources during consolidation lead to unnecessary churn.
    • Patched behavior: Scheduling simulations benefit from knowledge of the true allocatable resources.

The above improvements are implemented using a shared cache that can be accessed from:

  • lifecycle package: to populate the cache as soon as a node is registered.
  • scheduling package: to use the real allocatable resources from the cache, if available, when making itFits decisions.
  • hash package: to flush the cache for a nodepool after an update.

I tried to avoid introducing a global-like package, but placing the cache in any of the above packages (or others) introduces more coupling between those packages. If there is a definitive place for such a cache, please let me know.
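To make the idea concrete, here is a minimal sketch of such a shared cache, assuming hypothetical names (AllocatableCache, Set/Get/Flush). It illustrates the approach, not the exact code in this PR:

package sharedcache

import (
	"fmt"
	"strings"
	"sync"

	v1 "k8s.io/api/core/v1"
)

// AllocatableCache maps a nodepool/instance-type pair to the allocatable
// resources observed on a real node. Names here are illustrative.
type AllocatableCache struct {
	mu    sync.RWMutex
	items map[string]v1.ResourceList
}

func NewAllocatableCache() *AllocatableCache {
	return &AllocatableCache{items: map[string]v1.ResourceList{}}
}

func key(nodePool, instanceType string) string {
	return fmt.Sprintf("allocatableCache;%s;%s", nodePool, instanceType)
}

// Set records the allocatable observed at node registration (lifecycle package).
func (c *AllocatableCache) Set(nodePool, instanceType string, allocatable v1.ResourceList) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.items[key(nodePool, instanceType)] = allocatable
}

// Get returns the cached allocatable, if any, for itFits decisions (scheduling
// package); callers fall back to the VM_MEMORY_OVERHEAD_PERCENT estimate otherwise.
func (c *AllocatableCache) Get(nodePool, instanceType string) (v1.ResourceList, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	allocatable, ok := c.items[key(nodePool, instanceType)]
	return allocatable, ok
}

// Flush drops all entries for a nodepool after its spec changes (hash package).
func (c *AllocatableCache) Flush(nodePool string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	prefix := fmt.Sprintf("allocatableCache;%s;", nodePool)
	for k := range c.items {
		if strings.HasPrefix(k, prefix) {
			delete(c.items, k)
		}
	}
}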

How was this change tested?
For overestimation:
I ran this in one of our preprod EKS clusters with vmMemoryOverheadPercent=0, and it correctly stops re-launching nodes of a given nodepool/instance-type combination after the first attempt fails. It also uses the correct allocatable memory for scheduling.

For underestimation:
The test was to:

  1. Set a high VM_MEMORY_OVERHEAD_PERCENT value (e.g. 0.2)
  2. Run the workload that previously fit and observe that it stays pending
  3. Add another workload for the same nodepool, but with a lower request; this launches a real node
  4. Another node then launches for the pod from step 2, and new pods with the same requests now correctly cause new nodes to be launched

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: BEvgeniyS
Once this PR has been reviewed and has the lgtm label, please assign tzneal for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 15, 2024
@k8s-ci-robot
Contributor

Welcome @BEvgeniyS!

It looks like this is your first PR to kubernetes-sigs/karpenter 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/karpenter has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jul 15, 2024
@k8s-ci-robot
Contributor

Hi @BEvgeniyS. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jul 15, 2024
@coveralls

coveralls commented Jul 15, 2024

Pull Request Test Coverage Report for Build 11386834169

Details

  • 43 of 52 (82.69%) changed or added relevant lines in 8 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.01%) to 80.925%

Changes Missing Coverage                      Covered Lines  Changed/Added Lines  %
pkg/controllers/nodepool/hash/controller.go   4              8                    50.0%
pkg/controllers/controllers.go                0              5                    0.0%

Totals Coverage Status
Change from base Build 11332670114: 0.01%
Covered Lines: 8523
Relevant Lines: 10532

💛 - Coveralls

@BEvgeniyS BEvgeniyS changed the title Discover allocatable fix: accurately track allocatable resources for nodes Jul 15, 2024
// Update cached allocatables
cacheMapKey := fmt.Sprintf(
"allocatableCache;%s;%s",
nodeClaim.Labels[v1.NodePoolLabelKey],
Contributor


What happens when nodepool.spec.template.nodeClassRef changes?

Author


Now that the kubelet settings have moved to the nodeclass, this requires an update

Author


Based on this, it seems that a change in the nodeclass spec is not supposed to trigger hash recalculation, to avoid unnecessary drift.

In this case, the cache will be updated the first time a node of this nodepool/instance-type combination comes up with updated allocatables

limitations under the License.
*/

package sharedcache
Contributor


Why not just inject this into the controller at initialization time like we do with our other caches?

Author

@BEvgeniyS BEvgeniyS Jul 17, 2024


Something like this?

I tried that at first, but the amount of changes seemed excessive. Which do you think is better?

I don't like the number of layers the cache has to be passed through, and all the resulting changes to the tests, which make this change much bigger than necessary.

Contributor


It's how we do things in the codebase today: we wire dependencies around explicitly, rather than using global singletons or a DI framework.

Usually you can just instantiate it as part of the parent class's constructor. If used in multiple places, we wire it up as needed.
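As a rough illustration of that wiring style (the type and constructor names below are hypothetical and simplified, not the actual Karpenter signatures), the shared cache would be instantiated once at startup and passed in through constructors:

package controllers

import (
	v1 "k8s.io/api/core/v1"
)

// AllocatableCache is a hypothetical interface for the shared cache sketched
// earlier; controllers depend on the interface rather than a global variable.
type AllocatableCache interface {
	Set(nodePool, instanceType string, allocatable v1.ResourceList)
	Get(nodePool, instanceType string) (v1.ResourceList, bool)
	Flush(nodePool string)
}

// Lifecycle records observed allocatables when nodes register.
type Lifecycle struct {
	cache AllocatableCache
}

// NewLifecycle demonstrates constructor injection: the dependency is passed in
// explicitly and stored on the struct.
func NewLifecycle(cache AllocatableCache) *Lifecycle {
	return &Lifecycle{cache: cache}
}

// Hash flushes the cache when a nodepool's spec changes.
type Hash struct {
	cache AllocatableCache
}

func NewHash(cache AllocatableCache) *Hash {
	return &Hash{cache: cache}
}

// At startup, a single cache instance would be created and handed to every
// controller that needs it, e.g.:
//   cache := sharedcache.NewAllocatableCache()
//   lifecycle := NewLifecycle(cache)
//   hash := NewHash(cache)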

Author


Understood. I definitely want to follow existing practices.

Can you review https://github.com/BEvgeniyS/karpenter/pull/2/files then? I'll push the changes to this branch if it looks correct.

Author


Updated this PR to use injection.

@@ -251,10 +252,22 @@ func filterInstanceTypesByRequirements(instanceTypes []*cloudprovider.InstanceTy
fitsAndOffering: false,
}

for _, it := range instanceTypes {
for _, it := range n.InstanceTypeOptions {
Contributor


I wonder if we should collapse this code with https://github.com/kubernetes-sigs/karpenter/pull/1379/files, which aims to solve this type of problem

Author


I think this might lead to the instanceType provider being referenced from more packages than it really needs to be...

@@ -239,7 +240,7 @@ func (r filterResults) FailureReason() string {
}

//nolint:gocyclo
func filterInstanceTypesByRequirements(instanceTypes []*cloudprovider.InstanceType, requirements scheduling.Requirements, requests v1.ResourceList) filterResults {
func (n *NodeClaim) filterInstanceTypesByRequirements(requirements scheduling.Requirements, requests v1.ResourceList) filterResults {
Contributor


🚀

Author


🌔

@BEvgeniyS BEvgeniyS requested a review from ellistarn July 17, 2024 04:39
@escherize

Thanks @BEvgeniyS, very cool!

@BEvgeniyS
Author

Given that the kubelet section has moved to the nodeclass, would it make sense to move this logic to the cloud provider?

It seems the Kwok provider doesn't have this issue (and it's not obvious how one could replicate it there), so this could be a cloud-provider-specific issue.

@BEvgeniyS
Author

Updated PR with dependency injection

@ellistarn @tallaxes @jackfrancis
Any chance you can take a look?


This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 13, 2024
@BEvgeniyS
Author

@ellistarn Can we remove the stale label from this PR?

@github-actions github-actions bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 19, 2024
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 13, 2024
@BEvgeniyS
Author

Since we've been running this as a fork in prod, we have discovered that allocatable memory and ephemeral storage may fluctuate due to kernel reservations (confirmed by comparing dmesg outputs; the difference is in the single-digit KiB range between instances).

That caused unnecessary cache updates.
I've changed the PR so that the cache is no longer updated from every new node; instead, the cached information is reused until the cache is cleared.

c818941
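For illustration, continuing the hypothetical cache sketch from earlier in this thread, the "populate once, reuse until cleared" behavior amounts to writing a value only when the key is absent:

// SetIfAbsent records the allocatable only for keys that have no entry yet, so
// small per-node fluctuations don't cause repeated cache updates; the entry is
// kept until the nodepool's entries are flushed. (Hypothetical sketch, not the
// PR's exact code.)
func (c *AllocatableCache) SetIfAbsent(nodePool, instanceType string, allocatable v1.ResourceList) {
	c.mu.Lock()
	defer c.mu.Unlock()
	k := key(nodePool, instanceType)
	if _, ok := c.items[k]; ok {
		return
	}
	c.items[k] = allocatable
}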

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 17, 2024
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 30, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 30, 2024

This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 15, 2024
@github-actions github-actions bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 18, 2024
@BEvgeniyS
Author

Closing in favor of aws/karpenter-provider-aws#7004

Thanks @jmdeal!

@BEvgeniyS BEvgeniyS closed this Oct 21, 2024
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Discover Instance Type Capacity Memory Overhead Instead of vmMemoryOverheadPercent
5 participants