
chore: improve disruption for underutilization #992

Closed
wants to merge 13 commits

Conversation


@Luke-Smartnews Luke-Smartnews commented Feb 7, 2024

Background

We're trying to use Karpenter in our production environment. The scaling-out feature is working well, but scaling-in (disruption) is a problem for us. There are only two options: WhenUnderutilized | WhenEmpty.

Proposals

  • Add an option to check node utilization first; normally, if a node is fully utilized there is no need to disrupt it (a rough sketch of such a check follows this list).
  • Support consolidateAfter for WhenUnderutilized.
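
As a rough illustration of the first proposal (a sketch only, not this PR's implementation; the utilizationThreshold name is taken from the test report below, everything else here is hypothetical), the per-node check could look like:

    package sketch

    import "k8s.io/apimachinery/pkg/api/resource"

    // isUnderutilized reports whether a node's total requested resources fall
    // below a percentage threshold of its allocatable resources. Sketch only:
    // it looks at a single resource and ignores daemonsets, overhead, etc.
    func isUnderutilized(requested, allocatable resource.Quantity, thresholdPercent int64) bool {
        if allocatable.IsZero() {
            return false // no allocatable capacity reported; don't mark as underutilized
        }
        pct := float64(requested.MilliValue()) / float64(allocatable.MilliValue()) * 100
        return pct < float64(thresholdPercent)
    }

A node that passes such a check would then only become a disruption candidate after the consolidateAfter window proposed in the second bullet.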


linux-foundation-easycla bot commented Feb 7, 2024

CLA Not Signed

@Luke-Smartnews Luke-Smartnews marked this pull request as draft February 7, 2024 03:21
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Luke-Smartnews
Once this PR has been reviewed and has the lgtm label, please assign njtran for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Feb 7, 2024
@k8s-ci-robot
Contributor

Welcome @Luke-Smartnews!

It looks like this is your first PR to kubernetes-sigs/karpenter 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/karpenter has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Feb 7, 2024
@k8s-ci-robot
Contributor

Hi @Luke-Smartnews. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Feb 7, 2024
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 7, 2024
@@ -3,7 +3,8 @@ apiVersion: apiextensions.k8s.io/v1
 kind: CustomResourceDefinition
 metadata:
   annotations:
-    controller-gen.kubebuilder.io/version: v0.14.0
+    controller-gen.kubebuilder.io/version: v0.8.0
Member


please run make toolchain to update your controller-gen version

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 7, 2024
@Luke-Smartnews
Author

Tested in our dev env, working without any issues. Will do more load tests.

@almson

almson commented Feb 13, 2024

Can you write a description of how you've implemented this? This seems to drastically change how "WhenUnderutilized" works. Previously, it took effect when the cluster as a whole was underutilized, but now it seems to just mean that the node is underutilized?

@Bryce-Soghigian
Member

Bryce-Soghigian commented Feb 13, 2024

> Can you write a description of how you've implemented this? This seems to drastically change how "WhenUnderutilized" works.

Excited to see a POC!

Beyond a description, it should probably first be proposed as an RFC so the community can review it and discuss alternative approaches, or whether this is behavior the community wants to accept. Seems interesting though :)

@Luke-Smartnews
Author

Luke-Smartnews commented Feb 15, 2024

Will do after I confirm it works at a large scale. Also, is there a template for the RFC?

@Luke-Smartnews
Author

Test Report

  • nodepool settings

     spec:
       disruption:
         budgets:
         - nodes: 10%
         consolidateAfter: 5m
         consolidationPolicy: WhenUnderutilized
         expireAfter: Never
         utilizationThreshold: 65
  • confirmed the Underutilized condition is added to target nodes

      conditions:
      - lastTransitionTime: "2024-02-23T11:24:05Z"
        status: "True"
        type: Initialized
      - lastTransitionTime: "2024-02-23T11:22:54Z"
        status: "True"
        type: Launched
      - lastTransitionTime: "2024-02-23T11:24:05Z"
        status: "True"
        type: Ready
      - lastTransitionTime: "2024-02-23T11:23:40Z"
        status: "True"
        type: Registered
      - lastTransitionTime: "2024-02-23T11:31:53Z"
        severity: Warning
        status: "True"
        type: Underutilized
  • pending pods scheduled in 5m
    (screenshot attached)

  • nodes are removed 5m after the pods are deleted
    (screenshot attached)

@njtran
Contributor

njtran commented Mar 4, 2024

Hey @Luke-Smartnews, this is a core difference in how Karpenter considers underutilization. We've intentionally not surfaced a utilization threshold due to its edge cases and how it ends up driving overall lower utilization. This would be a huge change to our disruption logic, and would definitely require opening up a design + RFC. I'm not sure we would be accepting this feature (at least in its current state without a design), but I would love to hear about the use-cases you're trying to solve here.

@barryrobison

> Hey @Luke-Smartnews, this is a core difference in how Karpenter considers underutilization. We've intentionally not surfaced a utilization threshold due to its edge cases and how it ends up driving overall lower utilization. This would be a huge change to our disruption logic, and would definitely require opening up a design + RFC. I'm not sure we would be accepting this feature (at least in its current state without a design), but I would love to hear about the use-cases you're trying to solve here.

@njtran is the current utilisation logic documented somewhere? From my brief reading of the source, it looks like Karpenter does a "fake" scheduling run, and if the node's pods can be scheduled elsewhere it consolidates the node?

Our use case is Flink jobs performing data transfer. Those pods are bursty in nature, and we want to ensure they run for at least 10 minutes so they can reach a checkpoint before being rescheduled. The current system is reclaiming nodes in under 90s, so they barely have time to get scheduled before Karpenter pulls the rug.

Thank you!

@njtran
Contributor

njtran commented Mar 5, 2024

@barryrobison I think you want #752. And you're correct, that's how Karpenter considers consolidation.
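
For context, the simulated-scheduling idea discussed above can be pictured roughly like this. It is an illustrative sketch, not Karpenter's actual code: a node is a consolidation candidate when a simulated scheduling pass can place all of its reschedulable pods onto the remaining nodes.

    package sketch

    // pod and node are simplified stand-ins for this sketch; real scheduling also
    // considers memory, affinity, topology spread, taints, PDBs, and more.
    type pod struct{ cpuMilli int64 }

    type node struct{ spareCPUMilli int64 }

    // canConsolidate reports whether every pod from the candidate node fits into
    // the spare capacity of the other nodes in a first-fit simulated pass.
    func canConsolidate(candidatePods []pod, others []node) bool {
        spare := make([]int64, len(others))
        for i, n := range others {
            spare[i] = n.spareCPUMilli
        }
        for _, p := range candidatePods {
            placed := false
            for i := range spare {
                if spare[i] >= p.cpuMilli {
                    spare[i] -= p.cpuMilli
                    placed = true
                    break
                }
            }
            if !placed {
                return false
            }
        }
        return true
    }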

@Luke-Smartnews
Author

@njtran
My change only takes effect when the NodePool is configured with UtilizationThreshold; it doesn't change the existing behavior at all.

https://github.com/kubernetes-sigs/karpenter/pull/992/files#diff-c11b5a9240bbfac3dd14fdfff84e098ade0f3bf0a5fc63673de41ff710d3d308R105
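
To make the opt-in behavior described above concrete, here is a minimal sketch (hypothetical names, not the PR's actual diff): when the threshold is not configured, the new check filters nothing, so existing candidacy logic is untouched.

    package sketch

    // disruptionSpec stands in for the proposed NodePool disruption block;
    // nil means UtilizationThreshold was not configured.
    type disruptionSpec struct {
        UtilizationThreshold *int64 // percentage, 0-100
    }

    // passesUtilizationGate returns true when the node should still be considered
    // a disruption candidate as far as the new check is concerned. With no
    // threshold set it always returns true, leaving existing behavior unchanged.
    func passesUtilizationGate(spec disruptionSpec, nodeUtilizationPercent float64) bool {
        if spec.UtilizationThreshold == nil {
            return true
        }
        return nodeUtilizationPercent < float64(*spec.UtilizationThreshold)
    }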

@njtran
Contributor

njtran commented Mar 15, 2024

Hey @Luke-Smartnews, even if this is just a change to the candidacy, we'd need to review a design/RFC before continuing on this. Would you be able to open a design PR first?


This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 30, 2024
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 4, 2024
@k8s-ci-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@github-actions github-actions bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 5, 2024
@njtran
Contributor

njtran commented Apr 8, 2024

Will close this until there's an RFC for this. Please feel free to re-open when you cut the RFC!

@njtran njtran closed this Apr 8, 2024
@booleanbetrayal

@njtran - Could you point us to the RFC process needed to unblock this PR? We are seeing worse bin-packing performance from Karpenter than from cluster-autoscaler and are debating switching back. Being able to leverage advanced consolidation features (eval time + thresholds) would put Karpenter on an equal footing for us. Please advise.
