
feat: Add options to configure consolidation timeouts #1031

Conversation

@domgoodwin commented Feb 20, 2024

Issue: #903

Description
Adds option values to configure the timeouts for multi-node and single-node consolidation. Depending on the size of a cluster and its various affinities/topology spreads, consolidation can take longer than these hard-coded values allow. Making the timeouts configurable means large, complex clusters can use longer ones where needed.
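For illustration, the new knobs could be wired up roughly like the sketch below. The flag names, field names, and defaults here are assumptions modeled on how Karpenter registers its other duration options (the 1m/3m defaults assume the existing hard-coded values are kept), not necessarily what this PR actually adds:

```go
package options

import (
	"flag"
	"time"
)

// Options holds the proposed consolidation timeouts. Field names, flag
// names, and defaults are illustrative assumptions, not necessarily the
// PR's actual implementation.
type Options struct {
	MultiNodeConsolidationTimeout  time.Duration
	SingleNodeConsolidationTimeout time.Duration
}

// AddFlags registers the timeouts, defaulting to the previously
// hard-coded values.
func (o *Options) AddFlags(fs *flag.FlagSet) {
	fs.DurationVar(&o.MultiNodeConsolidationTimeout, "multi-node-consolidation-timeout", 1*time.Minute,
		"Maximum time to spend computing a multi-node consolidation decision.")
	fs.DurationVar(&o.SingleNodeConsolidationTimeout, "single-node-consolidation-timeout", 3*time.Minute,
		"Maximum time to spend computing a single-node consolidation decision.")
}
```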

How was this change tested?
I set this version as the Karpenter version in the aws-provider repo, built an image, and deployed it to a test cluster. Things worked fine with the value both set and unset (defaulting to the existing values), i.e.:

replace sigs.k8s.io/karpenter => github.com/domgoodwin/karpenter v0.0.0-20240220105243-cf5336018872

The unit-test changes also cover setting the value to a non-default and verifying that consolidation times out after that duration.
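In spirit, that test looks something like the sketch below, which uses a plain context deadline as a stand-in for the consolidation loop (a generic illustration, not the PR's actual test code):

```go
package consolidation_test

import (
	"context"
	"testing"
	"time"
)

// TestConsolidationHonorsConfiguredTimeout is a generic sketch: a
// caller-supplied, non-default timeout should bound how long the
// consolidation pass runs. The bare context stands in for the real loop,
// which would check ctx.Err() between candidate evaluations.
func TestConsolidationHonorsConfiguredTimeout(t *testing.T) {
	timeout := 100 * time.Millisecond // non-default value for the test

	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	start := time.Now()
	<-ctx.Done() // simulate work until the deadline fires
	if elapsed := time.Since(start); elapsed < timeout {
		t.Fatalf("returned after %v, before the configured timeout %v", elapsed, timeout)
	}
}
```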

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: domgoodwin
Once this PR has been reviewed and has the lgtm label, please assign mwielgus for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the cncf-cla: yes label (Indicates the PR's author has signed the CNCF CLA.) on Feb 20, 2024
@k8s-ci-robot (Contributor)

Welcome @domgoodwin!

It looks like this is your first PR to kubernetes-sigs/karpenter 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/karpenter has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot (Contributor)

Hi @domgoodwin. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the needs-ok-to-test (Indicates a PR that requires an org member to verify it is safe to test.) and size/M (Denotes a PR that changes 30-99 lines, ignoring generated files.) labels on Feb 20, 2024
@jonathan-innis (Member)

> Depending on the size of a cluster and its various affinities/topology spreads, consolidation can take longer than these hard-coded values allow.

This feature starts to look similar to our batch duration parameters, which we were a little iffy about surfacing in the first place. Do you also override the batch duration parameters with custom values?

The other question I have: what would you set the values to if you could override them as proposed here? And how large is the cluster that doesn't work well with the current timeout values?
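For context, the batch duration parameters mentioned above group incoming pod events into a single scheduling pass using an idle window plus an overall cap. Conceptually they behave like this simplified sketch (the batch function and event type are illustrative only, not Karpenter's actual implementation):

```go
package batching

import "time"

type event struct{}

// batch collects events until either no new event arrives within
// idleDuration or maxDuration elapses overall, whichever comes first.
// This mirrors the idle/max batching concept, not Karpenter's real code.
func batch(in <-chan event, idleDuration, maxDuration time.Duration) []event {
	var collected []event
	deadline := time.After(maxDuration)
	for {
		select {
		case e := <-in:
			collected = append(collected, e) // each event restarts the idle window below
		case <-time.After(idleDuration):
			return collected // quiet period: flush the batch
		case <-deadline:
			return collected // overall cap reached: flush regardless
		}
	}
}
```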


This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.

@github-actions bot added the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale.) on Mar 17, 2024
@domgoodwin (Author)

> Depending on the size of a cluster and its various affinities/topology spreads, consolidation can take longer than these hard-coded values allow.
>
> This feature starts to look similar to our batch duration parameters, which we were a little iffy about surfacing in the first place. Do you also override the batch duration parameters with custom values?
>
> The other question I have: what would you set the values to if you could override them as proposed here? And how large is the cluster that doesn't work well with the current timeout values?

I haven't changed the batch duration parameters currently, as I figured those were more around scale-out, which seems to work OK. We do get a bit of rubber-banding where we add more nodes than we need and then consolidate down, but honestly the speed trade-off from not having to run any headroom pods seems worth it.

We are actually running this code, as our cluster just wasn't scaling in at all with the current timeouts. We've set them to 10m, although based on metrics 5m would be fine.

We initially just threw more CPU at Karpenter hoping it would help, but it seemingly only ever uses 2 CPU cores.

@github-actions bot removed the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale.) on Apr 1, 2024
@njtran (Contributor) commented Apr 8, 2024

I'm hesitant to make this configurable, as changing these knobs can have unknown impacts on the performance of consolidation. Can you share what cluster sizes you're running at? Are you using any complex scheduling constraints like anti-affinity? If anything, I can see making this scale with the size of the cluster in the long run, but I'd be concerned about giving users free rein over this.

@domgoodwin (Author)

The cluster is anywhere between 25k pods at peak and 15k overnight, across 250-500 nodes.

We used to have anti-affinities but removed them across the board and saw a significant scheduling-time performance increase.
We do have podTopologySpread set up on basically every pod to spread across AZs and hosts (we don't want 15 replicas of the same deployment on one node), but both of these are mostly ScheduleAnyway, i.e. soft constraints.
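For reference, a soft zone-and-hostname spread of the kind described here looks like the following, a generic example built from the upstream Kubernetes API types (not the poster's actual manifests):

```go
package example

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// softSpread returns ScheduleAnyway (soft) topology spread constraints
// across availability zones and hosts for pods matching the given labels.
func softSpread(podLabels map[string]string) []corev1.TopologySpreadConstraint {
	selector := &metav1.LabelSelector{MatchLabels: podLabels}
	return []corev1.TopologySpreadConstraint{
		{
			MaxSkew:           1,
			TopologyKey:       "topology.kubernetes.io/zone",
			WhenUnsatisfiable: corev1.ScheduleAnyway, // prefer, but don't block scheduling
			LabelSelector:     selector,
		},
		{
			MaxSkew:           1,
			TopologyKey:       "kubernetes.io/hostname",
			WhenUnsatisfiable: corev1.ScheduleAnyway,
			LabelSelector:     selector,
		},
	}
}
```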

@njtran (Contributor) commented Apr 29, 2024

How did you come to these numbers? Did you test them out yourself? I can understand that these timeouts shouldn't be one-size-fits-all, and it sounds like the numbers you're proposing would work better for your situation. Especially when adding API surface (even behind a feature gate), this is the type of feature that would still require an RFC. Do you mind writing one up? Feel free to reach out to the maintainers on the Kubernetes Slack to figure out how to write one, or check out the existing RFCs in the repo.


This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.

@github-actions bot added the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale.) on May 14, 2024
@github-actions bot closed this on May 29, 2024