Fast scale up when an asg fault #6729

daimaxiaxie · 2024-04-18T06:48:03Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

When an ASG faults (for example, it is out of stock), Cluster Autoscaler should recover quickly and then scale up another ASG.

Which issue(s) this PR fixes:

Special notes for your reviewer:

If an instance has an error, add it to LongUnregistered immediately and then delete the instance. Merge deleteCreatedNodesWithErrors into removeOldUnregisteredNodes.

Does this PR introduce a user-facing change?

None

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

k8s-ci-robot · 2024-04-18T06:48:12Z

Hi @daimaxiaxie. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

towca · 2024-04-19T09:49:26Z

/ok-to-test

MaciekPytel · 2024-04-19T11:59:30Z

This does a pretty significant refactor of the error handling mechanism that doesn't seem strictly necessary for a fast-fail path in AWS. At the same time it's pretty risky: given the way the AWS provider does fake nodes, there may be no difference between deleteCreatedNodesWithErrors and removeOldUnregisteredNodes, but how fake nodes work behind the scenes varies widely between providers.
If we want to make changes like that, I'd like a strong and clear argument that this will not make a difference that is observable by any provider, regardless of how it implements fake nodes behind the scenes (e.g. in GCE there are no fake nodes at all; a GCE MIG really will have instance objects in error states that must be cleaned up by CA).

daimaxiaxie · 2024-04-24T11:48:55Z

Makes sense, I made some changes. @MaciekPytel

deleteCreatedNodesWithErrors only deletes nodes with State == cloudprovider.InstanceCreating && ErrorInfo != nil.
removeOldUnregisteredNodes now only deletes error nodes with State == cloudprovider.InstanceCreating && ErrorInfo != nil. Therefore, all cases handled by deleteCreatedNodesWithErrors are covered (except for MinSize, where a normal scale-up applies).

Adding a node with State == cloudprovider.InstanceCreating && ErrorInfo != nil to LongUnregistered has no effect on any provider.

This mechanism is valid not only for AWS but also for other providers, such as Azure. It does not cause a difference that is observable by any provider, because it only moves existing logic. It is also healthier than the current behavior.
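For readers following the condition discussed above, here is a minimal sketch of the predicate. The types are simplified local stand-ins modeled on the cloudprovider package, not the real definitions, and the helpers `isCreatingWithError` and `filterCreatingWithError` are hypothetical names used only for this illustration.

```go
package main

// InstanceState is a simplified stand-in for the cloudprovider instance state.
type InstanceState int

const (
	InstanceRunning InstanceState = iota + 1
	InstanceCreating
	InstanceDeleting
)

// InstanceErrorInfo is a simplified stand-in carrying error details.
type InstanceErrorInfo struct {
	ErrorCode    string
	ErrorMessage string
}

// InstanceStatus holds the state and optional error of an instance.
type InstanceStatus struct {
	State     InstanceState
	ErrorInfo *InstanceErrorInfo
}

// Instance is a simplified stand-in for a cloud provider instance.
type Instance struct {
	Id     string
	Status *InstanceStatus
}

// isCreatingWithError reports whether the instance is still being created
// and the provider has attached an error to it.
func isCreatingWithError(inst Instance) bool {
	return inst.Status != nil &&
		inst.Status.State == InstanceCreating &&
		inst.Status.ErrorInfo != nil
}

// filterCreatingWithError keeps only the instances matching the predicate.
func filterCreatingWithError(insts []Instance) []Instance {
	var out []Instance
	for _, inst := range insts {
		if isCreatingWithError(inst) {
			out = append(out, inst)
		}
	}
	return out
}
```

Running instances with an error, and creating instances without one, do not match, so only the narrow "failed while creating" case would be affected in this sketch.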

k8s-ci-robot · 2024-07-20T10:24:27Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: daimaxiaxie
Once this PR has been reviewed and has the lgtm label, please assign feiskyer for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

cluster-autoscaler/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot · 2024-08-14T06:48:56Z

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-triage-robot · 2024-11-12T07:45:01Z

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle stale
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 18, 2024

k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. area/cluster-autoscaler labels Apr 18, 2024

k8s-ci-robot requested review from BigDarkClown and x13n April 18, 2024 06:48

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 19, 2024

daimaxiaxie mentioned this pull request May 6, 2024

cluster-autoscaler gets stuck with "Failed to fix node group sizes" error #6128

Open

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 20, 2024

daimaxiaxie force-pushed the fast-scale-up-when-out-of-stock branch from 2ba4b25 to af00f93 Compare July 20, 2024 10:24

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 20, 2024

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 31, 2024

daimaxiaxie added 6 commits August 3, 2024 17:06

error state instance join unregistered

f1c1226

fix TestStaticAutoscalerInstanceCreationErrors

9e7b777

delete unhealthy node in removeOldUnregisteredNodes

60fb358

remove deleteCreatedNodesWithErrors

1868fe8

add comment on IsFakeNodeUnhealthy

60702fb

only handle creating instance

e08137b

daimaxiaxie force-pushed the fast-scale-up-when-out-of-stock branch from af00f93 to e08137b Compare August 3, 2024 09:14

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 3, 2024

daimaxiaxie added 2 commits August 3, 2024 17:34

filling overrideNodesToDeleteForZeroOrMax

b618212

fix unit test

1ae7f24

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 14, 2024

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 12, 2024
