feat: Node Repair implementation #1793

Open
wants to merge 4 commits into
base: main
from

Conversation

engedaam
Contributor

@engedaam engedaam commented Oct 30, 2024

Fixes #N/A

Description

  • RFC: RFC: Node Auto Repair #1768
  • This PR implements the recommended solution defined in the node repair RFC.
  • It defines RepairPolicy on the cloud provider interface, listing the node conditions for which Karpenter will forcefully terminate a node. Each cloud provider policy names an unhealthy condition a node can enter and how long Karpenter waits before reacting.

How was this change tested?

  • make presubmit

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 30, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: engedaam
Once this PR has been reviewed and has the lgtm label, please assign jonathan-innis for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 30, 2024
@coveralls

coveralls commented Oct 30, 2024

Pull Request Test Coverage Report for Build 11766614679

Details

  • 62 of 83 (74.7%) changed or added relevant lines in 5 files are covered.
  • 4 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.05%) to 80.822%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| pkg/controllers/controllers.go | 0 | 7 | 0.0% |
| pkg/controllers/node/health/controller.go | 48 | 62 | 77.42% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| pkg/controllers/disruption/consolidation.go | 4 | 88.55% |

| Totals | Coverage Status |
|---|---|
| Change from base Build 11749223840 | -0.05% |
| Covered Lines | 8538 |
| Relevant Lines | 10564 |

💛 - Coveralls

@engedaam engedaam changed the title feat: Node Auto Repair implementation feat: Node Repair implementation Nov 7, 2024
@engedaam engedaam marked this pull request as ready for review November 7, 2024 23:53
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 7, 2024
@engedaam engedaam force-pushed the node-repair-implementation branch 2 times, most recently from 2338123 to 8cefba7 Compare November 8, 2024 00:07
if len(cloudProvider.RepairPolicy()) != 0 && options.FromContext(ctx).FeatureGates.NodeRepair {
	controllers = append(controllers, health.NewController(kubeClient, cloudProvider, clock))
} else {
	log.FromContext(ctx).V(1).Info("node repair has been disabled")
Member

I don't think you want to log that it's disabled -- you probably want to log the opposite since this is disabled by default.

Also, you generally want to log when you are doing actions, not when you aren't doing them

Contributor Author

I was trying to follow how we approached the drift feature flag. There, we would log each time we removed a status condition, so I thought it made sense to log the feature setting once on startup. I've removed it now.
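For reference, a minimal sketch of the pattern discussed in this thread: register the health controller only when the provider supplies repair policies and the feature gate is on, and log the action taken rather than its absence. The names (Controller, healthController, buildControllers) and the plain function used as a logger are assumptions for illustration, not the code in this PR.

```go
package main

// Controller is a placeholder for a registered controller.
type Controller interface {
	Name() string
}

// healthController is a hypothetical stand-in for the node health controller.
type healthController struct{}

func (healthController) Name() string { return "node.health" }

// buildControllers appends the health controller only when both conditions
// hold, and logs only when it actually enables the feature.
func buildControllers(base []Controller, policyCount int, nodeRepairEnabled bool, logf func(string)) []Controller {
	controllers := append([]Controller{}, base...)
	if policyCount != 0 && nodeRepairEnabled {
		logf("node repair is enabled")
		controllers = append(controllers, healthController{})
	}
	return controllers
}
```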

pkg/controllers/controllers.go (outdated, resolved)
pkg/cloudprovider/types.go (outdated, resolved)
pkg/cloudprovider/types.go (outdated, resolved)
pkg/cloudprovider/types.go (outdated, resolved)
}

// 3. Otherwise, if the Node is unhealthy and past its tolerationDisruption window, we can forcefully terminate the node
if err := c.kubeClient.Delete(ctx, node); err != nil {
Member

What about ignoring not found?

Contributor Author

I saw that we didn't do that for other delete calls and just re-queued, but I can add that.

return reconcile.Result{RequeueAfter: disruptionTime.Sub(c.clock.Now())}, nil
}

nodeClaims, err := nodeutils.GetNodeClaims(ctx, node, c.kubeClient)
Member

Should we ignore cases where we either don't find a nodeclaim or get duplicate nodeclaims?

Contributor Author

Ahh, I found NodeClaimForNode, which performs the duplicate check. I've switched to using that now. It might be better to use it elsewhere too.

return reconcile.Result{}, err
}
// 4. The deletion timestamp has successfully been set for the Node, update relevant metrics.
log.FromContext(ctx).V(1).Info("deleting unhealthy node")
Member

Is this consistent with other forms of disruption? Do we use info or debug?

Contributor Author

With other forms of disruption we log using info.

pkg/controllers/node/health/controller.go (resolved)

func (c *Controller) annotateTerminationGracePeriod(ctx context.Context, nodeClaim *v1.NodeClaim) error {
stored := nodeClaim.DeepCopy()
terminationTime := c.clock.Now().Format(time.RFC3339)
Member

I didn't see a discussion of this in the RFC -- let's talk about this more -- we're choosing to just blast past PDBs -- I'm worried about how dangerous this might be -- also, does the node termination controller respect a previously set terminationGracePeriod annotation set on the Node?

Contributor Author

The node termination controller will respect a previously set terminationGracePeriod; see:

func (c *Controller) ensureTerminationGracePeriodTerminationTimeAnnotation(ctx context.Context, nodeClaim *v1.NodeClaim) error {

In the RFC's section on forceful termination, I mentioned the need to not wait on pod termination: https://github.com/kubernetes-sigs/karpenter/pull/1768/files#diff-997c287391c6ff3ac98a6fada129945da32e2c6de6d9c225352315b207527aeaR119

@engedaam engedaam force-pushed the node-repair-implementation branch 9 times, most recently from c8bed26 to 390c056 Compare November 8, 2024 16:14
@k8s-ci-robot k8s-ci-robot added do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 10, 2024
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Nov 10, 2024
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
4 participants