feat: Node Repair implementation #1793

Open

engedaam wants to merge 4 commits into kubernetes-sigs:main from engedaam:node-repair-implementation

+482 −5

Contributor

engedaam commented Oct 30, 2024

Fixes #N/A

Description

RFC: Node Auto Repair #1768
This PR implements the recommended solution defined in the node repair RFC.
It defines a cloud provider interface, RepairPolicy, that lists the node conditions for which Karpenter will forcefully terminate a node. Each cloud provider policy names an unhealthy condition a node can enter and how long Karpenter waits before reacting.
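
As a rough illustration of the idea described above, here is a minimal Go sketch of a repair policy and the toleration check. The type and field names (`RepairPolicy`, `NodeCondition`, `TolerationDuration`, `TimeUntilRepair`) are assumptions made for this example and are not taken from the PR's actual code.

```go
package main

import (
	"fmt"
	"time"
)

// RepairPolicy is a hypothetical shape for one cloud-provider repair policy:
// an unhealthy node condition plus how long that condition is tolerated.
type RepairPolicy struct {
	ConditionType      string        // condition to watch, e.g. "Ready"
	ConditionStatus    string        // status that marks the node unhealthy, e.g. "False"
	TolerationDuration time.Duration // how long the condition may persist before repair
}

// NodeCondition is a simplified stand-in for a Kubernetes node condition.
type NodeCondition struct {
	Type               string
	Status             string
	LastTransitionTime time.Time
}

// TimeUntilRepair reports whether cond matches the policy and, if it does,
// how much of the toleration window is left at time now. A zero or negative
// duration means the window has already passed.
func TimeUntilRepair(policy RepairPolicy, cond NodeCondition, now time.Time) (time.Duration, bool) {
	if cond.Type != policy.ConditionType || cond.Status != policy.ConditionStatus {
		return 0, false
	}
	deadline := cond.LastTransitionTime.Add(policy.TolerationDuration)
	return deadline.Sub(now), true
}

func main() {
	policy := RepairPolicy{ConditionType: "Ready", ConditionStatus: "False", TolerationDuration: 30 * time.Minute}
	now := time.Date(2024, time.November, 8, 12, 0, 0, 0, time.UTC)
	cond := NodeCondition{Type: "Ready", Status: "False", LastTransitionTime: now.Add(-10 * time.Minute)}

	remaining, unhealthy := TimeUntilRepair(policy, cond, now)
	fmt.Println("unhealthy:", unhealthy, "remaining:", remaining)
}
```

A controller built along these lines would requeue for the remaining duration while it is positive, and only act once it reaches zero.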

How was this change tested?

make presubmit

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Contributor

k8s-ci-robot commented Oct 30, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

k8s-ci-robot added do-not-merge/work-in-progress cncf-cla: yes labels

Contributor

k8s-ci-robot commented Oct 30, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: engedaam
Once this PR has been reviewed and has the lgtm label, please assign jonathan-innis for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot requested review from jackfrancis and tallaxes

October 30, 2024 21:17

k8s-ci-robot added the size/L label

coveralls commented Oct 30, 2024

Pull Request Test Coverage Report for Build 11766614679

Details

62 of 83 (74.7%) changed or added relevant lines in 5 files are covered.
4 unchanged lines in 1 file lost coverage.
Overall coverage decreased (-0.05%) to 80.822%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
pkg/controllers/controllers.go	0	7	0.0%
pkg/controllers/node/health/controller.go	48	62	77.42%

Files with Coverage Reduction	New Missed Lines	%
pkg/controllers/disruption/consolidation.go	4	88.55%

Totals
Change from base Build 11749223840:	-0.05%
Covered Lines:	8538
Relevant Lines:	10564

💛 - Coveralls

engedaam changed the title ~~feat: Node Auto Repair implementation~~ feat: Node Repair implementation

engedaam force-pushed the node-repair-implementation branch from 4635c80 to 192984f Compare

November 7, 2024 23:39

engedaam marked this pull request as ready for review

November 7, 2024 23:53

k8s-ci-robot removed the do-not-merge/work-in-progress label

k8s-ci-robot requested a review from jmdeal

November 7, 2024 23:53

engedaam force-pushed the node-repair-implementation branch 2 times, most recently from 2338123 to 8cefba7 Compare

November 8, 2024 00:07

jonathan-innis reviewed

pkg/controllers/controllers.go Outdated

+              	if len(cloudProvider.RepairPolicy()) != 0 && !options.FromContext(ctx).FeatureGates.NodeRepair {
+              		controllers = append(controllers, health.NewController(kubeClient, cloudProvider, clock))
+              	} else {
+              		log.FromContext(ctx).V(1).Info("node repair has been disabled")

Member

jonathan-innis Nov 8, 2024

I don't think you want to log that it's disabled -- you probably want to log the opposite since this is disabled by default.

Also, you generally want to log when you are doing actions, not when you aren't doing them

Contributor Author

engedaam Nov 8, 2024

I was trying to follow how we approached the drift feature flag. There, we would log each time we removed a status condition, so I thought it made sense to log the feature setting once on startup. I've removed it now.
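
The review point in this thread (log when an action is taken, not when it is skipped) can be sketched as follows. This is a simplified illustration, not the PR's code: `FeatureGates`, `shouldRegisterHealthController`, and `registerControllers` are hypothetical names, controllers are represented as plain strings, and the sketch assumes the intent is to register the health controller only when the NodeRepair gate is on and the cloud provider supplies at least one repair policy.

```go
package main

import "fmt"

// FeatureGates is a hypothetical, simplified stand-in for the operator's
// feature-gate options.
type FeatureGates struct {
	NodeRepair bool
}

// shouldRegisterHealthController is true only when the gate is enabled and
// the cloud provider defines at least one repair policy.
func shouldRegisterHealthController(gates FeatureGates, repairPolicyCount int) bool {
	return gates.NodeRepair && repairPolicyCount > 0
}

// registerControllers returns the controller names to run. It logs only on
// the path where the optional controller is actually added.
func registerControllers(gates FeatureGates, repairPolicyCount int, logf func(string)) []string {
	controllers := []string{"disruption", "termination"}
	if shouldRegisterHealthController(gates, repairPolicyCount) {
		logf("node repair enabled, registering health controller")
		controllers = append(controllers, "health")
	}
	return controllers
}

func main() {
	logf := func(msg string) { fmt.Println(msg) }
	fmt.Println(registerControllers(FeatureGates{NodeRepair: true}, 2, logf))
	fmt.Println(registerControllers(FeatureGates{NodeRepair: false}, 2, logf))
}
```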

pkg/controllers/node/health/controller.go Outdated

+              	}
+              	// 3. Otherwise, if the Node is unhealthy and past its tolerationDisruption window we can forcefully terminate the node
+              	if err := c.kubeClient.Delete(ctx, node); err != nil {

Member

jonathan-innis Nov 8, 2024

What about ignoring not found?

Contributor Author

engedaam Nov 8, 2024

I saw that we didn't do that for other delete calls and just re-queue, but I can add that.
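
For context, "ignoring not found" means treating a delete of an already-deleted object as success instead of returning an error and requeueing. controller-runtime provides `client.IgnoreNotFound(err)` for this; the sketch below reproduces the same pattern with only the standard library. `errNotFound`, `ignoreNotFound`, and `deleteNode` are hypothetical stand-ins for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical sentinel standing in for an API "not found" error.
var errNotFound = errors.New("not found")

// ignoreNotFound returns nil when err is (or wraps) errNotFound and returns
// any other error unchanged.
func ignoreNotFound(err error) error {
	if errors.Is(err, errNotFound) {
		return nil
	}
	return err
}

// deleteNode is a stand-in for an API delete call against an in-memory store.
func deleteNode(nodes map[string]bool, name string) error {
	if !nodes[name] {
		return fmt.Errorf("deleting node %q: %w", name, errNotFound)
	}
	delete(nodes, name)
	return nil
}

func main() {
	nodes := map[string]bool{"node-a": true}

	// The first delete succeeds. The second finds nothing, which is the
	// outcome we wanted anyway, so the error is dropped.
	fmt.Println(ignoreNotFound(deleteNode(nodes, "node-a")))
	fmt.Println(ignoreNotFound(deleteNode(nodes, "node-a")))
}
```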

pkg/controllers/node/health/controller.go Outdated

+              		return reconcile.Result{RequeueAfter: disruptionTime.Sub(c.clock.Now())}, nil
+              	}
+              	nodeClaims, err := nodeutils.GetNodeClaims(ctx, node, c.kubeClient)

Member

jonathan-innis Nov 8, 2024

Should we ignore cases where we either don't find a nodeclaim or get duplicate nodeclaims?

Contributor Author

engedaam Nov 8, 2024

Ahh, I found NodeClaimForNode, which already validates the duplicate check. I've switched to using that now. It might be better if we used it elsewhere too.
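
The lookup discussed in this thread (exactly one NodeClaim per Node, with distinct errors for "none" and "duplicates") might look roughly like the sketch below. It is not the real `NodeClaimForNode` helper: the simplified `NodeClaim` type, the matching on a provider ID, and the error values are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// NodeClaim is a simplified, hypothetical stand-in for the real API type.
type NodeClaim struct {
	Name       string
	ProviderID string
}

var (
	errNodeClaimNotFound  = errors.New("no nodeclaim found for node")
	errDuplicateNodeClaim = errors.New("multiple nodeclaims found for node")
)

// nodeClaimForNode returns the single NodeClaim whose ProviderID matches.
// Zero matches and more than one match are reported as distinct errors so
// that a caller can choose to skip either case.
func nodeClaimForNode(claims []NodeClaim, providerID string) (NodeClaim, error) {
	var matches []NodeClaim
	for _, c := range claims {
		if c.ProviderID == providerID {
			matches = append(matches, c)
		}
	}
	switch len(matches) {
	case 0:
		return NodeClaim{}, errNodeClaimNotFound
	case 1:
		return matches[0], nil
	default:
		return NodeClaim{}, fmt.Errorf("%w: %d matches", errDuplicateNodeClaim, len(matches))
	}
}

func main() {
	claims := []NodeClaim{
		{Name: "claim-a", ProviderID: "provider://1"},
		{Name: "claim-b", ProviderID: "provider://2"},
		{Name: "claim-c", ProviderID: "provider://2"},
	}
	fmt.Println(nodeClaimForNode(claims, "provider://1"))
	fmt.Println(nodeClaimForNode(claims, "provider://2"))
	fmt.Println(nodeClaimForNode(claims, "provider://3"))
}
```

A reconciler using this pattern can test for either error with `errors.Is` and return without requeueing, rather than acting on an ambiguous Node.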

pkg/controllers/node/health/controller.go

+              		return reconcile.Result{}, err
+              	}
+              	// 4. The deletion timestamp has successfully been set for the Node, update relevant metrics.
+              	log.FromContext(ctx).V(1).Info("deleting unhealthy node")

Member

jonathan-innis Nov 8, 2024

Is this consistent with other forms of disruption? Do we use info or debug?

Contributor Author

engedaam Nov 8, 2024

With other forms of disruption we log using info.

pkg/controllers/node/health/controller.go

+              func (c *Controller) annotateTerminationGracePeriod(ctx context.Context, nodeClaim *v1.NodeClaim) error {
+              	stored := nodeClaim.DeepCopy()
+              	terminationTime := c.clock.Now().Format(time.RFC3339)

Member

jonathan-innis Nov 8, 2024

I didn't see a discussion of this in the RFC -- let's talk about this more -- we're choosing to just blast past PDBs -- I'm worried about how dangerous this might be -- also, does the node termination controller respect a previously set terminationGracePeriod annotation set on the Node?

Contributor Author

engedaam Nov 8, 2024

The node termination controller will respect a previously set terminationGracePeriod

karpenter/pkg/controllers/nodeclaim/lifecycle/controller.go

Line 243 in fea57a1

    
           func (c *Controller) ensureTerminationGracePeriodTerminationTimeAnnotation(ctx context.Context, nodeClaim *v1.NodeClaim) error {

I mentioned the need to not wait on pod termination in the RFC's section on forceful termination: https://github.com/kubernetes-sigs/karpenter/pull/1768/files#diff-997c287391c6ff3ac98a6fada129945da32e2c6de6d9c225352315b207527aeaR119
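
To make the "respects a previously set value" behaviour concrete, here is a small sketch of a set-if-absent annotation update. It assumes the termination time is stored as an RFC 3339 string in an annotation, as the `terminationTime` line in the diff suggests. The annotation key, the function name `ensureTerminationTimestamp`, and the use of a plain map in place of a NodeClaim are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// terminationTimestampKey is a hypothetical annotation key used only in this sketch.
const terminationTimestampKey = "example.com/termination-timestamp"

// exampleNow is a fixed instant so the example is deterministic.
var exampleNow = time.Date(2024, time.November, 8, 16, 0, 0, 0, time.UTC)

// ensureTerminationTimestamp returns a copy of annotations with the
// termination timestamp set to now, unless a value is already present.
// In that case the input is returned untouched and changed is false.
func ensureTerminationTimestamp(annotations map[string]string, now time.Time) (result map[string]string, changed bool) {
	if _, ok := annotations[terminationTimestampKey]; ok {
		return annotations, false
	}
	result = make(map[string]string, len(annotations)+1)
	for k, v := range annotations {
		result[k] = v
	}
	result[terminationTimestampKey] = now.Format(time.RFC3339)
	return result, true
}

func main() {
	// The health controller stamps "now", which asks for immediate forceful termination.
	first, changed := ensureTerminationTimestamp(map[string]string{}, exampleNow)
	fmt.Println(changed, first[terminationTimestampKey])

	// A later pass with a different clock reading leaves the earlier value alone.
	second, changed := ensureTerminationTimestamp(first, exampleNow.Add(time.Hour))
	fmt.Println(changed, second[terminationTimestampKey])
}
```

Stamping the current time is what lets repair skip waiting on pod termination, which is why the reviewer's concern about bypassing PDBs applies to this step.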

engedaam force-pushed the node-repair-implementation branch 9 times, most recently from c8bed26 to 390c056 Compare

November 8, 2024 16:14

mariuskimmina mentioned this pull request

feat(disruption): add node notready controller #1755

Closed

engedaam force-pushed the node-repair-implementation branch from 390c056 to 562ed1f Compare

November 10, 2024 02:27

k8s-ci-robot added do-not-merge/invalid-commit-message size/XXL and removed size/L labels

engedaam added 4 commits

November 10, 2024 16:06


Add node repair controller

Add Node Repair Feature flag

4f7dd09

Update to use RepairStatements

fd6edf4

Update to use RepairStatements

03e110a

engedaam force-pushed the node-repair-implementation branch from 7275033 to 03e110a Compare

November 10, 2024 16:07

k8s-ci-robot added size/L and removed do-not-merge/invalid-commit-message size/XXL labels

Labels

cncf-cla: yes size/L