feat: Node Repair implementation #1793
Conversation
Skipping CI for Draft Pull Request.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: engedaam
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Pull Request Test Coverage Report for Build 11766614679 (details)
💛 - Coveralls
pkg/controllers/controllers.go
Outdated
if len(cloudProvider.RepairPolicy()) != 0 && options.FromContext(ctx).FeatureGates.NodeRepair {
controllers = append(controllers, health.NewController(kubeClient, cloudProvider, clock))
} else {
log.FromContext(ctx).V(1).Info("node repair has been disabled")
I don't think you want to log that it's disabled -- you probably want to log the opposite since this is disabled by default.
Also, you generally want to log when you are doing actions, not when you aren't doing them
I was trying to follow how we approached the drift feature flag. There, we would log each time we removed a status condition, and I thought it made sense to log the feature setting once on startup. I've removed it now.
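For illustration, a minimal sketch of the registration with the logging suggestion applied (log once at startup when the feature is enabled, rather than when it is disabled); it reuses the identifiers from the diff above and is only a sketch, not the PR's final code:

// Sketch: register the health controller only when the cloud provider exposes
// repair policies and the NodeRepair feature gate is enabled, and log that once at startup.
if len(cloudProvider.RepairPolicy()) != 0 && options.FromContext(ctx).FeatureGates.NodeRepair {
	log.FromContext(ctx).V(1).Info("node repair is enabled, registering the node health controller")
	controllers = append(controllers, health.NewController(kubeClient, cloudProvider, clock))
}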
}

// 3. Otherwise, if the Node is unhealthy and past its tolerationDisruption window we can forcefully terminate the node
if err := c.kubeClient.Delete(ctx, node); err != nil { |
What about ignoring not found?
I found that we didn't do that for other delete calls and just re-queued, but I can add that.
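As a sketch of the suggestion above, the delete call could swallow NotFound errors with controller-runtime's client.IgnoreNotFound helper (from sigs.k8s.io/controller-runtime/pkg/client); the surrounding identifiers are the ones from the diff:

// Sketch: treat an already-deleted Node as success rather than surfacing an error.
// client.IgnoreNotFound returns nil when the error is a NotFound API error.
if err := c.kubeClient.Delete(ctx, node); err != nil {
	return reconcile.Result{}, client.IgnoreNotFound(err)
}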
return reconcile.Result{RequeueAfter: disruptionTime.Sub(c.clock.Now())}, nil
}

nodeClaims, err := nodeutils.GetNodeClaims(ctx, node, c.kubeClient) |
Should we ignore cases where we either don't find a nodeclaim or get duplicate nodeclaims?
Ahh, I found NodeClaimForNode, which will validate the duplicate check. Switched to using that now. It might be better to use that elsewhere too.
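A sketch of that switch, assuming NodeClaimForNode takes (ctx, client, node) and returns a single NodeClaim or an error when none or duplicates are found; the exact signature and error handling are assumptions:

// Sketch: let the helper enforce the one-NodeClaim-per-Node invariant instead of
// listing NodeClaims and checking for duplicates by hand.
nodeClaim, err := nodeutils.NodeClaimForNode(ctx, c.kubeClient, node)
if err != nil {
	return reconcile.Result{}, err
}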
return reconcile.Result{}, err
}
// 4. The deletion timestamp has successfully been set for the Node, update relevant metrics.
log.FromContext(ctx).V(1).Info("deleting unhealthy node") |
Is this consistent with other forms of disruption? Do we use info or debug?
With other forms of disruption we log using info.
func (c *Controller) annotateTerminationGracePeriod(ctx context.Context, nodeClaim *v1.NodeClaim) error {
stored := nodeClaim.DeepCopy()
terminationTime := c.clock.Now().Format(time.RFC3339) |
I didn't see a discussion of this in the RFC -- let's talk about this more -- we're choosing to just blast past PDBs -- I'm worried about how dangerous this might be -- also, does the node termination controller respect a previously set terminationGracePeriod annotation set on the Node?
The node termination controller will respect a previously set terminationGracePeriod:
func (c *Controller) ensureTerminationGracePeriodTerminationTimeAnnotation(ctx context.Context, nodeClaim *v1.NodeClaim) error {
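As an illustration of "respects a previously set value", the helper might only set a termination timestamp when one is not already present; the annotation key and patch details below are assumptions for the sketch, not the PR's exact code:

// Sketch: never overwrite an existing termination timestamp on the NodeClaim.
// The annotation key is an assumed placeholder for the real API constant.
const terminationTimestampAnnotationKey = "karpenter.sh/nodeclaim-termination-timestamp"

func (c *Controller) ensureTerminationGracePeriodTerminationTimeAnnotation(ctx context.Context, nodeClaim *v1.NodeClaim) error {
	if _, ok := nodeClaim.Annotations[terminationTimestampAnnotationKey]; ok {
		return nil // a previously set termination time takes precedence
	}
	stored := nodeClaim.DeepCopy()
	if nodeClaim.Annotations == nil {
		nodeClaim.Annotations = map[string]string{}
	}
	nodeClaim.Annotations[terminationTimestampAnnotationKey] = c.clock.Now().Format(time.RFC3339)
	return c.kubeClient.Patch(ctx, nodeClaim, client.MergeFrom(stored))
}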
Fixes #N/A
Description
This PR adds a new cloud provider method, RepairPolicy, that supports node conditions for which Karpenter will forcefully terminate nodes. The cloud provider policies define the unhealthy conditions a node can enter and the duration Karpenter tolerates them before reacting.
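Based only on this description, a hedged sketch of what such a policy might look like on the cloud provider side; the field names and types are assumptions for illustration (corev1 is k8s.io/api/core/v1):

// Sketch: a cloud-provider-defined repair policy pairs an unhealthy node
// condition with how long Karpenter tolerates it before forcefully terminating the node.
type RepairPolicy struct {
	// ConditionType is the node condition to watch (e.g. "Ready").
	ConditionType corev1.NodeConditionType
	// ConditionStatus marks the node unhealthy when matched
	// (e.g. corev1.ConditionFalse or corev1.ConditionUnknown).
	ConditionStatus corev1.ConditionStatus
	// TolerationDuration is how long the condition may persist before Karpenter reacts.
	TolerationDuration time.Duration
}

// The cloud provider would expose these policies through a method such as:
// RepairPolicy() []RepairPolicy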
How was this change tested?
make resubmit
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.