Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 Machine deletion skips waiting for volumes detached for unreachable Nodes #10662

Merged
merged 1 commit into from
Jun 17, 2024

Conversation

typeid
Copy link
Contributor

@typeid typeid commented May 22, 2024

What this PR does / why we need it:

When MHC are remediating an unresponsive node, CAPI drains the node and gives up waiting for pods to be deleted after the SkipWaitForDeleteTimeoutSeconds. If one of the pods it gave up on has an attached volume, the machine deletion gets stuck.

This PR aligns the behavior for volume detachment during deletion to what we're doing with the drains: if the node is unreachable, we skip detaching volumes.

Which issue(s) this PR fixes
Fixes #10661

/area machine

@k8s-ci-robot k8s-ci-robot added area/machine Issues or PRs related to machine lifecycle management cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 22, 2024
@k8s-ci-robot k8s-ci-robot requested a review from jackfrancis May 22, 2024 16:48
@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label May 22, 2024
@k8s-ci-robot
Copy link
Contributor

Hi @typeid. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Copy link
Contributor

@mjlshen mjlshen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is how OpenShift's handles this situation as well https://github.com/openshift/machine-api-operator/blob/dcf1387cb69f8257345b2062cff79a6aefb1f5d9/pkg/controller/machine/drain_controller.go#L164-L171. Furthermore, this allows pods to successfully be deleted, triggering CSI drivers to detach the relevant volumes, which allows the machine to be replaced

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 22, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 038bbf764ac6107032bcd184e1434415cc660f3d

@jackfrancis
Copy link
Contributor

@willie-yao @Jont828 Could you sanity check the equivalent functionality for MachinePoolMachines? Do we need changes there as well?

@sbueringer
Copy link
Member

@willie-yao @Jont828 Could you sanity check the equivalent functionality for MachinePoolMachines? Do we need changes there as well?

Do MachineHealthChecks work with MachinePool Machines? (I just don't remember that we implemented anything for it, at least in core CAPI)

Copy link
Member

@chrischdi chrischdi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 27, 2024
@willie-yao
Copy link
Contributor

Could you sanity check the equivalent functionality for MachinePoolMachines? Do we need changes there as well?

I know for a fact that MachineHealthCheck is not implemented for ClusterClass MachinePools, as stated in the "Later" section of issue #5991. @Jont828 are you able to verify if they are implemented for MachinePoolMachines?

@enxebre
Copy link
Member

enxebre commented May 29, 2024

@typeid can you clarify where exactly in this https://github.com/kubernetes/kubectl/blob/master/pkg/drain/drain.go#L268-L360 flow this makes the difference?

@mjlshen
Copy link
Contributor

mjlshen commented May 30, 2024

@enxebre GracePeriodSeconds (the difference) is in https://github.com/kubernetes/kubectl/blob/d3ad75324522280fa8bccd7ad949b34336f5fc84/pkg/drain/drain.go#L312
--> https://github.com/kubernetes/kubectl/blob/d3ad75324522280fa8bccd7ad949b34336f5fc84/pkg/drain/drain.go#L150
--> https://github.com/kubernetes/kubectl/blob/d3ad75324522280fa8bccd7ad949b34336f5fc84/pkg/drain/drain.go#L133-L136

Skipping a couple steps here, but GracePeriodSeconds then takes effect in the kube-apiserver https://github.com/kubernetes/apiserver/blob/259cd1817cb1cc73bc33029df69934ac5c0d07ea/pkg/registry/generic/registry/store.go#L1116 where the graceful will eventually evaluate to false after the grace period has expired and the pods will be deleted by the api server instead of being stuck in a Terminating state forever.

That said, this is also behavior that could also be handled via https://kubernetes.io/docs/concepts/cluster-administration/node-shutdown/#non-graceful-node-shutdown instead. The description exactly describes the symptoms of this bug as well.

When a node is shutdown but not detected by kubelet's Node Shutdown Manager, the pods that are part of a StatefulSet will be stuck in terminating status on the shutdown node and cannot move to a new running node. This is because kubelet on the shutdown node is not available to delete the pods so the StatefulSet cannot create a new pod with the same name. If there are volumes used by the pods, the VolumeAttachments will not be deleted from the original shutdown node so the volumes used by these pods cannot be attached to a new running node. As a result, the application running on the StatefulSet cannot function properly. If the original shutdown node comes up, the pods will be deleted by kubelet and new pods will be created on a different running node. If the original shutdown node does not come up, these pods will be stuck in terminating status on the shutdown node forever.

@enxebre
Copy link
Member

enxebre commented May 30, 2024

That said, this is also behavior that could also be handled via https://kubernetes.io/docs/concepts/cluster-administration/node-shutdown/#non-graceful-node-shutdown instead. The description exactly describes the symptoms of this bug as well.

I guess for us to take advantage of it the instance would need to be in the process of being shut down already, so it makes sense to me to mimic it earlier, at least for now.

Skipping a couple steps here, but GracePeriodSeconds then takes effect in the kube-apiserver https://github.com/kubernetes/apiserver/blob/259cd1817cb1cc73bc33029df69934ac5c0d07ea/pkg/registry/generic/registry/store.go#L1116 where the graceful will eventually evaluate to false after the grace period has expired and the pods will be deleted by the api server instead of being stuck in a Terminating state forever.

So does this effectively invalidates our SkipWaitForDeleteTimeoutSeconds, since all pods will match GracePeriodSeconds and get terminated first?

@typeid
Copy link
Contributor Author

typeid commented May 30, 2024

I initially had the same understanding as @mjlshen's explanation, but I did more digging and I'm no longer sure just setting the grace period to 1 is actually solving the issue here.

Reason:

  • if no grace period or -1 (CAPI case) is specified, we use the default deleteOptions: see here
  • we call the eviction api with these empty deleteOptions: see here
  • this results in using the default grace period for the object (30 seconds for pods if unspecified):
    • If this value is nil, the default grace period for the specified type will be used. source
    • For pods, this Defaults to 30 seconds. , see terminationGracePeriodSeconds here

So I would expect the following to happen (current logic as well as with 1 second grace period - nothing differs in terms of pod deletion):

However, at this point, the eviction API has already been called (both cases, current vs PR), and the deletion grace period should be set on the pod, so the api server should take care of it.
There seems to be no difference when it comes to actual pod deletion between explicitly setting it and letting it default, it should only decide the sooner or later.

I re-tested draining with grace period 1, and it doesn't delete the pods for my reproducer. I thought I had tried that, but by not working I think that confirms the logic above. Setting the grace period to 0, however, results in the pods deleting.

Further digging in apiserver...

The logic after that gets a bit complex, but it seems like setting it to 0 could be what want for immediate deletion.

I tried reproducing this with MAPI MHC, and it skips the pod deletion but doesn't get stuck replacing the node: I'm not sure what happens with the volume. Maybe it doesn't block the node deletion due to existing volumes? I'll have to dig further on that.

So does this effectively invalidates our SkipWaitForDeleteTimeoutSeconds, since all pods will match GracePeriodSeconds and get terminated first?

Not necessarily. I think in the case the pods still don't delete even with grace period = 1, we would still skip them after their deletionTimestamp is older than 5 minutes with that setting. MAPI actually sets SkipWaitForDeleteTimeoutSeconds to 1 as well, though (here and here). We might want to mimic that as well?

I'll sync up with @mjlshen to get the same understanding and see if I'm possibly missing something here. :)

@enxebre
Copy link
Member

enxebre commented May 30, 2024

With grace period 0 we run different logic on the apiserver side, return (graceful = false, gracefulPending =false , err = nil).
With grace period 1 (or -1, which results in deleteGracePeriod = pod setting, default 30 sec), we update the deletion timestamp and grace period: https://github.com/kubernetes/apiserver/blob/4aef12dc73bd08c68036e1f8990c869806cbab58/pkg/registry/rest/delete.go#L134 and return (graceful=true, gracefulpending=false, err = nil)

With a pending source code verification I assume the second case won't help in this scenario because it requires kubelet coordination via pod status, which is not possible because the kubelet is unreachable.

I tried reproducing this with MAPI MHC, and it skips the pod deletion but doesn't get stuck replacing the node: I'm not sure what happens with the volume. Maybe it doesn't block the node deletion due to existing volumes? I'll have to dig further on that.

Yes, the volume detaching decoupling form draining it's not implemented in mapi.

So to summarise, the erratic behaviour is that we have a semantic to ack an unreachable Node and an opinion on how react to it (We Assume that giving up on those pods after 5min is safe as the eviction request is satisfied by pdbs) but we are not consistent on that opinion for volumes dettachment.
So my preference would be that we add that semantic consistently to the volume detachment process. I.e I propose we add additional logic within isNodeVolumeDetachingAllowed / nodeVolumeDetachTimeoutExceeded to account for unreachable Nodes and give up after the same SkipWaitForDeleteTimeoutSeconds we have for drain.

FWIW see related old discussion when we introduced this semantic #2165 (comment)

@typeid
Copy link
Contributor Author

typeid commented May 30, 2024

That sounds good to me.

I think we should still make the drain consistent to what MAPI does for unreachable nodes: setting GracePeriod and the PodDeletionTimeout both to 1 second. While this still would not fix issue #10661 (to my current understanding), it would at least allow faster replacement of unreachable nodes. I will open a new issue for that and link a new PR to not cause extra confusion by remapping this PR.

@typeid typeid changed the title 🐛 MachineHealthCheck properly remediates unreachable nodes with volumes attached [WIP] 🐛 MachineHealthCheck properly remediates unreachable nodes with volumes attached May 30, 2024
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 30, 2024
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 31, 2024
@k8s-ci-robot k8s-ci-robot requested a review from mjlshen May 31, 2024 10:20
@k8s-ci-robot k8s-ci-robot removed the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label May 31, 2024
@enxebre
Copy link
Member

enxebre commented May 31, 2024

For history context this where we introduced permanent wait on detachment:
#4707

This is where we exposed it as API input:
#6285.

Now this PR aligns the behaviour with drain for unreachable Nodes.
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 31, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: d99ffbe32ca8dfdfe705b34e26485170293b14c8

@enxebre
Copy link
Member

enxebre commented May 31, 2024

ptal @chrischdi @sbueringer

@sbueringer
Copy link
Member

I'll take a look, but I need some time to get a full picture

@chrischdi
Copy link
Member

Implementation makes sense to me and matches the issue

/lgtm

@sbueringer
Copy link
Member

Just a quick update, I'm looking into this. Just have to double check something with folks downstream. I'll get back to you ASAP.

@sbueringer
Copy link
Member

sbueringer commented Jun 13, 2024

/lgtm

Slightly renamed the PR, because this affects more than just MHC.

/hold
I want to check with some other folks, will get back to you ASAP

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 13, 2024
@sbueringer sbueringer changed the title 🐛 MachineHealthCheck properly remediates unreachable nodes with volumes attached 🐛 Machine deletion skips waiting for volumes detached for unreachable Nodes Jun 13, 2024
@sbueringer
Copy link
Member

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 17, 2024
@sbueringer
Copy link
Member

Thx folks!

@sbueringer
Copy link
Member

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sbueringer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 17, 2024
@k8s-ci-robot k8s-ci-robot merged commit 9411031 into kubernetes-sigs:main Jun 17, 2024
36 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.8 milestone Jun 17, 2024
@sbueringer
Copy link
Member

/cherry-pick release-1.7

@k8s-infra-cherrypick-robot

@sbueringer: new pull request created: #10765

In response to this:

/cherry-pick release-1.7

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/machine Issues or PRs related to machine lifecycle management cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

MachineHealthCheck unable to remediate unreachable node with volumes attached
9 participants