test: issues with DaemonSet pods not coming up after a series of reboots #9870

Closed
Tracked by #9825
smira opened this issue Dec 3, 2024 · 0 comments · Fixed by #9876
smira commented Dec 3, 2024

The Talos integration tests perform a series of (pretty frequent) reboots, specifically in the tests which run with encrypted volumes, as the encryption config changes require a reboot.

This was specifically triggered by the test in #9834, which adds two more reboots and, due to the order of the tests, runs right after the encryption tests.

The issue is somewhat random and shows up as the test timing out on the cluster health check, with an error that the number of ready kube-proxy pods doesn't reach the desired value (3 out of 4).

Analysis

When the node goes into a reboot cycle, Talos instructs the kubelet to do a graceful shutdown, which terminates the pods, including DaemonSet pods. There is a bit of a race with kube-scheduler there, but in the end there will be a pod in the Failed phase, because the kubelet refuses to run new pods while it is itself in the graceful shutdown phase.

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-12-03T14:45:34Z"
    message: Pod was terminated in response to imminent node shutdown.
    reason: TerminationByKubelet
    status: "True"
    type: DisruptionTarget

As the machine comes back up after the reboot, the existing pod in the Failed state prevents a new pod from being scheduled on the node for some time.
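For what it's worth, the leftover Failed pod is easy to spot. Below is a minimal client-go sketch (not part of the test suite; the kubeconfig path is the default home location, and the node name is the one from the log further down, so adjust both as needed):

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig from the default location (~/.kube/config); adjust for the test cluster.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}

	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// List pods stuck in the Failed phase on the affected worker node; these are the ones
	// blocking the DaemonSet controller from creating replacements right away.
	pods, err := clientset.CoreV1().Pods("kube-system").List(context.Background(), metav1.ListOptions{
		FieldSelector: "status.phase=Failed,spec.nodeName=talos-default-worker-1",
	})
	if err != nil {
		panic(err)
	}

	for _, pod := range pods.Items {
		fmt.Printf("%s: %s (%s)\n", pod.Name, pod.Status.Reason, pod.Status.Message)
	}
}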

The Failed pods are supposed to be cleaned up by the DaemonSetsController in the kube-controller-manager: https://github.com/kubernetes/kubernetes/blob/8046362e6ff74ee18776e0cdb90ead62c577d607/pkg/controller/daemon/daemon_controller.go#L804-L826

That cleanup has a backoff introduced in kubernetes/kubernetes#65309, to fight other issues related to misconfigured pods.

But after a series of reboots, the failed pod cleanup gets delayed by the backoff mechanism long enough for the Talos cluster health checks to fail:

I1203 15:12:52.834783       1 daemon_controller.go:813] "Deleting failed pod on node has been limited by backoff" logger="daemonset-controller" pod="kube-system/kube-flannel-zftvs" node="talos-default-worker-1" currentDelay="4m16s"
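To illustrate how the delay grows to the value above, here is a standalone sketch using client-go's flowcontrol.Backoff with what look like the same parameters as the hardcoded controller backoff (1s initial delay, 15m cap). The key string is hypothetical; in the real controller it is derived from the DaemonSet and the node (more on that below under Solutions):

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/flowcontrol"
)

func main() {
	// Assumed to match the DaemonSet controller's failed-pod cleanup backoff: 1s initial, 15m cap.
	backoff := flowcontrol.NewBackOff(1*time.Second, 15*time.Minute)

	// Hypothetical key; the real one stays stable across reboots of the same node,
	// so every new Failed pod keeps feeding the same backoff entry.
	key := "ds-uid/1/talos-default-worker-1"

	// Each cleanup attempt for the same key doubles the delay: 1s, 2s, 4s, ...
	// After nine attempts the delay is 4m16s, matching the log line above.
	for i := 1; i <= 9; i++ {
		backoff.Next(key, time.Now())
		fmt.Printf("attempt %d: next cleanup delayed by %s\n", i, backoff.Get(key))
	}
}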

The higher the reboot rate, the more often the issue pops up.

Solutions

It looks like the backoff is hardcoded and can't be removed or reconfigured via any options: https://github.com/kubernetes/kubernetes/blob/8046362e6ff74ee18776e0cdb90ead62c577d607/cmd/kube-controller-manager/app/apps.go#L51-L52

  1. Restarting kube-controller-manager on reboots (e.g. on non-controlplane reboots). It should help, as the backoff state is kept in memory.
  2. Restarting the affected DaemonSets, since the backoff key includes the observed generation: https://github.com/kubernetes/kubernetes/blob/8046362e6ff74ee18776e0cdb90ead62c577d607/pkg/controller/daemon/daemon_controller.go#L1353-L1356. If the observed generation goes up, the backoff key changes (see the sketch after this list).
  3. Adding more worker nodes to the tests. We have just one, and that bothers me; I would prefer to have at least two. As the backoff key also depends on the nodeName, this would allow roughly twice the reboot rate.
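For context, the backoff key appears to be built roughly like this (a paraphrase of the linked daemon_controller.go lines, not verbatim upstream code), which is why both bumping the observed generation (option 2) and spreading reboots across more nodes (option 3) start from a fresh backoff:

package daemonset

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
)

// failedPodsBackoffKey paraphrases how the DaemonSet controller keys its failed-pod backoff:
// the key combines the DaemonSet UID, its observed generation and the node name, so a new
// generation or a different node yields a new key and a fresh 1s delay instead of the
// accumulated multi-minute one.
func failedPodsBackoffKey(ds *appsv1.DaemonSet, nodeName string) string {
	return fmt.Sprintf("%s/%d/%s", ds.UID, ds.Status.ObservedGeneration, nodeName)
}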