Bug: cannot evict pod due to pdb #1508

Closed
Despire opened this issue Sep 18, 2024 · 2 comments
Labels
bug Something isn't working groomed Task that everybody agrees to pass the gatekeeper

Comments

@Despire
Contributor

Despire commented Sep 18, 2024

Current Behaviour

When deleting nodes from a k8s cluster, we first drain the node, which evicts its pods to other nodes of the cluster. However, if the drain would violate a pod disruption budget (PDB), it gets stuck indefinitely because we do not use any timeout.

The pod disruption budget can also be violated because of an unhealthy pod, in which case deleting that pod can sometimes unstick the eviction. The example below happened during a CI run, where manually deleting the unhealthy pod unstuck the eviction:

coredns-58cb6bf589-fthdc                            1/1     Running            0                3h19m
coredns-58cb6bf589-h58w6                            0/1     Running            0                3h18m

coredns-58cb6bf589-fthdc                            1/1     Running            0                3h19m
coredns-58cb6bf589-h58w6                            0/1     Terminating        0                3h18m
coredns-58cb6bf589-qghgb                            1/1     Running            0                5s

coredns-58cb6bf589-5xzr5                            1/1     Running            0                3s
coredns-58cb6bf589-fthdc                            1/1     Terminating        0                3h20m
coredns-58cb6bf589-qghgb                            1/1     Running            0                9s
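
Why a single unhealthy pod can block the eviction follows from how a PDB's allowed disruptions are derived: the number of healthy pods above the required minimum, floored at zero. The following is a minimal sketch of that arithmetic, not the code kuber runs. It assumes the coredns budget resolves to one required healthy pod; the issue does not state the actual PDB spec, so `requiredHealthy` is a hypothetical value.

```go
package main

import "fmt"

// disruptionsAllowed mirrors how a PodDisruptionBudget's allowed disruptions
// are derived: healthy pods above the required minimum, never below zero.
func disruptionsAllowed(currentHealthy, desiredHealthy int) int {
	if currentHealthy <= desiredHealthy {
		return 0
	}
	return currentHealthy - desiredHealthy
}

func main() {
	// Assumption: the budget requires at least one healthy coredns pod.
	const requiredHealthy = 1

	// First snapshot: one pod is 1/1 Ready, the other is 0/1.
	fmt.Println("one healthy pod:", disruptionsAllowed(1, requiredHealthy))

	// After the unhealthy pod is deleted and its replacement becomes Ready.
	fmt.Println("two healthy pods:", disruptionsAllowed(2, requiredHealthy))
}
```

Under this assumption, the first snapshot leaves no room for any eviction, while the state after the manual deletion allows exactly one, which matches the healthy pod on the drained node moving to Terminating in the last snapshot.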

Expected Behaviour

There should be a timeout when draining the node so that we do not wait for it indefinitely. After the timeout, we should check the drain output for eviction issues. We could then verify whether any pods of the affected deployment are unhealthy and try to restart them before retrying the drain on the node.

Steps To Reproduce

  1. Create a k8s cluster.
  2. Deploy a pod disruption budget that would be violated when deleting a node from the k8s cluster.
  3. Delete a node from the k8s cluster.
  4. See the eviction stuck indefinitely in kuber.

Despire added the bug label Sep 18, 2024

@bernardhalas
Member

Can this scenario be detected by alerting, with the resolution left to the user?

bernardhalas added the groomed label Sep 27, 2024

@bernardhalas
Member

There's a suspicion that the issue is not with PDB, but rather with the probes of coredns (or another issue with coredns). Will re-open with more details once we come across this issue again.

bernardhalas closed this as not planned Dec 13, 2024