NodeLocal DNS container hung on SIGTERM #453
Doesn't nodelocaldns start up right away, though? The DNS downtime should be O(seconds). Is that what you observe? It is possible for nodelocaldns to run into lock contention, but that usually produces a log message. Anything in the logs? |
@prameshj I was unable to collect any useful logs at the time of the latest failure. I assume there is some type of lock contention that causes the pod to hang on termination. NodeLocal DNS does start up right away, but our test failure comes when we verify that DNS works after disabling NodeLocal DNS. If we restart NodeLocal DNS and then stop it again, that usually fixes the node. |
Ah, I see. Just to confirm: 1) the test disables nodelocaldns, 2) the nodelocaldns pod gets stuck handling SIGTERM and is killed by kubelet, leaving its iptables rules behind, 3) the test times out with a DNS failure? Do you see a log line of the SIGTERM being handled?
It should call teardown in that case (Line 53 in 3b17e06).
It is possible that the pod handles SIGTERM and tries cleaning up iptables, but cannot get the lock. We do expose a metric on port 9353 for nodelocaldns lock errors, but we do not increment it for delete errors. We should check errors in dns/cmd/node-cache/app/cache_app.go (Line 157 in 3b17e06).
|
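For illustration, here is a minimal sketch of what counting teardown failures could look like. This is not the actual node-cache code: the metric name mirrors the one exposed on :9353, but the label value, table, chain, and rule below are assumptions.

```go
package main

import (
	"log"

	"github.com/prometheus/client_golang/prometheus"
	utiliptables "k8s.io/kubernetes/pkg/util/iptables"
	utilexec "k8s.io/utils/exec"
)

// setupErrCount mirrors the kind of counter exposed on :9353;
// the label value used below is illustrative, not the real one.
var setupErrCount = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "setup_errors_total",
		Help: "Errors encountered during iptables setup/teardown.",
	},
	[]string{"errortype"},
)

// teardownRules deletes a set of iptables rules and, unlike the
// behavior discussed above, also counts deletion failures.
func teardownRules(ipt utiliptables.Interface, rules [][]string) {
	for _, args := range rules {
		// DeleteRule is a no-op if the rule is already gone, so any
		// error here is a real failure (e.g. xtables lock contention).
		err := ipt.DeleteRule(utiliptables.Table("raw"), utiliptables.ChainPrerouting, args...)
		if err != nil {
			log.Printf("Failed deleting iptables rule %v: %v", args, err)
			setupErrCount.WithLabelValues("iptables_delete").Inc()
		}
	}
}

func main() {
	prometheus.MustRegister(setupErrCount)
	ipt := utiliptables.New(utilexec.New(), utiliptables.ProtocolIPv4)
	teardownRules(ipt, [][]string{
		// Hypothetical NOTRACK rule for the node-local DNS listen IP.
		{"-p", "udp", "-d", "169.254.20.10", "--dport", "53", "-j", "NOTRACK"},
	})
}
```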
@prameshj That is correct. We've updated the termination grace period to 900 seconds and still see this problem, although the failure rate has been much lower than in the past. Given the long grace period, it seems like there is a hang somewhere. We'll continue trying to collect more data when this problem occurs. |
We hit the problem again on Kubernetes version 1.20 with NodeLocal DNS version 1.17.3. Unfortunately, I don't have any additional debug data to provide. |
We hit this problem again on Kubernetes version 1.22 with NodeLocal DNS version 1.21.1. We are collecting debug data now to determine if we can find the root cause. |
Thanks. I have also opened #488 to count errors from rule deletions at teardown, in case that provides some hints. |
We hit the problem again on Kubernetes version 1.22 with NodeLocal DNS version 1.21.1. Here is the end of the log captured during pod termination:
|
Any metrics from node-cache? |
@prameshj unfortunately, I don't have any metrics captured when the failure occurred. What would you like us to collect? |
Thanks, we'll update our test to collect metrics once we pull in the NodeLocal DNS cache latest version. |
We were able to recreate the problem on NodeLocal DNS version 1.21.3. Here are the logs and metrics.
|
Thanks for sharing this. However, this does not include the "setup_errors_total" metric, which is exposed on port 9353; the other coredns metrics from the prometheus plugin are exposed on 9253. By any chance, would you be able to export these metrics to a dashboard, so we can see the values as a function of time? Also, the logs don't have an entry like "Failed deleting iptables rule", so it does not look like an iptables lock error :( |
@prameshj I'll fix our error collection to get metrics on port 9353. |
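For anyone else gathering this data, here is a small self-contained sketch that dumps both metrics endpoints. It assumes it runs on the node itself (node-local-dns uses host networking); adjust the addresses if your deployment differs.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// Dumps both node-cache metrics endpoints: :9353 serves the
// node-cache setup metrics (e.g. setup_errors_total) and :9253
// serves the standard coredns prometheus-plugin metrics.
func main() {
	client := &http.Client{Timeout: 5 * time.Second}
	for _, url := range []string{
		"http://127.0.0.1:9353/metrics",
		"http://127.0.0.1:9253/metrics",
	} {
		resp, err := client.Get(url)
		if err != nil {
			fmt.Printf("error fetching %s: %v\n", url, err)
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		fmt.Printf("== %s ==\n%s\n", url, body)
	}
}
```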
Here's recreate data for NodeLocal DNS version 1.21.3 on Kubernetes version 1.22: Logs:
Metrics:
|
Apologies for not taking a look, I will try to see what's going on within the week. |
Thank you. The problem continues but is hard to recreate. If there is any debug data that you'd like me to collect when we have a recreate, please let me know. |
What happened: kubernetes v1.25.6 with nodelocaldns 1.21.1, same problem.
What I did: I used kubespray v2.21.0 to install kubernetes v1.25.6. 1. The first time I ran the kubespray install, I configured the wrong kubelet flag, so kubeadm init failed and I interrupted it. 2. I fixed the kubelet flag problem and, since Ansible is idempotent, re-ran the cluster deployment. It eventually succeeded, but the nodelocaldns pod failed to start. |
Hello, I'm experiencing a somewhat related issue with dns-node-cache, leading to an endless crashloop backoff after logging "[INFO] Using Corefile /etc/coredns/Corefile". I'm installing k8s via KubeKey. |
Just popping in here to say that I'm experiencing the same behavior as the OP on Kubernetes v1.24.9 with node-local-dns v1.17.4. What I discovered earlier is that the SIGTERM at pod termination hangs and a SIGKILL follows. When the new pod starts up, all DNS traffic to it seems to fail. I haven't been able to validate this on a live node (this is based on forensics via logs and metrics), so I don't know if connections are simply timing out, being refused, or managing to connect to the node-local-dns service while it's unable to make outbound calls to resolve things. I've been seeing this happen periodically but didn't pin it down as the issue until today. If I get a repro I can try to provide more data. Logs just before the old pod dies:
At startup of the new pod, I do see the log entry for adding the nodelocaldns interface and the iptables rules, but nothing further happens from that point. Traffic to the pod's metrics port does work, and I was able to get metrics from it just fine; it just reported that it never received another DNS request. |
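While debugging this kind of failure, a quick probe against the node-local listen address can distinguish timeouts from refusals. A sketch, assuming the conventional 169.254.20.10 link-local IP and a cluster-internal name; substitute your own values.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	// Force lookups through the node-local-dns listen address so we
	// can see how the query fails: timeout, refused, or resolved.
	r := &net.Resolver{
		PreferGo: true,
		Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
			d := net.Dialer{Timeout: 2 * time.Second}
			return d.DialContext(ctx, network, "169.254.20.10:53")
		},
	}

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	addrs, err := r.LookupHost(ctx, "kubernetes.default.svc.cluster.local")
	if err != nil {
		// A timeout here matches the "rules present but nothing
		// answering" symptom; "connection refused" points elsewhere.
		fmt.Printf("lookup failed: %v\n", err)
		return
	}
	fmt.Printf("resolved: %v\n", addrs)
}
```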
An update here: I managed to repro this today. There's definitely something unusual going on. What's happening is that at startup of the replacement pod, the iptables rules never get added. I see in the logs where it claims to add them via the |
Nice find! nodelocaldns uses "k8s.io/kubernetes/pkg/util/iptables" to manage iptables rules. node-local-dns v1.17.4 is somewhat old and it uses
@isugimpy could you try nodelocaldns 1.22.20 to see if the newer iptables client works correctly? As for the nodelocaldns iptables usage, it's trivial and seems correct to me: source
|
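For context, that usage boils down to EnsureRule calls through the util above. Below is a simplified sketch of a setup loop that re-ensures the rules periodically; the interval, table, and rule are assumptions for illustration, not the actual node-cache source.

```go
package main

import (
	"log"
	"time"

	utiliptables "k8s.io/kubernetes/pkg/util/iptables"
	utilexec "k8s.io/utils/exec"
)

func main() {
	ipt := utiliptables.New(utilexec.New(), utiliptables.ProtocolIPv4)

	// Hypothetical NOTRACK rule for the node-local DNS listen IP.
	rule := []string{"-p", "udp", "-d", "169.254.20.10", "--dport", "53", "-j", "NOTRACK"}

	for range time.Tick(60 * time.Second) {
		// EnsureRule is idempotent: it returns true if the rule
		// already existed and adds it otherwise. A failure here
		// (e.g. xtables lock contention) surfaces as an error.
		existed, err := ipt.EnsureRule(utiliptables.Prepend,
			utiliptables.Table("raw"), utiliptables.ChainPrerouting, rule...)
		if err != nil {
			log.Printf("failed to ensure iptables rule: %v", err)
			continue
		}
		if !existed {
			log.Printf("iptables rule was missing and has been re-added")
		}
	}
}
```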
Fixed this issue with these steps:
|
Hi |
@yahalomimaor in the past, I was able to fix the problem by recreating the NodeLocal DNS pod on the affected node. |
We are still hitting the same problem reported by #394. The test failure occurred on Kubernetes version 1.21 with NodeLocal DNS cache version 1.17.3.
To recap: the NodeLocal DNS container occasionally hangs while handling SIGTERM during termination, so Kubernetes force-kills the container after the grace period has expired. This leaves stale iptables rules on the node, breaking DNS resolution. Our theory is that there is iptables lock contention between NodeLocal DNS, Calico, and/or Kubernetes.
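To help test the lock-contention theory, a small probe can report whether anything currently holds the xtables lock. This is a diagnostic sketch, not part of NodeLocal DNS; it assumes the standard lock path /run/xtables.lock that iptables flocks while mutating rules.

```go
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

// Tries a non-blocking flock on /run/xtables.lock, the file that
// iptables (and clients such as NodeLocal DNS, Calico, and kubelet)
// lock while mutating rules. EWOULDBLOCK means someone holds it.
func main() {
	fd, err := unix.Open("/run/xtables.lock", unix.O_CREAT|unix.O_RDWR, 0600)
	if err != nil {
		fmt.Fprintf(os.Stderr, "open: %v\n", err)
		os.Exit(1)
	}
	defer unix.Close(fd)

	if err := unix.Flock(fd, unix.LOCK_EX|unix.LOCK_NB); err == unix.EWOULDBLOCK {
		fmt.Println("xtables lock is currently held by another process")
	} else if err != nil {
		fmt.Fprintf(os.Stderr, "flock: %v\n", err)
		os.Exit(1)
	} else {
		fmt.Println("xtables lock is free")
		unix.Flock(fd, unix.LOCK_UN)
	}
}
```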