NodeLocal DNS container hung on SIGTERM #394
Comments
What image version is this on?
Image version 1.15.13
How are you updating the nodelocaldns config file? Is it by editing the configmap/Corefile directly? I tried updating the kube-dns configmap (which triggers an update to the Corefile to pick up the new upstream servers), and that reloads fine. I also sent SIGTERM via docker to the node-local-dns pod and that was handled fine too. I tried with the 1.15.13 image as well.
Logs:
Do the rules get cleaned up when deleting the pod via kubectl? What env is this being run on?
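For reference, a rough way to check that on a node is something like the following (a sketch only; it assumes the stock manifests, where the cache listens on the default link-local IP 169.254.20.10):

```sh
# Delete the node-local-dns pod on the affected node and let the DaemonSet recreate it
kubectl -n kube-system delete pod <node-local-dns-pod-on-that-node>

# On the node itself, once the old pod has terminated (and before the replacement
# pod re-adds them), check whether the rules node-cache installed for its
# link-local IP are gone
iptables-save | grep 169.254.20.10
```

If the grep still prints rules after the old pod is fully gone, the teardown did not run.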
@prameshj I updated the configmap via
Is there a way to reproduce it?
Yes, but you'll likely have to run the test many times. Basically, update the configmap and then terminate the pod at about the same time.
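To spell that out, the repro loop is roughly the following (illustrative only; the configmap and label names assume the standard node-local-dns manifests):

```sh
# 1. Make a trivial change to the Corefile data so node-cache picks up a reload,
#    e.g. add or change an upstream server in the forward block.
kubectl -n kube-system edit configmap node-local-dns

# 2. At about the same time, delete the node-local-dns pod(s) so SIGTERM lands
#    while the reload may be in progress.
kubectl -n kube-system delete pod -l k8s-app=node-local-dns --wait=false

# 3. Watch for a pod stuck in Terminating; repeat steps 1-2 if none gets stuck.
kubectl -n kube-system get pods -l k8s-app=node-local-dns -w
```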
OK, a few more questions: how is it manifesting in your environment? Is the config being changed frequently or via an automated mechanism?
@prameshj We see the failure in our automated testing. We run Calico, which uses hostNetwork and modifies iptables rules. The SIGKILL happens because the Kubernetes termination grace period expires (the default is 30 seconds; we raised it to 15 minutes and still see the hang). The hang normally only happens on one node during the test.
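(For reference, raising the grace period was just a patch of terminationGracePeriodSeconds on the DaemonSet; roughly, assuming the stock resource names:)

```sh
# Raise the termination grace period from the default 30s to 15 minutes
kubectl -n kube-system patch daemonset node-local-dns --type merge \
  -p '{"spec":{"template":{"spec":{"terminationGracePeriodSeconds":900}}}}'
```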
Update: This may not be related to the config reload. I've seen a couple instances of this problem occur when there was no reload in progress. It seems to me that this hang may be related to iptables lock contention rather than config reload.
It seems flannel-io/flannel#988 might be related. Is there any "resource unavailable" error in any of the hostNetwork pods or in journalctl/syslog on the node?
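A quick way to look for that on an affected node might be something like this (illustrative; exact log locations and unit names vary by distro):

```sh
# Kernel and kubelet logs around the time of the hang
dmesg -T | grep -i "resource temporarily unavailable"
journalctl -u kubelet --since "2 hours ago" | grep -i "resource temporarily unavailable"

# Logs of the hostNetwork pods on that node, e.g. the node-local-dns pod itself
kubectl -n kube-system logs <node-local-dns-pod> --tail=500 | grep -i "resource"
```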
@prameshj I'll check the next time I see the problem.
@prameshj I didn't see any "resource unavailable" errors during the latest recreate of this problem.
I'm experiencing a similar issue. I can confirm that this does not happen on config reloads. From a Kubernetes point of view, the pod is marked as
You can see that it finished on Oct 18th, and my last config change was on the 13th.
Extract from
It's interesting that Docker says the container's been up for 2 weeks but it also says that the status is
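For completeness, the state Docker reports can be dumped directly; something like the following (the name filter is just an example, the kubelet-created container is usually named k8s_node-cache_...):

```sh
# Find the node-cache container and dump the state Docker has for it
docker ps -a | grep node-cache
docker inspect --format '{{json .State}}' <container-id>
```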
The logs don't contain a timestamp, but I checked on Kibana and managed to extract them:
Based on the time difference I think those messages regarding xtables lock in use are unrelated. At this point it's unclear to me what's sending the SIGTERM. This server is running on AWS, but not EKS. I'm running k8s 1.16.13. I don't see any "resource unavailable" logs. I have this node cordoned and with the pod in
I’ve found out that the node became NotReady shortly before the node-local-dns-cache container received the SIGTERM. This happened on the two nodes where we experienced this issue. It’s interesting to note that the pod that received the SIGTERM has the node.kubernetes.io/not-ready toleration, given that it belongs to a DaemonSet.
That's interesting. Did the Node become ready again? Is the node-local-dns pod still stuck while terminating?
Yes, the node became Ready again and all the pods that needed DNS were in CrashLoopBackOff. As shown in the logs I shared, the container was not running (there was no process for it), but according to
There's no trace of any SIGKILL in my case; only the SIGTERM I shared.
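If it helps, the places I'd expect a SIGKILL or OOM kill to show up are the kernel log and the kubelet log, e.g. (the date and unit name here are only examples):

```sh
# Kernel-level kills (OOM killer) around the incident
dmesg -T | grep -iE "oom|killed process"

# kubelet-initiated kills of the node-local-dns pod
journalctl -u kubelet --since "2020-10-18" | grep -iE "killing|node-local-dns"
```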
Issues go stale after 90d of inactivity. If this issue is safe to close now, please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale
I realize this is from several weeks back, but any idea why they were in CrashLoopBackOff? Was it because port 53 was in use? Is there a reliable way to repro this?
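For the port 53 question, a quick check on the node would be something like this (a sketch; 169.254.20.10 is the default node-local-dns listen IP, adjust if overridden):

```sh
# Anything still bound to port 53 on the node?
ss -lntup | grep ':53 '

# Any stale node-cache rules still pointing at the link-local IP?
iptables-save | grep 169.254.20.10
```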
Hey @prameshj! I have not seen this problem again. Port 53 was not in use. Unfortunately I'm not able to reproduce this condition.
Thanks, closing for now.
We have not been able to recreate this problem on versions 1.16 and later. Thank you. |
@prameshj Can you please reopen this issue? I have recreated the problem on version 1.17.3. Thank you. |
Should we reopen this or do you want to create a new issue? I see there are 2 different scenarios described in this issue. It might be cleaner to open a new one with the symptoms you are seeing.
@prameshj Thanks, I'll create a new issue the next time I recreate this on version 1.17.3.
We've seen the NodeLocal DNS container hang when receiving a SIGTERM. This causes the container to eventually be terminated via SIGKILL, leaving iptables rules behind. Here is an example of the container logs captured during this hang.
I believe the hang is occurring when SIGTERM is received during a reload of the config.
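For anyone else hitting this, the leftover rules can be spotted, and if necessary removed by hand, roughly as follows. This is a sketch only: 169.254.20.10 is the default local listen IP, and the exact rule set depends on the node-cache version and flags, so list first and delete only what you actually see.

```sh
# List the rules node-cache installs for its link-local IP
iptables-save | grep 169.254.20.10

# Example of deleting one leftover rule (mirror an -A line printed above as a -D)
iptables -t raw -D PREROUTING -d 169.254.20.10/32 -p udp --dport 53 -j NOTRACK
```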