Using coredns daemonset instead of nodelocal dns #594
There are a number of reasons.

When running as cluster DNS, CoreDNS is configured with the Kubernetes plugin. This puts a watch on all EndpointSlices and Services (and other things, depending on your config). This means a persistent connection to the API server for each instance of CoreDNS, and the API server sending watch events down that channel for any changes to those resources. For clusters with thousands of nodes, that would put a substantial burden on the API server. NodeLocalDNS, on the other hand, is only a cache and a stub resolver. It does not put a watch on the API server. This makes it much less of a burden on the API server, and also makes it a much smaller process, since it does not need to use memory to hold those API resources.

NodeLocalDNS also solves a second problem. Early versions of Kubernetes would sometimes have failures due to the conntrack table filling up. This was found to be because UDP entries need to age out of the conntrack table, so a burst of DNS traffic could fill that table up (I seem to recall some kernel bugs may have also been involved, but this is several years ago). NodeLocalDNS turns off connection tracking for UDP traffic to the node-local DNS IP address, and it upgrades requests made to cluster DNS from UDP to TCP. TCP is not subject to this issue, since entries can be removed when the connection is closed.

Finally, even if we did use a DaemonSet, it wouldn't work the way you would hope. There is no guarantee that requests from a client would talk to the local CoreDNS instance. In fact, at the time NodeLocalDNS was created, it would be rare, because the local node would have no higher weight in the kube-proxy based load balancing. So if you had 1000 instances of CoreDNS, only 1/1000 of requests would go to your local CoreDNS instance. I am not sure if that has changed; there has been some work on more topology-aware services, but I am not sure how far it has progressed - you would have to check with SIG Network.
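For illustration, the difference shows up directly in the Corefiles. This is a simplified sketch, not the exact shipped configuration; 169.254.20.10 (the usual node-local link-local listen address) and 10.96.0.10 (a common cluster DNS service IP) are assumptions that vary per cluster:

    # Cluster DNS (CoreDNS Deployment): the kubernetes plugin is what opens
    # the watches on Services/EndpointSlices via the API server.
    .:53 {
        errors
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        forward . /etc/resolv.conf
        cache 30
    }

    # Node-local cache (NodeLocalDNS-style): no kubernetes plugin, no watches;
    # cluster-suffix queries are cached and forwarded to cluster DNS over TCP.
    cluster.local:53 {
        errors
        cache 30
        bind 169.254.20.10
        forward . 10.96.0.10 {
            force_tcp
        }
    }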
Thanks for the info @johnbelamaric.
    # Downward API: expose the node's own IP to the container as HOST_IP.
    - name: HOST_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.hostIP
Also keep in mind that with a CoreDNS daemonset there will be no guarantee that a client pod would talk to the local CoreDNS pod on the same node. Whereas with nodelocaldns (with its iptables rules) the client is guaranteed to talk to the local NLD pod on the same node.
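For reference, the node-local approach relies on rules roughly like the following, which node-cache manages on each node (simplified; the real rule set also covers TCP and the kube-dns service IP, and differs by setup):

    # Deliver queries for the node-local IP locally and skip connection tracking,
    # so DNS bursts cannot fill the conntrack table.
    iptables -t raw -A PREROUTING -d 169.254.20.10/32 -p udp --dport 53 -j NOTRACK
    iptables -t raw -A OUTPUT     -d 169.254.20.10/32 -p udp --dport 53 -j NOTRACK
    iptables -t filter -A INPUT   -d 169.254.20.10/32 -p udp --dport 53 -j ACCEPT
    iptables -t filter -A OUTPUT  -s 169.254.20.10/32 -p udp --sport 53 -j ACCEPT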
Why would it require more DIY? Couldn't it be implemented into coredns directly? Another idea: have a nodelocaldns container and a coredns sidecar container in the same pod and direct traffic from nodelocaldns to coredns via localhost; this would simplify the architecture while preserving the benefits of nodelocaldns without requiring new features or code changes.
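A rough sketch of what that sidecar layout might look like (the image tags, port, and names here are illustrative, not something the thread specifies):

    # One pod per node: node-cache handles the listen-IP/iptables glue and caching,
    # and forwards cluster-suffix queries to a CoreDNS sidecar over localhost.
    spec:
      containers:
      - name: node-cache
        image: registry.k8s.io/dns/k8s-dns-node-cache:1.23.1   # illustrative tag
        # its Corefile would forward cluster.local to 127.0.0.1:8053
      - name: coredns
        image: registry.k8s.io/coredns/coredns:1.11.1           # illustrative tag
        args: ["-conf", "/etc/coredns/Corefile", "-dns.port", "8053"]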
@johnbelamaric @dpasiukevich any updates?
Yes. It doesn't scale.
By the way, node local DNS is just a custom build of coredns with minimal plugins and a little glue to update the iptables. So, effectively, nodelocaldns is what you are saying. It just doesn't run the k8s plugin.

By the way, it's not just the API server that is the issue. It's a simple matter of cost efficiency. Imagine a 10,000 node cluster. If you want to use an extra 500MB on every node to cache the entire cluster's worth of services and headless endpoints, that is 5,000 GB of RAM. It's expensive. Much better to just have the node-local DNS cache with only the small DNS cache needed for the workloads on that node, which takes, say, < 50MB per node. @prameshj did a very detailed set of analyses before implementing this.
What I wrote was that the coredns container should be co-located in the same pod as nodelocaldns in order to avoid the extra infrastructure complexity.
Looking at the metrics from our cluster over the course of the last week, coredns did not consume more than 50 MB of memory.
That adds more infrastructure and complexity. For the sake of argument, it would be simpler and result in less overhead to compile the kubernetes plugin into nodelocaldns, and just run a Kubernetes-enabled nodelocaldns on each node by itself. Of course, with the kubernetes plugin in use, each instance of nodelocaldns would then require more memory (as much as CoreDNS uses). So it is still significantly more resource-expensive than the current solution.
The minimum amount of memory coredns uses is linearly related to the number of services and endpoints in the cluster. 50MB would suggest your cluster is not a large-scale cluster, and thus does not have a large number of services and endpoints.
That would not be the case. The minimum amount of memory coredns uses is linearly related to the number of services and endpoints in the cluster - not to the query load.
I don't think it is more complex than nodelocaldns daemonset + coredns deployment + dns autoscaler; however, using just nodelocaldns with the kubernetes plugin would be preferable. I'm not sure how it would deal with non-cached responses in that case, though?
What is considered a large cluster? There is no info on the number of services/endpoints within https://kubernetes.io/docs/setup/best-practices/cluster-large/. We are running ~500 services and ~500 endpoints.
Thanks for the clarification.
Per the link, 150,000 pods per cluster. Each pod can have multiple services and endpoints.
I expect endpoint churn (per unit time) to be a more useful number than the absolute number of endpoints.

There's nothing stopping one from using a DaemonSet with maxSurge=1 and maxUnavailable=0 together with internalTrafficPolicy: Local with the vanilla coredns image. The suggested way to autoscale coredns is proportional to cluster size, which is exactly the same as scaling with a daemonset, except with a configurable ratio. The suggested config in the doc is

As such I see no good argument to not run CoreDNS as a Daemonset. On the contrary, I can think of quite a few advantages to the Daemonset approach:
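For concreteness, the setup described above would look roughly like this (a sketch; names and the selector are placeholders):

    # DaemonSet rollout: surge one new pod per node before removing the old one.
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: coredns
      namespace: kube-system
    spec:
      updateStrategy:
        type: RollingUpdate
        rollingUpdate:
          maxSurge: 1
          maxUnavailable: 0
      # (selector and pod template omitted)
    ---
    # Service that keeps DNS traffic on the local node when a local endpoint exists.
    apiVersion: v1
    kind: Service
    metadata:
      name: kube-dns
      namespace: kube-system
    spec:
      internalTrafficPolicy: Local
      selector:
        k8s-app: kube-dns
      ports:
      - name: dns
        port: 53
        protocol: UDP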
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. /lifecycle stale
/remove-lifecycle stale
any updates?
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. /lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. /lifecycle rotten
Hello everyone, I just drafted a pull request to show how to use coredns and cilium to implement nodelocal dns. I've tested it and it worked, without duplication. Also, I noticed that @johnbelamaric said
In that case, if I'm using cilium and bpf to rewrite requests, can I use coredns instead, or are there any hidden pitfalls that I'm not aware of?
If you are getting your requests directed locally by cilium/bpf instead of the iptables rules that NodeLocalDNS installs, then yeah, running coredns should be OK.

The other thing it does is turn off connection tracking for those requests, so that you don't run into the conntrack overflow issues we have seen in the past. Does your solution handle that? There were some older kernel bugs that this also helped avoid, IIRC - not sure the status of those.

As discussed above, I still would not use the standard K8s DNS Corefile though - I would create a custom one that just enables cache and maybe stub domains for this. Definitely not the K8s plugin, especially if you have a large cluster. I don't recall if NodeLocalDNS has some special stub domain support or not, where it reads stub domain definitions from the API server. That tickles a memory, but it's been a long time since I looked.

Also, the NodeLocalDNS build of CoreDNS is stripped down to have as small a memory footprint as possible. If you use the standard CoreDNS, it will take more memory than the node-local one. Of course, you could build your own minimal CoreDNS image for this, too.
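Adding a stub domain to such a cache-only Corefile is just another server block; the domain and resolver address below are placeholders:

    # Stub domain forwarded straight to a site-specific resolver.
    corp.example.com:53 {
        cache 30
        forward . 10.0.0.53
    }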
Thanks for your reply, here are the advantages of node-cache/NodeLocalDNS as I summarize them:
I think I can test or handle those issues by:
I'll do more work to see which solution to adopt.
No, I think conntrack can fill up with any kernel version. The issue is that since UDP is connectionless, conntrack entries are expunged by timeout rather than when a connection closes. AIUI that issue is unrelated to the kernel bugs which caused problems a few years ago.
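To see whether this is actually biting on a node, comparing the live conntrack count against the limit is usually enough (standard Linux interfaces, shown as a sketch):

    # Current number of tracked connections vs. the table size limit.
    cat /proc/sys/net/netfilter/nf_conntrack_count
    cat /proc/sys/net/netfilter/nf_conntrack_max
    # "table full, dropping packet" messages in the kernel log indicate overflow.
    dmesg | grep -i conntrack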
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. /close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
The current recommended DNS architecture solution within a cluster includes NodeLocal DNS + CoreDNS deployment + DNS autoscaler.
To me it would seem preferable to use a much simpler solution - run CoreDNS as a daemonset.
Is there a downside to such a solution? Why does the recommended solution include a more complex architecture?
See also coredns/helm#86 (comment)