
[clustering] failure to discover peers when the number of instances and exposed ports is large #1208

Closed · 6 of 7 tasks · Tracked by #784 · Fixed by #1455
Labels: pir-action-item (Action Item from Post Incident Review)

thampiotr (Contributor) commented Jul 5, 2024

Issue

Alloy uses a headless service and queries DNS for SRV records to discover peer instances, which can then join the cluster via the memberlist-based gossip protocol provided by the ckit library.
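As a rough illustration of this discovery step (a minimal sketch, not Alloy's actual code; the service name below is hypothetical), an SRV lookup against the headless service returns one record per instance-and-port pair, and the resulting targets are what get handed to the gossip layer:

```go
// Minimal sketch of SRV-based peer discovery against a Kubernetes headless
// service. Each (instance, port) pair registered in the service shows up as
// one SRV record in the answer.
package main

import (
	"fmt"
	"net"
)

func main() {
	// Hypothetical headless service name; in a real deployment this comes
	// from configuration.
	_, srvs, err := net.LookupSRV("", "", "alloy-cluster.default.svc.cluster.local")
	if err != nil {
		fmt.Println("SRV lookup failed:", err)
		return
	}

	peers := make([]string, 0, len(srvs))
	for _, s := range srvs {
		peers = append(peers, fmt.Sprintf("%s:%d", s.Target, s.Port))
	}
	// These addresses would then be passed to the memberlist/ckit join call.
	fmt.Println("discovered peers:", peers)
}
```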

In some reported cases these queries were returning zero results, as evidenced by periodic logging of msg="rejoining peers", peers=""

The standard Go DNS library used by Alloy was returning no results for the query.
In contrast, queries run using the dig CLI utility were still returning valid results.
This indicates a bug or implementation difference between the Go DNS library and other DNS clients.

DNS queries are limited in size. The query is first attempted over UDP; when the response from the DNS server indicates that the message was truncated, the query is retried over TCP.
However, even the TCP response can be truncated. TCP transport still wraps an RFC 1035 DNS message, which is prefixed with a two-byte length field, so a DNS message carried over TCP can be at most 65,535 bytes.
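A minimal sketch of that RFC 1035 TCP framing (standard library only; the helper name is made up for illustration):

```go
// Sketch of RFC 1035 TCP framing: each DNS message sent over TCP is prefixed
// with a two-byte, big-endian length field, so the wire format can describe a
// payload of at most 0xFFFF = 65535 bytes. A response that would need more
// than that has to be cut short and flagged as truncated.
package main

import (
	"encoding/binary"
	"fmt"
)

func frameTCP(msg []byte) ([]byte, error) {
	if len(msg) > 0xFFFF {
		return nil, fmt.Errorf("DNS message too large for TCP framing: %d bytes", len(msg))
	}
	framed := make([]byte, 2+len(msg))
	binary.BigEndian.PutUint16(framed[:2], uint16(len(msg)))
	copy(framed[2:], msg)
	return framed, nil
}

func main() {
	ok, _ := frameTCP(make([]byte, 512))
	fmt.Println("512-byte message framed into", len(ok), "bytes on the wire")

	_, err := frameTCP(make([]byte, 70000)) // larger than the 2-byte prefix can describe
	fmt.Println(err)
}
```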

When a TCP message is truncated, the Truncated flag is set to 1.
(Screenshot: a truncated DNS response over TCP with the Truncated flag set and a partial set of Answer RRs.)

However, as seen above, there are still useful Answer RRs in the response, which clients can use on a best-effort basis.
dig does this, but Go didn't, due to this bug: golang/go#64896
The fix for this issue should be available in go1.23.
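For illustration, here is a hedged sketch of that "best effort" behaviour using the github.com/miekg/dns library (an assumption for this example; it is not necessarily what Alloy or the Go resolver use internally). Even when the response carries the Truncated flag, the Answer RRs that did fit can still be consumed:

```go
// Illustrative only: query SRV records over TCP and, mirroring dig's
// behaviour, still use whatever Answer RRs were returned even when the
// response is marked as truncated.
package main

import (
	"fmt"

	"github.com/miekg/dns"
)

func main() {
	m := new(dns.Msg)
	// Hypothetical headless-service name and resolver address.
	m.SetQuestion(dns.Fqdn("alloy-cluster.default.svc.cluster.local"), dns.TypeSRV)

	c := &dns.Client{Net: "tcp"}
	resp, _, err := c.Exchange(m, "10.96.0.10:53")
	if err != nil {
		fmt.Println("query failed:", err)
		return
	}
	if resp.Truncated {
		fmt.Println("response truncated; using partial answers on a best-effort basis")
	}
	for _, rr := range resp.Answer {
		if srv, ok := rr.(*dns.SRV); ok {
			fmt.Printf("%s:%d\n", srv.Target, srv.Port)
		}
	}
}
```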

But why are the DNS responses so large? When an Alloy cluster has, for example, 90 instances, and each of those instances has 10 different ports registered in the headless service, there will be 900 SRV records registered in DNS.
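A back-of-the-envelope sketch of that multiplication (the per-record byte size is a rough assumption, not a measured value):

```go
// Every (instance, port) pair in the headless service becomes one SRV record,
// so the answer section grows multiplicatively. The ~80 bytes per record is a
// rough assumption (it depends on name lengths, compression, and the
// additional-section A/AAAA records), used only to show why 900 records can
// blow past the 65535-byte TCP cap.
package main

import "fmt"

func main() {
	instances := 90
	portsPerInstance := 10
	records := instances * portsPerInstance // 900 SRV records

	const approxBytesPerRecord = 80 // rough assumption
	fmt.Printf("%d SRV records ≈ %d bytes (TCP cap: 65535 bytes)\n",
		records, records*approxBytesPerRecord)
}
```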

This is quite common: initially, the Alloy Helm chart only supported one port, but support for more ports was added later.
The k8s-monitoring Helm chart defines 9 ports.

Proposed fixes

A few tasks should be completed to resolve this issue:

Tasks
