
[clustering] failure to discover peers when the number of instances and exposed ports is large #1208

Closed · 6 of 7 tasks · Tracked by #784 · Fixed by #1455
Labels: pir-action-item (Action Item from Post Incident Review)

thampiotr (Contributor) commented Jul 5, 2024

Issue

Alloy uses a headless service and queries DNS for SRV records to discover peer instances, which can then join the cluster via the memberlist-based gossip protocol provided by the ckit library.
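As a rough illustration of this discovery step (a minimal sketch, not Alloy's actual code; the service name below is hypothetical), an SRV lookup against the headless service returns one record per instance-and-port pair, and the resulting targets are what get handed to the gossip layer:

```go
// Minimal sketch of SRV-based peer discovery against a Kubernetes headless
// service. Each (instance, port) pair registered in the service shows up as
// one SRV record in the answer.
package main

import (
	"fmt"
	"net"
)

func main() {
	// Hypothetical headless service name; in a real deployment this comes
	// from configuration.
	_, srvs, err := net.LookupSRV("", "", "alloy-cluster.default.svc.cluster.local")
	if err != nil {
		fmt.Println("SRV lookup failed:", err)
		return
	}

	peers := make([]string, 0, len(srvs))
	for _, s := range srvs {
		peers = append(peers, fmt.Sprintf("%s:%d", s.Target, s.Port))
	}
	// These addresses would then be passed to the memberlist/ckit join call.
	fmt.Println("discovered peers:", peers)
}
```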

In some reported cases these queries were returning zero results, as evidenced by periodic logging of msg="rejoining peers", peers=""

The standard Go DNS library used by Alloy was returning no results for the query.
In contrast, queries run using the dig CLI utility were still returning valid results.
This indicates a bug or implementation difference between the Go DNS library and other DNS clients.

DNS queries are limited in size. The query is first attempted over UDP; when the response from the DNS server indicates that the message was truncated, the query is retried over TCP.
However, even the TCP response can be truncated. TCP transport still wraps an RFC 1035 DNS message, which is prefixed with a two-byte length field, so a DNS message carried over TCP can be at most 65,535 bytes.
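A minimal sketch of that RFC 1035 TCP framing (standard library only; the helper name is made up for illustration):

```go
// Sketch of RFC 1035 TCP framing: each DNS message sent over TCP is prefixed
// with a two-byte, big-endian length field, so the wire format can describe a
// payload of at most 0xFFFF = 65535 bytes. A response that would need more
// than that has to be cut short and flagged as truncated.
package main

import (
	"encoding/binary"
	"fmt"
)

func frameTCP(msg []byte) ([]byte, error) {
	if len(msg) > 0xFFFF {
		return nil, fmt.Errorf("DNS message too large for TCP framing: %d bytes", len(msg))
	}
	framed := make([]byte, 2+len(msg))
	binary.BigEndian.PutUint16(framed[:2], uint16(len(msg)))
	copy(framed[2:], msg)
	return framed, nil
}

func main() {
	ok, _ := frameTCP(make([]byte, 512))
	fmt.Println("512-byte message framed into", len(ok), "bytes on the wire")

	_, err := frameTCP(make([]byte, 70000)) // larger than the 2-byte prefix can describe
	fmt.Println(err)
}
```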

When a TCP message is truncated, the Truncated flag is set to 1.
(Screenshot: a truncated DNS response over TCP with the Truncated flag set and a partial set of Answer RRs.)

However, as seen above, there are still useful Answer RRs in the response, which clients can use on a best-effort basis.
dig does this, but Go didn't, due to this bug: golang/go#64896
The fix for this issue should be available in go1.23.
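For illustration, here is a hedged sketch of that "best effort" behaviour using the github.com/miekg/dns library (an assumption for this example; it is not necessarily what Alloy or the Go resolver use internally). Even when the response carries the Truncated flag, the Answer RRs that did fit can still be consumed:

```go
// Illustrative only: query SRV records over TCP and, mirroring dig's
// behaviour, still use whatever Answer RRs were returned even when the
// response is marked as truncated.
package main

import (
	"fmt"

	"github.com/miekg/dns"
)

func main() {
	m := new(dns.Msg)
	// Hypothetical headless-service name and resolver address.
	m.SetQuestion(dns.Fqdn("alloy-cluster.default.svc.cluster.local"), dns.TypeSRV)

	c := &dns.Client{Net: "tcp"}
	resp, _, err := c.Exchange(m, "10.96.0.10:53")
	if err != nil {
		fmt.Println("query failed:", err)
		return
	}
	if resp.Truncated {
		fmt.Println("response truncated; using partial answers on a best-effort basis")
	}
	for _, rr := range resp.Answer {
		if srv, ok := rr.(*dns.SRV); ok {
			fmt.Printf("%s:%d\n", srv.Target, srv.Port)
		}
	}
}
```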

But why are the DNS responses so large? When an Alloy cluster has, for example, 90 instances, and each of those instances has 10 different ports registered in the headless service, there will be 900 SRV records registered in DNS.
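A back-of-the-envelope sketch of that multiplication (the per-record byte size is a rough assumption, not a measured value):

```go
// Every (instance, port) pair in the headless service becomes one SRV record,
// so the answer section grows multiplicatively. The ~80 bytes per record is a
// rough assumption (it depends on name lengths, compression, and the
// additional-section A/AAAA records), used only to show why 900 records can
// blow past the 65535-byte TCP cap.
package main

import "fmt"

func main() {
	instances := 90
	portsPerInstance := 10
	records := instances * portsPerInstance // 900 SRV records

	const approxBytesPerRecord = 80 // rough assumption
	fmt.Printf("%d SRV records ≈ %d bytes (TCP cap: 65535 bytes)\n",
		records, records*approxBytesPerRecord)
}
```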

This is quite common: initially, the Alloy Helm chart only supported one port, but support for more ports was added later.
The k8s-monitoring Helm chart defines 9 ports.

Proposed fixes

A few tasks should be completed to resolve this issue:

Tasks
