Remove fail-fast on cluster bootstrap when peers discovery fails #1513

Merged: 1 commit, Aug 20, 2024

@thampiotr (Contributor) commented Aug 20, 2024

PR Description

This removes the fail-fast behaviour that aborted startup when peer discovery failed with clustering enabled.

  • It was originally added to prevent incidents in which large clusters end up in split brain.
  • However, it added friction when bootstrapping new clusters.
  • We decided to go back to the previous behaviour because we have also made cluster split-brain scenarios easier to diagnose, so we trust they will be easier to fix.

Which issue(s) this PR fixes

Fixes #1441

Notes to the Reviewer

Before:

piotrwork@bitp-ThinkPad-X1-Carbon-2nd ~/w/bin> ./alloy-linux-amd64 run ../empty-config.alloy --cluster.enabled --cluster.join-addresses wrong-address
ts=2024-08-20T15:23:18.54020246Z level=info "boringcrypto enabled"=false
ts=2024-08-20T15:23:18.540229048Z level=warn msg="could not find advertise address using network interfaces" service=cluster "[eth0 en0]"="falling back to localhost" err="no useable address found for interfaces [eth0 en0]: 2 errors occurred:\n\t* interface \"eth0\": route ip+net: no such network interface\n\t* interface \"en0\": route ip+net: no such network interface\n\n"
ts=2024-08-20T15:23:18.540249275Z level=info msg="running usage stats reporter"
ts=2024-08-20T15:23:18.540261028Z level=info msg="starting complete graph evaluation" controller_path=/ controller_id="" trace_id=522419acb593bc95219efe3ea78a5342
ts=2024-08-20T15:23:18.540272167Z level=info msg="finished node evaluation" controller_path=/ controller_id="" trace_id=522419acb593bc95219efe3ea78a5342 node_id=tracing duration=5.432µs
ts=2024-08-20T15:23:18.540283724Z level=info msg="finished node evaluation" controller_path=/ controller_id="" trace_id=522419acb593bc95219efe3ea78a5342 node_id=logging duration=82.308µs
ts=2024-08-20T15:23:18.54034534Z level=info msg="finished node evaluation" controller_path=/ controller_id="" trace_id=522419acb593bc95219efe3ea78a5342 node_id=remotecfg duration=49.865µs
ts=2024-08-20T15:23:18.540377173Z level=info msg="finished node evaluation" controller_path=/ controller_id="" trace_id=522419acb593bc95219efe3ea78a5342 node_id=livedebugging duration=7.531µs
ts=2024-08-20T15:23:18.540390804Z level=info msg="finished node evaluation" controller_path=/ controller_id="" trace_id=522419acb593bc95219efe3ea78a5342 node_id=otel duration=1.138µs
ts=2024-08-20T15:23:18.54040593Z level=info msg="finished node evaluation" controller_path=/ controller_id="" trace_id=522419acb593bc95219efe3ea78a5342 node_id=labelstore duration=4.47µs
ts=2024-08-20T15:23:18.54043175Z level=info msg="applying non-TLS config to HTTP server" service=http
ts=2024-08-20T15:23:18.540442776Z level=info msg="finished node evaluation" controller_path=/ controller_id="" trace_id=522419acb593bc95219efe3ea78a5342 node_id=http duration=17.944µs
ts=2024-08-20T15:23:18.540459265Z level=info msg="finished node evaluation" controller_path=/ controller_id="" trace_id=522419acb593bc95219efe3ea78a5342 node_id=ui duration=1.043µs
ts=2024-08-20T15:23:18.540477535Z level=info msg="finished node evaluation" controller_path=/ controller_id="" trace_id=522419acb593bc95219efe3ea78a5342 node_id=cluster duration=1.059µs
ts=2024-08-20T15:23:18.540488853Z level=info msg="finished complete graph evaluation" controller_path=/ controller_id="" trace_id=522419acb593bc95219efe3ea78a5342 duration=354.037µs
ts=2024-08-20T15:23:18.540585946Z level=info msg="scheduling loaded components and services"
ts=2024-08-20T15:23:18.541361538Z level=info msg="now listening for http traffic" service=http addr=127.0.0.1:12345
ts=2024-08-20T15:23:18.547902867Z level=warn msg="failed to resolve SRV records" service=cluster addr=wrong-address err="lookup wrong-address on 127.0.0.53:53: no such host"
ts=2024-08-20T15:23:18.548032388Z level=error msg="fatal error: failed to get peers to join at startup - this is likely a configuration error" service=cluster err="static peer discovery: failed to find any valid join addresses: failed to extract host and port: address wrong-address: missing port in address\nfailed to resolve SRV records: lookup wrong-address on 127.0.0.53:53: no such host"

After:

piotrwork@bitp-ThinkPad-X1-Carbon-2nd ~/w/alloy (main)> ./build/alloy run ../empty-config.alloy --cluster.enabled --cluster.join-addresses wrong-address
ts=2024-08-20T15:18:49.654876797Z level=info "boringcrypto enabled"=false
ts=2024-08-20T15:18:49.654903537Z level=warn msg="could not find advertise address using network interfaces" service=cluster "[eth0 en0]"="falling back to localhost" err="no useable address found for interfaces [eth0 en0]: 2 errors occurred:\n\t* interface \"eth0\": route ip+net: no such network interface\n\t* interface \"en0\": route ip+net: no such network interface\n\n"
ts=2024-08-20T15:18:49.654924602Z level=info msg="using provided peers for discovery" service=cluster join_peers=wrong-address
ts=2024-08-20T15:18:49.654932814Z level=info msg="running usage stats reporter"
ts=2024-08-20T15:18:49.654938345Z level=info msg="starting complete graph evaluation" controller_path=/ controller_id="" trace_id=986117b1634ab467331a43e147f0f81f
ts=2024-08-20T15:18:49.654953507Z level=info msg="finished node evaluation" controller_path=/ controller_id="" trace_id=986117b1634ab467331a43e147f0f81f node_id=logging duration=83.215µs
ts=2024-08-20T15:18:49.654979292Z level=info msg="finished node evaluation" controller_path=/ controller_id="" trace_id=986117b1634ab467331a43e147f0f81f node_id=otel duration=2.731µs
ts=2024-08-20T15:18:49.655008395Z level=info msg="finished node evaluation" controller_path=/ controller_id="" trace_id=986117b1634ab467331a43e147f0f81f node_id=livedebugging duration=13.815µs
ts=2024-08-20T15:18:49.655030908Z level=info msg="finished node evaluation" controller_path=/ controller_id="" trace_id=986117b1634ab467331a43e147f0f81f node_id=labelstore duration=5.755µs
ts=2024-08-20T15:18:49.655050793Z level=info msg="finished node evaluation" controller_path=/ controller_id="" trace_id=986117b1634ab467331a43e147f0f81f node_id=tracing duration=5.032µs
ts=2024-08-20T15:18:49.655110337Z level=info msg="finished node evaluation" controller_path=/ controller_id="" trace_id=986117b1634ab467331a43e147f0f81f node_id=remotecfg duration=45.207µs
ts=2024-08-20T15:18:49.655134151Z level=info msg="applying non-TLS config to HTTP server" service=http
ts=2024-08-20T15:18:49.65514668Z level=info msg="finished node evaluation" controller_path=/ controller_id="" trace_id=986117b1634ab467331a43e147f0f81f node_id=http duration=22.868µs
ts=2024-08-20T15:18:49.655156813Z level=info msg="finished node evaluation" controller_path=/ controller_id="" trace_id=986117b1634ab467331a43e147f0f81f node_id=ui duration=766ns
ts=2024-08-20T15:18:49.655179483Z level=info msg="finished node evaluation" controller_path=/ controller_id="" trace_id=986117b1634ab467331a43e147f0f81f node_id=cluster duration=993ns
ts=2024-08-20T15:18:49.655195661Z level=info msg="finished complete graph evaluation" controller_path=/ controller_id="" trace_id=986117b1634ab467331a43e147f0f81f duration=383.2µs
ts=2024-08-20T15:18:49.655283312Z level=info msg="scheduling loaded components and services"
ts=2024-08-20T15:18:49.655919269Z level=info msg="now listening for http traffic" service=http addr=127.0.0.1:12345
ts=2024-08-20T15:18:49.66284907Z level=warn msg="failed to resolve provided join address" service=cluster addr=wrong-address
ts=2024-08-20T15:18:49.662986065Z level=warn msg="failed to get peers to join at startup; will create a new cluster" service=cluster err="static peer discovery: failed to find any valid join addresses: could not parse as an IP or IP:port address: \"wrong-address\"\nfailed to resolve \"A/AAAA\" records: lookup wrong-address on 127.0.0.53:53: server misbehaving\nfailed to resolve \"SRV\" records: lookup wrong-address on 127.0.0.53:53: no such host"
ts=2024-08-20T15:18:49.663031868Z level=info msg="starting cluster node" service=cluster peers_count=0 peers="" advertise_addr=127.0.0.1:12345
ts=2024-08-20T15:18:49.663339681Z level=info msg="peers changed" service=cluster peers_count=1 peers=bitp-ThinkPad-X1-Carbon-2nd
ts=2024-08-20T15:19:49.671271053Z level=warn msg="failed to resolve provided join address" service=cluster addr=wrong-address
ts=2024-08-20T15:19:49.671315959Z level=warn msg="failed to refresh list of peers" service=cluster err="static peer discovery: failed to find any valid join addresses: could not parse as an IP or IP:port address: \"wrong-address\"\nfailed to resolve \"A/AAAA\" records: lookup wrong-address on 127.0.0.53:53: server misbehaving\nfailed to resolve \"SRV\" records: lookup wrong-address on 127.0.0.53:53: no such host"
^Cinterrupt received

PR Checklist

  • CHANGELOG.md updated
  • Documentation added
  • Tests updated
  • Config converters updated

@thampiotr thampiotr changed the title from "Remove fail-fast behaviour on cluster bootstrap when peers discovery fails" to "Remove fail-fast on cluster bootstrap when peers discovery fails" Aug 20, 2024
@thampiotr thampiotr marked this pull request as ready for review August 20, 2024 15:32
@mattdurham (Collaborator) left a comment

Nice and simple!

@thampiotr thampiotr merged commit 5f50950 into main Aug 20, 2024
18 checks passed
@thampiotr thampiotr deleted the thampiotr/dont-fail-fast-cluster-bootstrap branch August 20, 2024 18:00
thampiotr added a commit that referenced this pull request Aug 23, 2024
thampiotr added a commit that referenced this pull request Aug 23, 2024
* Remove fail-fast behaviour on cluster bootstrap when peers discovery fails (#1513)

(cherry picked from commit 5f50950)

* Fix memory leak in `loki.process` on config update (#1431)

* Cleanup loki.process on update

* Fix goroutine leaks in other unit tests

* Refactor unit test

* Cleanup unit test code
* close output channels
* stop the updating process first

* Increase timeout for Mimir ruler test

(cherry picked from commit 5bca979)

* changelog wording

* update docker command in integration tests (#1421)

(cherry picked from commit b59e6c3)

---------

Co-authored-by: Paulin Todev <[email protected]>
Co-authored-by: William Dumont <[email protected]>
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 20, 2024
Development

Successfully merging this pull request may close these issues.

Alloy 1.3.0 cluster mode fails to start with new cluster on non-Kubernetes platform