Consul peer discovery: nodes can leave behind a service record in case of an unresolvable address #11233

frederikbosch · 2024-05-14T16:01:02Z

Describe the bug

RabbitMQ crashes when a hostname that is listed in Consul cannot be resolved to an IP address.
When RabbitMQ closes unexpectedly, a service might be left in the Consul service registry. When using orchestration tools like Nomad or Kubernetes, the orchestration tool should be made responsible for registering and deregistering services. Hence, RabbitMQ should only be reading from Consul.

Reproduction steps

To test PR #11045, I used a three-node cluster with two scenarios:

1. Also register the service in Consul using my orchestration tool (Nomad), with the meta key erlang-node-name set, under the same name (rabbitmq) that RabbitMQ uses when it registers the service.
2. Only let RabbitMQ do the registration in Consul.
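
To illustrate what the first scenario depends on, here is a minimal Python sketch that reads node names from the erlang-node-name meta key of Consul service entries. The payload is a made-up sample shaped like a response from Consul's /v1/health/service/<name> endpoint; the hostnames, service IDs and the helper name node_names_from_health are assumptions for illustration, and this is not how the RabbitMQ plugin itself is implemented.

```python
import json
from typing import List

# Hypothetical sample, shaped like a Consul health-service response.
# Hostnames and IDs are invented for illustration only.
SAMPLE_HEALTH_RESPONSE = json.dumps([
    {
        "Node": {"Node": "host-a"},
        "Service": {
            "ID": "rabbitmq-1",
            "Service": "rabbitmq",
            "Address": "rmq-1.example.internal",
            "Port": 5672,
            "Meta": {"erlang-node-name": "rabbit@rmq-1.example.internal"},
        },
    },
    {
        "Node": {"Node": "host-b"},
        "Service": {
            "ID": "rabbitmq-2",
            "Service": "rabbitmq",
            "Address": "rmq-2.example.internal",
            "Port": 5672,
            "Meta": {},  # registered without the meta key
        },
    },
])


def node_names_from_health(payload: str) -> List[str]:
    """Return the erlang-node-name meta value of every entry that has one."""
    names = []
    for entry in json.loads(payload):
        meta = entry.get("Service", {}).get("Meta") or {}
        name = meta.get("erlang-node-name")
        if name:
            names.append(name)
    return names


print(node_names_from_health(SAMPLE_HEALTH_RESPONSE))
```

Entries without the meta key are simply ignored in this sketch; a real client would need a fallback rule for them.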

The second scenario has one big downside: what if RabbitMQ did not close properly? Then the service remains in the registry. This could lead to an unrecoverable cluster. I actually ran into this scenario. What happened?

1. I stopped the cluster, and RabbitMQ did not have time to shut down properly, so the node was killed.
2. This left services in the registry.
3. When I restarted all the nodes, they queried Consul, saw the leftover services (status passing), and tried to join them.
4. This resulted in RabbitMQ crashing, because the hostname no longer resolved. In my cluster, Docker services only have a resolving FQDN while they are actually running.

=PROGRESS REPORT==== 14-May-2024::12:25:21.123386 ===
    supervisor: {local,inet_gethost_native_sup}
    started: [{pid,<0.97.0>},{mfa,{inet_gethost_native,init,[[]]}}]

=PROGRESS REPORT==== 14-May-2024::12:25:21.133443 ===
    supervisor: {local,kernel_safe_sup}
    started: [{pid,<0.96.0>},
              {id,inet_gethost_native_sup},
              {mfargs,{inet_gethost_native,start_link,[]}},
              {restart_type,temporary},
              {significant,false},
              {shutdown,1000},
              {child_type,worker}]

2024-05-14 12:25:21.171098+00:00 [notice] <0.44.0> Application mnesia exited with reason: stopped
2024-05-14 12:25:21.171789+00:00 [error] <0.253.0> 
2024-05-14 12:25:21.171789+00:00 [error] <0.253.0> BOOT FAILED
2024-05-14 12:25:21.171789+00:00 [error] <0.253.0> ===========
2024-05-14 12:25:21.171789+00:00 [error] <0.253.0> Exception during startup:
2024-05-14 12:25:21.171789+00:00 [error] <0.253.0> 
2024-05-14 12:25:21.171789+00:00 [error] <0.253.0> error:function_clause
2024-05-14 12:25:21.171789+00:00 [error] <0.253.0> 
2024-05-14 12:25:21.171789+00:00 [error] <0.253.0>     rabbit_peer_discovery:select_node_to_join/1, line 873
2024-05-14 12:25:21.171789+00:00 [error] <0.253.0>         args: [[]]
2024-05-14 12:25:21.171789+00:00 [error] <0.253.0>     rabbit_peer_discovery:sync_desired_cluster/3, line 206
2024-05-14 12:25:21.171789+00:00 [error] <0.253.0>     rabbit_db:init/0, line 66
2024-05-14 12:25:21.171789+00:00 [error] <0.253.0>     rabbit_boot_steps:-run_step/2-lc$^0/1-0-/2, line 51
2024-05-14 12:25:21.171789+00:00 [error] <0.253.0>     rabbit_boot_steps:run_step/2, line 58
2024-05-14 12:25:21.171789+00:00 [error] <0.253.0>     rabbit_boot_steps:-run_boot_steps/1-lc$^0/1-0-/1, line 22
2024-05-14 12:25:21.171789+00:00 [error] <0.253.0>     rabbit_boot_steps:run_boot_steps/1, line 23
2024-05-14 12:25:21.171789+00:00 [error] <0.253.0>     rabbit:start/2, line 978
2024-05-14 12:25:21.171789+00:00 [error] <0.253.0> 
2024-05-14 12:25:22.173737+00:00 [debug] <0.253.0> Set stop reason to: {error,function_clause}
2024-05-14 12:25:22.174281+00:00 [debug] <0.253.0> Change boot state to `stopped`

Although I have a scenario I can work with, I would suggest the following:

- Do not let RabbitMQ crash when the hostname does not resolve; skip that node as a join candidate instead.
- Allow RabbitMQ to not register at all (via a new config setting, cluster_formation.consul.svc_register = false).
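
The first suggestion can be sketched in Python as follows. This is only an illustration of the idea, not RabbitMQ's code: the function names resolves and pick_join_candidate are hypothetical, and picking the lexicographically smallest node name is an arbitrary stand-in for whatever selection rule the real implementation uses. The point is that unresolvable hosts are filtered out first, and an empty candidate list yields "nothing to join" rather than an exception (the trace above shows select_node_to_join/1 failing with function_clause when called with an empty list).

```python
import socket
from typing import Callable, Iterable, Optional


def resolves(host: str) -> bool:
    """Return True if the hostname currently resolves to an address."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except (socket.gaierror, UnicodeError):
        return False


def pick_join_candidate(
    nodes: Iterable[str],
    is_resolvable: Callable[[str], bool] = resolves,
) -> Optional[str]:
    """Pick a node to join, skipping nodes whose host does not resolve.

    Node names are expected in the form "rabbit@hostname".
    Returns None when no candidate is left, instead of raising.
    """
    candidates = [
        node for node in nodes
        if is_resolvable(node.split("@", 1)[-1])
    ]
    # Arbitrary but deterministic choice, for illustration only.
    return min(candidates) if candidates else None
```

Passing the resolver as a parameter keeps the selection logic testable without real DNS lookups, for example pick_join_candidate(names, lambda host: host != "gone").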

Expected behavior

RabbitMQ should skip any hostname that cannot be reached.
RabbitMQ should have the option to leave service registration to the (container) orchestration tool.

Additional context

No response

dumbbell · 2024-05-14T16:08:29Z

Thank you!

michaelklishin · 2024-05-14T16:17:14Z

Both can go into 3.13.x, so I've removed the 4.x prefix. Thank you for the detailed report, @frederikbosch 👏

frederikbosch added the bug label May 14, 2024

frederikbosch mentioned this issue May 14, 2024

rabbit_peer_discovery: Fixes and improvements for Consul and etcd #11045

Merged

michaelklishin changed the title ~~[4.x] RabbitMQ cluster in unrecoverable state using Consul peer discovery~~ Consul peer discovery: nodes can leave behind a service record in case of an unresolvable address May 14, 2024
