RabbitMQ crashes when a hostname that is listed in Consul cannot be resolved to an IP address.
When RabbitMQ shuts down unexpectedly, its service can be left behind in the Consul service registry. When an orchestration tool such as Nomad or Kubernetes is used, that tool should be responsible for registering and deregistering services, and RabbitMQ should only read from Consul.
Reproduction steps
In order to test PR #11045, I used a three-node cluster with two scenarios:
1. Also register the service in Consul from my orchestration tool (Nomad), with the service meta `erlang-node-name` set, under the same name (`rabbitmq`) that RabbitMQ uses to register the service.
2. Let only RabbitMQ register the service in Consul.
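To make the role of the `erlang-node-name` meta in scenario 1 concrete, here is a minimal Python sketch of how a discovery client might turn entries from Consul's `/v1/health/service/rabbitmq` endpoint into Erlang node names. It assumes the entry shape that endpoint returns (`Node`, `Service`, `Checks`). The helper names `node_name` and `node_names`, and the `rabbit@<address>` fallback, are hypothetical illustrations, not RabbitMQ's actual (Erlang) implementation.

```python
# Hypothetical sketch, not RabbitMQ's implementation: map Consul health
# entries (shape of GET /v1/health/service/<name>) to Erlang node names.
from typing import Any, Dict, List

META_KEY = "erlang-node-name"  # service meta key used in scenario 1


def node_name(entry: Dict[str, Any], prefix: str = "rabbit") -> str:
    service = entry.get("Service") or {}
    meta = service.get("Meta") or {}
    # Prefer the node name published by the registrar (e.g. Nomad).
    if META_KEY in meta:
        return meta[META_KEY]
    # Assumed fallback: <prefix>@<service address, else Consul node name>.
    host = service.get("Address") or (entry.get("Node") or {}).get("Node", "")
    return f"{prefix}@{host}"


def node_names(entries: List[Dict[str, Any]]) -> List[str]:
    # When Nomad and RabbitMQ both register under the same service name,
    # one node may show up twice; keep the first occurrence only.
    names: List[str] = []
    for entry in entries:
        name = node_name(entry)
        if name not in names:
            names.append(name)
    return names


if __name__ == "__main__":
    entries = [
        {"Node": {"Node": "n1"},
         "Service": {"Address": "rmq-1.example",
                     "Meta": {"erlang-node-name": "rabbit@rmq-1.example"}}},
        {"Node": {"Node": "n1"},
         "Service": {"Address": "rmq-1.example", "Meta": {}}},
        {"Node": {"Node": "n2"}, "Service": {"Address": "", "Meta": None}},
    ]
    print(node_names(entries))  # ['rabbit@rmq-1.example', 'rabbit@n2']
```

Deduplicating by derived name is a design choice for this sketch only; whether the real backend merges duplicate registrations this way is not stated in the report.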
The second scenario has one big downside: what if RabbitMQ does not shut down cleanly? Then its service remains in the registry, which can leave the cluster unrecoverable. I actually ran into this scenario. Here is what happened:
I stopped the cluster. RabbitMQ did not have time to shut down properly, so the nodes were killed, leaving their services in the registry.
When I restarted all the nodes, they queried Consul, saw the leftover services (with status passing), and tried to join them.
This made RabbitMQ crash, because those hostnames no longer resolve: in my cluster, Docker services only have a resolvable FQDN while they are actually running.
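The crash described above comes from treating every passing entry as a joinable peer. One possible mitigation, sketched here in Python under stated assumptions, is to check that each peer's hostname resolves and to skip and report the ones that do not, rather than aborting. The function names `resolves` and `split_peers` are hypothetical; RabbitMQ's peer discovery is written in Erlang and may handle this differently.

```python
# Hypothetical sketch of the behaviour the report asks for: skip peers whose
# hostnames no longer resolve instead of failing the whole discovery round.
import socket
from typing import Callable, Iterable, List, Tuple


def resolves(host: str) -> bool:
    """True if the host has at least one address (IPv4 or IPv6)."""
    try:
        return bool(socket.getaddrinfo(host, None))
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def split_peers(
    node_names: Iterable[str],
    resolve: Callable[[str], bool] = resolves,
) -> Tuple[List[str], List[str]]:
    """Split 'name@host' node names into (resolvable, skipped)."""
    usable: List[str] = []
    skipped: List[str] = []
    for name in node_names:
        _, _, host = name.partition("@")  # Erlang node names are name@host
        if host and resolve(host):
            usable.append(name)
        else:
            skipped.append(name)  # e.g. a stale entry left by a killed node
    return usable, skipped


if __name__ == "__main__":
    # Fake resolver standing in for DNS: only two hosts are "running".
    running = {"rmq-1.example", "rmq-2.example"}
    usable, skipped = split_peers(
        ["rabbit@rmq-1.example", "rabbit@rmq-3.example", "rabbit@rmq-2.example"],
        resolve=running.__contains__,
    )
    print(usable)   # ['rabbit@rmq-1.example', 'rabbit@rmq-2.example']
    print(skipped)  # ['rabbit@rmq-3.example']
```

Injecting the resolver keeps the filtering logic testable without DNS. A real implementation would probably also need to retry, since a peer whose container is still starting may not have a resolvable FQDN yet either.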
michaelklishin changed the title from "[4.x] RabbitMQ cluster in unrecoverable state using Consul peer discovery" to "Consul peer discovery: nodes can leave behind a service record in case of an unresolvable address" on May 14, 2024.
Although I have a scenario I can work with, I would suggest:
cluster_formation.consul.svc_register = false
Expected behavior