-
RabbitMQ version used: 4.0.2
How is RabbitMQ deployed? Generic binary package
Steps to reproduce the behavior in question:

Hi RabbitMQ team,

This discussion is opened based on an existing question of mine posted on the rabbitmq-users group: https://groups.google.com/g/rabbitmq-users/c/odCIK3qGpVM/m/jQk2icciAgAJ.

We started experimenting with the new Khepri datastore provided with RabbitMQ 4.0.2 and noticed the following behavior during one of our tests: if we kill the current Khepri Raft leader by hard resetting its host (as reported by the log message "the RabbitMQ metadata store: detected a new leader..."), it can take as much as 12 seconds for Khepri to elect a new leader and start accepting requests again. On the other hand, election is very fast if the current leader is gracefully stopped (usually under 1 second).

We have some experience with Raft and know that it has timeout parameters that influence the leader election process, and we think we could tune these to make re-election faster when the previous leader goes down unexpectedly. However, looking at the RabbitMQ 4.0 docs at https://www.rabbitmq.com/docs/metadata-store, we could not find out whether this is possible.

So, the question is: does RabbitMQ allow fine-tuning the internal Khepri Raft parameters at all via some environment variables? If not, is it something already on the roadmap? We are asking because we use RabbitMQ in a safety-critical system and would need faster recovery times on average.

Thanks in advance
-
@Rmarian Raft-based features in RabbitMQ use `aten` for peer failure detection. Aten implements a probabilistic failure detection algorithm, so reducing the time is not a matter of setting a key to a lower value in seconds. Here are the settings available in […]. Just like with client connection heartbeats, very low threshold values (< 5s) are guaranteed to produce false positives in reasonably loaded production systems.
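For example, to see which `aten` settings a running node is actually using, a quick read-only sketch (assuming `rabbitmqctl` access to the node; this only inspects values and changes nothing):

```sh
# Dump the aten application environment on a running node (read-only).
# Key names and defaults may vary between RabbitMQ versions.
rabbitmqctl eval 'application:get_all_env(aten).'
```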
-
@Rmarian what is most likely happening here is that the way you are force terminating your node leaves a dangling TCP connection on the other nodes, so we have to rely on the `aten`-based failure detection instead of the more expedient Erlang monitors. Ra (the library Khepri is based on) favours leader stability over leader election latency during network partitions. I still would have thought 12s would be around the top end of what you should experience. What drives most of this latency is the `poll_interval` setting in the `aten` application. In RabbitMQ this is set to a conservative 5s, which affects quorum queues as well as Khepri. It can be lowered via the `raft.adaptive_failure_detector.poll_interval` configuration key. I would recommend against it, however, especially if you use quorum queues, as frequent elections caused by false positives also have a negative effect on your availability properties.
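If someone does want to experiment with this anyway, a minimal sketch (assuming the key above and a millisecond value; double-check the docs for your RabbitMQ version before relying on it):

```ini
# rabbitmq.conf: illustrative sketch, not a recommendation.
# Lowers the failure detector poll interval below RabbitMQ's conservative
# 5s default mentioned above; the value is assumed to be in milliseconds.
# Lower values make false positives, and therefore unnecessary leader
# elections, more likely.
raft.adaptive_failure_detector.poll_interval = 3000
```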