Make raft reconfiguration great again! #8386
-
It seems yes. Moreover, we may change the quorum in an on_commit trigger, not in on_replace. It's fine for A to commit the insertion of D and E after replicating it only to B. Even if C continues to think that the quorum is 2, no one will vote for it, because everyone else (A, B, D, E) has newer data (they know of the addition of D and E to _cluster, and of its CONFIRM).
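If it helps to picture the suggestion, here is a minimal sketch of reacting to _cluster changes from an on_commit trigger rather than directly in on_replace; the helper recompute_synchro_quorum() is hypothetical and only marks where the quorum update would happen:

```lua
-- Illustration only: count a new member towards the quorum only once its
-- insertion into _cluster is committed (i.e. confirmed), not on mere replace.
box.space._cluster:on_replace(function(old_tuple, new_tuple)
    box.on_commit(function()
        -- Runs only when the transaction is committed, so for a synchronous
        -- _cluster this happens after the CONFIRM.
        recompute_synchro_quorum() -- hypothetical helper
    end)
end)
```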
-
Sounds rather complex. Why not reuse the anonymous replica protocol fully?
-
Looks good enough. It seems the proposal solves both mentioned issues (#6281 and #6471).
-
I updated the RFC according to your comments. You can check which parts were updated right here:
-
Please check out why cockroachdb switched from simple configuration changes to joint configuration support, and why etcd did the same. You will need to support node replacement sooner or later. Joint configuration is the only way.
-
Switching _cluster to sync mode is the "simple configuration changes" approach implemented in etcd raft. You will have to support joint configuration changes anyway, and then you will be stuck supporting the legacy configuration changes mode.
-
Authors: Zheleztsov N., Petrenko S., Ostanevich S.
Associated tickets
Table of contents
Problem overview
Details on the current implementation
Solution: synchronous _cluster and anon replicas
Alternative solution
Problem overview
In production, cluster configuration isn't fixed, and occasionally it's necessary to change it, for example to replace failed servers or to change the degree of replication. Even though the current implementation of cluster reconfiguration works great with asynchronous replication, it is very unsafe to change the configuration with Raft enabled; moreover, the replica set is unavailable for synchronous writes during these changes.
The first issue which may happen during reconfiguration in such a case is split brain: two subgroups of nodes may simultaneously elect different leaders. Let's assume that our cluster has 3 nodes: A (master), B and C, and its synchro quorum is configured to the default N/2 + 1. We want to scale the cluster up to 5 nodes and add 2 more: D and E. If the new nodes successfully join via the master, then A, D and E know that the current synchro quorum is 5/2 + 1 = 3. As for B and C, which, let's assume, are network partitioned from the rest of the cluster, they think that the cluster still consists of A, B and C, so the synchro quorum for them is 3/2 + 1 = 2. As they don't see any master, an election will be started on one of these nodes (B or C); moreover, they will be able to finish this election successfully because of the wrong synchro quorum. So, in the end we will have two leaders: B or C, and A.
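As a back-of-the-envelope illustration of the mismatch (not part of the proposal), each node derives the synchro quorum from the number of _cluster members it currently sees, using the default N/2 + 1 formula:

```lua
-- Each node's view of the quorum depends on how many _cluster rows it sees.
local function synchro_quorum(n_members)
    return math.floor(n_members / 2) + 1
end

print(synchro_quorum(5)) --> 3: what A, D and E compute
print(synchro_quorum(3)) --> 2: what the partitioned B and C still compute
-- B and C can gather 2 votes and elect a leader among themselves, while A
-- keeps leading with D and E: two leaders in the same cluster.
```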
The second issue this RFC addresses is cluster unavailability for all synchronous writes during reconfiguration. Let's consider the case: node A is in the cluster, node B is a newly added one. If node A has a lot of data, then the join procedure of B will take a lot of time. Since B affects the synchro quorum of A but cannot confirm any writes until it is fully synced, A is unable to write new transactions to the cluster.
So, we need to introduce a reconfiguration algorithm which is safe and doesn't affect cluster availability.
Details on the current implementation of the reconfiguration process
Here's a basic overview of the current reconfiguration process. These details are needed to follow the changes we are proposing.
Let's assume that a new node is added to the cluster. It connects to the master and fetches a snapshot (this is called the initial join stage). As soon as the initial join ends, the master inserts the new node into the _cluster space, which affects the synchro quorum. The _cluster space is asynchronously replicated to the rest of the nodes. After inserting the new row into _cluster, the master continues the join procedure and starts replicating all the data which wasn't included in the snapshot (the final join stage). Currently, several join procedures may take place at the same time.
When the join procedure ends, the replica sends a SUBSCRIBE request to the master, which in turn sends its current vclock as a marker of the master's "current" state. When the replica fetches rows up to that position (the sync stage), it enters read-write mode, and the usual replication flow starts.
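For reference, here is a toy, self-contained Lua model of the join/subscribe sequence described above; all names are made up for illustration, and the real implementation lives in Tarantool's C relay/applier code:

```lua
-- Toy model of the current join flow; not actual Tarantool internals.
local master = {
    cluster = {},  -- stands in for the _cluster space
    vclock = 42,   -- stands in for box.info.vclock
}

function master:join(replica)
    print('send snapshot to', replica)            -- initial join stage
    table.insert(self.cluster, replica)           -- _cluster insert: quorum grows here
    print('send WAL rows written after snapshot') -- final join stage
end

function master:subscribe(replica)
    print('send vclock', self.vclock, 'to', replica) -- marker of the "current" state
    print('send rows up to that vclock')             -- sync stage
    -- after this the replica enters read-write mode and normal replication starts
end

master:join('replica-B')
master:subscribe('replica-B')
```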
Solution: synchronous _cluster and anon replicas
Overview
Safety
Description
If a single configuration change adds or removes many servers asynchronously, switching the cluster directly from the old configuration to the new one can be unsafe; it isn’t possible to atomically switch all of the servers at once, so the cluster can potentially split into two independent majorities during the transition (see Figure 1).
In the example depicted in Figure 1, the cluster grows from three servers to five. Because the servers switch at different times, there is a point in time where two different leaders can be elected for the same term: one with a majority of the old configuration and another with a majority of the new configuration.
The simplest way to overcome this problem is adding new nodes one by one, synchronously. In that case any majority of the old cluster overlaps with any majority of the new cluster (see Figure 2). This overlap prevents the cluster from splitting into two independent majorities.
So, since we never insert several rows into the _cluster space during one join, we only need to implement the transition of the _cluster space between synchronous and asynchronous states. Moreover, it is enough to commit updates to the _cluster space with the old synchro quorum, which doesn't take the new replica into account.
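To make the overlap argument concrete, here is a small brute-force check in plain Lua showing that every majority of the old configuration {A, B, C} intersects every majority of the new configuration {A, B, C, D}:

```lua
-- Enumerate all majorities of a node list and check pairwise intersection.
local function majorities(nodes)
    local quorum = math.floor(#nodes / 2) + 1
    local result = {}
    for mask = 0, 2 ^ #nodes - 1 do
        local set, size, m = {}, 0, mask
        for _, node in ipairs(nodes) do
            if m % 2 == 1 then
                set[node] = true
                size = size + 1
            end
            m = math.floor(m / 2)
        end
        if size >= quorum then
            table.insert(result, set)
        end
    end
    return result
end

for _, old in ipairs(majorities({'A', 'B', 'C'})) do
    for _, new in ipairs(majorities({'A', 'B', 'C', 'D'})) do
        local overlap = false
        for node in pairs(old) do
            if new[node] then overlap = true end
        end
        assert(overlap, 'found disjoint majorities')
    end
end
print('every old majority overlaps every new majority')
```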
Why the old synchro quorum is enough
Assume we add replica D to a replica set which consists of replicas A, B and C, and the connection between A and C is broken. A inserts a record about adding D into _cluster; this record goes to the limbo and is replicated to B. B replies to A that it has successfully written it, and A confirms adding D to the cluster. If the connection between A and B breaks at this moment, then B and C will have a synchro quorum of 2. They will start an election and will be able to finish it. As for A and D, their synchro quorum is 3, but they won't be able to commit any transaction. A abdicates its leadership, and the cluster of B and C continues to work.
If the connection between A and B doesn't break at this point, then A, B and D have a quorum of 3. C, whose quorum is 2, won't be able to become the leader. The cluster of A, B and D continues to work.
Implementation details
From the user's point of view
We need to decide when the cluster's transition between the asynchronous and synchronous states happens. We can make it sync as soon as PROMOTE is issued (either from an explicit box.ctl.promote() or from Raft's side). As for reverting to async, it should only be done on a box.ctl.demote() call.
Since adding the is_sync option to a space is done by updating a row in the _space space, the change will be replicated to the whole cluster as the second write to the WAL (right after PROMOTE) as soon as the leader changes.
The only way to return the cluster to a multi-master configuration is box.ctl.demote(). So, we should revert the is_sync option of _cluster at this stage. This must be done right before DEMOTE is written.
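The proposal builds on the existing per-space is_sync option. For a user space the switch looks as below (a sketch with a hypothetical space name; for _cluster the leader would perform the equivalent change internally, right after PROMOTE and right before DEMOTE, rather than the user calling it by hand):

```lua
-- Demonstration of the is_sync switch on an ordinary space.
local s = box.schema.space.create('example', {if_not_exists = true})
s:create_index('pk', {if_not_exists = true})

s:alter{is_sync = true}   -- writes to 'example' now go through the limbo
s:alter{is_sync = false}  -- back to asynchronous replication
```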
Availability
The problem with the current join procedure is that _cluster is updated as soon as the replica fetches the snapshot. If snapshots are not made very often in the setup, then the synchronous cluster is unavailable until the replica sends its SUBSCRIBE request, i.e. during the final join stage (replication of the WAL rows) and the time between join and subscribe.
In order to avoid this availability gap, we may join all replicas as anonymous ones, so they won't be inserted into the _cluster space during join or subscribe. When their lag is small enough, they register themselves and subscribe as normal replicas. Even though this slightly increases the join time of a replica (even if the cluster is asynchronous), it allows us to avoid synchronous cluster unavailability during reconfiguration.
Implementation details
Let's reuse the protocol of anonymous replicas: replicas issue FETCH_SNAPSHOT instead of JOIN, like anonymous ones do, then they sync with the master (during an anonymous SUBSCRIBE), then, once they see their lag is small enough, they send a REGISTER request, and finally a normal SUBSCRIBE.
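For comparison, the same flow expressed with today's public configuration options looks roughly as follows (the URIs and ports are placeholders):

```lua
-- 1. Join without a _cluster registration: the replica sends FETCH_SNAPSHOT
--    and then an anonymous SUBSCRIBE.
box.cfg{
    listen = 3302,
    replication = 'replicator:password@master:3301',
    replication_anon = true,
    read_only = true,  -- anonymous replicas must stay read-only
}

-- 2. Once the replication lag is small enough, register in _cluster
--    (REGISTER request) and re-subscribe as a normal replica.
box.cfg{replication_anon = false}
```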
Summary
Safety. If Raft is enabled (election_mode is anything other than off), _cluster becomes synchronous as soon as PROMOTE is issued (either from an explicit box.ctl.promote() or from Raft). Only box.ctl.demote() reverts the _cluster space to the asynchronous state. The _cluster space is updated after commit.
Availability. All replicas are joined as anonymous ones, after which they transition to the normal state once they're synced.
Alternative solution
We may also implement joint consensus, as described in the Raft paper, which proposes making configuration changes in several steps.
It's much more complex, but it allows updating several nodes in one transaction. We never do that: only one row is inserted into the _cluster space at every join. Moreover, this solution needs significant changes to how changes to a synchronous space are confirmed. And, of course, it would still require a synchronous _cluster space. I don't see a point in doing that.