Make raft reconfiguration great again! #8386
-
It seems yes. Moreover, we may change the quorum in an on_commit trigger, not in on_replace. It's fine for A to commit the insertion of D and E after replicating it only to B. Even if C continues to think that the quorum is 2, no one will vote for it, because everyone else (A, B, D, E) has newer data (they know of the addition of D and E to _cluster, and of its CONFIRM).
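If it helps to picture the suggestion, here is a minimal sketch of reacting to _cluster changes from an on_commit trigger rather than directly in on_replace; the helper recompute_synchro_quorum() is hypothetical and only marks where the quorum update would happen:

```lua
-- Illustration only: count a new member towards the quorum only once its
-- insertion into _cluster is committed (i.e. confirmed), not on mere replace.
box.space._cluster:on_replace(function(old_tuple, new_tuple)
    box.on_commit(function()
        -- Runs only when the transaction is committed, so for a synchronous
        -- _cluster this happens after the CONFIRM.
        recompute_synchro_quorum() -- hypothetical helper
    end)
end)
```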
-
Sounds rather complex. Why not reuse the anonymous replica protocol fully?
-
Looks good enough. It seems the proposal solves both mentioned issues (#6281 and #6471).
-
I updated the RFC according to your comments. You can check which parts were updated right here:
-
Please check out why cockroachdb switched from simple configuration changes to joint configuration support, and why etcd did the same. You will need to support node replacement sooner or later. Joint configuration is the only way.
-
Switching _cluster to sync mode is the "simple configuration changes" approach implemented in etcd raft. You will have to support joint configuration changes anyway, and then you will be stuck supporting the legacy configuration changes mode.
-
Authors: Zheleztsov N., Petrenko S., Ostanevich S.
Associated tickets
Table of contents
Problem overview
Details on the current implementation
Solution: synchronous _cluster and anon replicas
Alternative solution
Problem overview
In production, cluster configuration isn't fixed, and occasionally it's necessary to change it, for example to replace failed servers or to change the degree of replication. Even though the current implementation of cluster reconfiguration works great with asynchronous replication, it is very unsafe to change the configuration with Raft enabled; moreover, the replica set is unavailable for synchronous writes during these changes.
The first issue which may happen during reconfiguration in such a case is split brain: two subgroups of nodes may simultaneously elect different leaders. Let's assume that our cluster has 3 nodes: A (master), B and C, and its synchro quorum is configured to the default N/2 + 1. We want to scale the cluster up to 5 nodes and add 2 more: D and E. If the new nodes successfully join via the master, then A, D and E know that the current synchro quorum is 5/2 + 1 = 3. As for B and C, which, let's assume, are network partitioned from the rest of the cluster, they think that the cluster still consists of A, B and C, so the synchro quorum for them is 3/2 + 1 = 2. As they don't see any master, an election will be started on one of these nodes (B or C); moreover, they will be able to finish this election successfully because of the wrong synchro quorum. So, in the end we will have two leaders: B or C, and A.
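As a back-of-the-envelope illustration of the mismatch (not part of the proposal), each node derives the synchro quorum from the number of _cluster members it currently sees, using the default N/2 + 1 formula:

```lua
-- Each node's view of the quorum depends on how many _cluster rows it sees.
local function synchro_quorum(n_members)
    return math.floor(n_members / 2) + 1
end

print(synchro_quorum(5)) --> 3: what A, D and E compute
print(synchro_quorum(3)) --> 2: what the partitioned B and C still compute
-- B and C can gather 2 votes and elect a leader among themselves, while A
-- keeps leading with D and E: two leaders in the same cluster.
```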
The second issue this RFC addresses is cluster unavailability for all synchronous writes during reconfiguration. Let's consider the case: node A is in the cluster, node B is a newly added one. If node A has a lot of data, then the join procedure of B will take a lot of time. Since B affects the synchro quorum of A but cannot confirm any writes until it is fully synced, A is unable to write new transactions to the cluster.
So, we need to introduce a reconfiguration algorithm which is safe and doesn't affect cluster availability.
Details on the current implementation of the reconfiguration process
Here's a basic overview of the current reconfiguration process. These details are needed to follow the changes we are proposing.
Let's assume that a new node is added to the cluster. It connects to the master and fetches a snapshot (this is called the initial join stage). As soon as the initial join ends, the master inserts the new node into the _cluster space, which affects the synchro quorum. The _cluster space is asynchronously replicated to the rest of the nodes. After inserting the new row into _cluster, the master continues the join procedure and starts replicating all the data which wasn't included in the snapshot (the final join stage). Currently, several join procedures may take place at the same time.
When the join procedure ends, the replica sends a SUBSCRIBE request to the master, which in turn sends its current vclock as a marker of the master's "current" state. When the replica fetches rows up to that position (the sync stage), it enters read-write mode, and the usual replication flow starts.
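For reference, here is a toy, self-contained Lua model of the join/subscribe sequence described above; all names are made up for illustration, and the real implementation lives in Tarantool's C relay/applier code:

```lua
-- Toy model of the current join flow; not actual Tarantool internals.
local master = {
    cluster = {},  -- stands in for the _cluster space
    vclock = 42,   -- stands in for box.info.vclock
}

function master:join(replica)
    print('send snapshot to', replica)            -- initial join stage
    table.insert(self.cluster, replica)           -- _cluster insert: quorum grows here
    print('send WAL rows written after snapshot') -- final join stage
end

function master:subscribe(replica)
    print('send vclock', self.vclock, 'to', replica) -- marker of the "current" state
    print('send rows up to that vclock')             -- sync stage
    -- after this the replica enters read-write mode and normal replication starts
end

master:join('replica-B')
master:subscribe('replica-B')
```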
Solution: synchronous _cluster and anon replicas
Overview
Safety
Description
If a single configuration change adds or removes many servers asynchronously, switching the cluster directly from the old configuration to the new one can be unsafe; it isn’t possible to atomically switch all of the servers at once, so the cluster can potentially split into two independent majorities during the transition (see Figure 1).
In the example depicted in Figure 1, the cluster grows from three servers to five. Because the servers switch at different times, there is a point in time where two different leaders can be elected for the same term: one with a majority of the old configuration and another with a majority of the new configuration.
The simplest way to overcome this problem is adding new nodes one by one, synchronously. In that case any majority of the old cluster overlaps with any majority of the new cluster (see Figure 2). This overlap prevents the cluster from splitting into two independent majorities.
So, since we never insert several rows into the _cluster space during one join, we only need to implement the transition of the _cluster space between synchronous and asynchronous states. Moreover, it is enough to commit updates to the _cluster space with the old synchro quorum, which doesn't take the new replica into account.
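To make the overlap argument concrete, here is a small brute-force check in plain Lua showing that every majority of the old configuration {A, B, C} intersects every majority of the new configuration {A, B, C, D}:

```lua
-- Enumerate all majorities of a node list and check pairwise intersection.
local function majorities(nodes)
    local quorum = math.floor(#nodes / 2) + 1
    local result = {}
    for mask = 0, 2 ^ #nodes - 1 do
        local set, size, m = {}, 0, mask
        for _, node in ipairs(nodes) do
            if m % 2 == 1 then
                set[node] = true
                size = size + 1
            end
            m = math.floor(m / 2)
        end
        if size >= quorum then
            table.insert(result, set)
        end
    end
    return result
end

for _, old in ipairs(majorities({'A', 'B', 'C'})) do
    for _, new in ipairs(majorities({'A', 'B', 'C', 'D'})) do
        local overlap = false
        for node in pairs(old) do
            if new[node] then overlap = true end
        end
        assert(overlap, 'found disjoint majorities')
    end
end
print('every old majority overlaps every new majority')
```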
Why the old synchro quorum is enough
Assume we add replica D to a replica set which consists of replicas A, B and C, and the connection between A and C is broken. A inserts a record about adding D into _cluster; this record goes to the limbo and is replicated to B. B replies to A that it has successfully written it, and A confirms adding D to the cluster. If the connection between A and B breaks at this moment, then B and C will have a synchro quorum of 2. They will start an election and will be able to finish it. As for A and D, their synchro quorum is 3, but they won't be able to commit any transaction. A abdicates its leadership, and the cluster of B and C continues to work.
If the connection between A and B doesn't break at this point, then A, B and D have a quorum of 3. C, whose quorum is 2, won't be able to become the leader. The cluster of A, B and D continues to work.
Implementation details
From the user's point of view
We need to decide when the cluster's transition between the asynchronous and synchronous states happens. We can make it sync as soon as PROMOTE is issued (either from an explicit box.ctl.promote() or from Raft's side). As for reverting to async, it should only be done on a box.ctl.demote() call.
Since adding the is_sync option to a space is done by updating a row in the _space space, the change will be replicated to the whole cluster as the second write to the WAL (right after PROMOTE) as soon as the leader changes.
The only way to return the cluster to a multi-master configuration is box.ctl.demote(). So, we should revert the is_sync option of _cluster at this stage. This must be done right before DEMOTE is written.
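The proposal builds on the existing per-space is_sync option. For a user space the switch looks as below (a sketch with a hypothetical space name; for _cluster the leader would perform the equivalent change internally, right after PROMOTE and right before DEMOTE, rather than the user calling it by hand):

```lua
-- Demonstration of the is_sync switch on an ordinary space.
local s = box.schema.space.create('example', {if_not_exists = true})
s:create_index('pk', {if_not_exists = true})

s:alter{is_sync = true}   -- writes to 'example' now go through the limbo
s:alter{is_sync = false}  -- back to asynchronous replication
```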
Availability
The problem with the current join procedure is that _cluster is updated as soon as the replica fetches the snapshot. If snapshots are not made very often in the setup, then the synchronous cluster is unavailable until the replica sends its SUBSCRIBE request, i.e. during the final join stage (replication of the WAL rows) and the time between join and subscribe.
In order to avoid this availability gap, we may join all replicas as anonymous ones, so they won't be inserted into the _cluster space during join or subscribe. When their lag is small enough, they register themselves and subscribe as normal replicas. Even though this slightly increases the join time of a replica (even if the cluster is asynchronous), it allows us to avoid synchronous cluster unavailability during reconfiguration.
Implementation details
Let's reuse the protocol of anonymous replicas: replicas issue FETCH_SNAPSHOT instead of JOIN, like anonymous ones do, then they sync with the master (during an anonymous SUBSCRIBE), then, once they see their lag is small enough, they send a REGISTER request, and finally a normal SUBSCRIBE.
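For comparison, the same flow expressed with today's public configuration options looks roughly as follows (the URIs and ports are placeholders):

```lua
-- 1. Join without a _cluster registration: the replica sends FETCH_SNAPSHOT
--    and then an anonymous SUBSCRIBE.
box.cfg{
    listen = 3302,
    replication = 'replicator:password@master:3301',
    replication_anon = true,
    read_only = true,  -- anonymous replicas must stay read-only
}

-- 2. Once the replication lag is small enough, register in _cluster
--    (REGISTER request) and re-subscribe as a normal replica.
box.cfg{replication_anon = false}
```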
Summary
Safety. If Raft is enabled (election_mode is anything other than off), _cluster becomes synchronous as soon as PROMOTE is issued (either from an explicit box.ctl.promote() or from Raft). Only box.ctl.demote() reverts the _cluster space to the asynchronous state. The _cluster space is updated after commit.
Availability. All replicas are joined as anonymous ones, after which they transition to the normal state once they're synced.
Alternative solution
We may also implement joint consensus, as described in the Raft paper, which proposes making configuration changes in several steps.
It's much more complex, but it allows updating several nodes in one transaction. We never do that: only one row is inserted into the _cluster space at every join. Moreover, this solution needs significant changes to how changes to a synchronous space are confirmed. And, of course, it would still require a synchronous _cluster space. I don't see a point in doing that.