Updating manual and design

* Updates `PropsToAsciidoc`, closes #120. * Add entry for ELECTION2 in manual. * Create design document for ELECTION2. * Synchronize documentation with version from site.
jgroups-extras · Nov 22, 2023 · 596488c · 596488c
1 parent 1a1ed70
commit 596488c
Show file tree

Hide file tree

Showing 9 changed files with 359 additions and 41 deletions.
diff --git a/build.xml b/build.xml
@@ -353,7 +353,7 @@
             <mkdir dir="@{dir}/build"/>
             <echo message="Generating index.html in @{dir}/build"/>
             <exec executable="${asciidoc.executable}" dir="@{dir}">
-                <arg line="-n -a source-highlighter=highlightjs -a stylesheet=${asciidoctor-style} -a icons -o @{dir}/build/index.html @{input}"/>
+                <arg line="-r asciidoctor-diagram -n -a source-highlighter=highlightjs -a stylesheet=${asciidoctor-style} -a icons -o @{dir}/build/index.html @{input}"/>
             </exec>
         </sequential>
     </macrodef>

diff --git a/doc/design/Election2.adoc b/doc/design/Election2.adoc
@@ -0,0 +1,128 @@
+= ELECTION2 Design
+José Bolina <jbolina@redhat.com>
+:description: The design of the ELECTION2 algorithm for leader election.
+:homepage: http://belaban.github.io/jgroups-raft
+
+`ELECTION2` is an alternative election algorithm.
+It builds on the base `ELECTION` algorithm to include a pre-voting mechanism that verifies the need to start an election round.
+The `ELECTION2` creates an enhanced verification with the view changes approach of the base algorithm, resulting in a more robust and stable leadership decision.
+
+This document is a follow-up to the implementation.
+More details are available on the GitHub issue https://github.com/jgroups-extras/jgroups-raft/issues/211[#211].
+
+== Context
+
+First, to give some context to the problem and the proposed solution.
+In the following section, we delve into implementation details.
+
+
+=== The reason for another election algorithm
+
+The base leader election algorithm uses the view changes from JGroups to implement a more stable leader election.
+Without relying solely on heartbeats, the cluster has fewer disruptions with unnecessary election rounds.
+However, there are corner cases in which the cluster does not recover after a partition, leading to a liveness issue.
+
+On a more concrete example, playing with partially connected servers triggers the original issue with `ELECTION`.
+Initially, a fully connected cluster with the following settings:
+
+[ditaa]
+----
+                                  [A](5)[A,B,C,D,E]
+
+ +----------------------------------------------+
+ |                                              |
+ | +---------------------------------+          |
+ | |                                 |          |
+ | | +--------------------+          |          |
+ | | |                    |          |          |
+ v v v                    v          v          v
++-----+     +-----+    +-----+    +-----+    +-----+
+|L    |     |     |    |     |    |     |    |     |
+|  A  |<--->|  B  |<-->|  C  |<-->|  D  |<-->|  E  |
+|     |     |     |    |     |    |     |    |     |
++-----+     +-----+    +-----+    +-----+    +-----+
+             ^   ^        ^          ^         ^  ^
+             |   |        |          |         |  |
+             |   |        +----------+---------+  |
+             |   |                   |            |
+             |   +-------------------+            |
+             |                                    |
+             +------------------------------------+
+----
+
+The current leader is node A, and the whole cluster has the same view __V~1~__=`[A](5)[A,B,C,D,E]`.
+After a network partition, we have a partially connected cluster with the following format:
+
+[ditaa]
+----
+   +----------------------+
+   |                      |
+   v                      v
++-----+     +-----+    +-----+    +-----+    +-----+
+|     |     |     |    |     |    |     |    |     |
+|  A  |     |  B  |<-->|  C  |<-->|  D  |    |  E  |
+|     |     |     |    |     |    |     |    |     |
++-----+     +-----+    +-----+    +-----+    +-----+
+                          ^                     ^
+                          |                     |
+                          +---------------------+
+----
+
+In the worst case, some nodes can receive a view that is not including itself.
+For the sake of the example, let's assume this happens with node E.
+Node E receives a view update __V~2~__=`[A](2)[A,C]` from C and continues with __V~1~__ instead of updating.
+Therefore, it would still see node A as the leader.
+After the network recovers, if node E is the coordinator, it won't identify any change, leading to the election never starting.
+For example:
+
+----
+[DEBUG] ELECTION: E: existing view: [A|4] (5) [A, B, C, D, E], new view: MergeView::[E|10] (5) [E, B, D, C, A], 3 subgroups: [A|8] (3) [A, C, B], [A|9] (2) [A, C], [A|4] (5) [A, B, C, D, E], result: no_change
+----
+
+The partial connectivity could also disrupt the cluster, where nodes could trigger the ELECTION algorithm, rapidly incrementing the term even though it can't acquire the necessary votes.
+The partial connectivity is an actual production problem <<cloudflare-outage>>.
+The liveness problem is a problem specific to our implementation.
+
+=== The proposal
+
+Implementing the Raft election by the book <<ongaro-dissertation>> involves timeouts as a simplified failure-detector.
+It is easy to see how partial connectivity could lead to nodes repeatedly disrupting the cluster with election rounds.
+To solve our liveness issue, we need to relax the restriction to start an election round without creating something unbounded and too disruptive.
+
+Ongaro's dissertation <<ongaro-dissertation>> contains the pre-voting mechanism to strengthen the election and reduce disruptions.
+The pre-voting is an initial verification a node executes before starting an election.
+The node sends a message to verify if the other members would vote for it in case an election happens.
+We adopt the pre-voting mechanism to solve the liveness issue.
+
+== Implementation
+
+The pre-vote is implemented in the `ELECTION2` class, leaving the `ELECTION` unchanged.
+This separation should not be a problem, as `ELECTION2` only covers corner cases from the `ELECTION`, which users should not find easily.
+As a rule of thumb, stick with `ELECTION`, but if deploying in an environment facing many prolonged network partitions, consider `ELECTION2`.
+
+In our implementation, only the view coordinator executes the pre-voting.
+Only a single pre-voting phase is necessary to run as many election rounds as needed.
+After the pre-voting, everything follows the base ELECTION algorithm.
+
+The coordinator running the pre-vote sends a _PreVoteRequest_ to all members of the current view.
+After receiving a _PreVoteResponse(Leader)_ from all the nodes, the coordinator executes the verification to start the ELECTION algorithm.
+Each node replies with its current leader, and the coordinator parses the response and runs the ELECTION algorithm when:
+
+* The majority does not have a leader. Or;
+* The majority has the same suspected leader as the coordinator.
+
+In case there is a majority of nodes is seeing a leader __L__:
+
+* __L__ is not in the coordinator's view;
+* __L__ is in the view but replied it does not see itself as the leader.
+
+Oversimplifying, the majority has a suspected leader or sees a leader that stepped down.
+Meeting any of these requirements, the coordinator proceeds and starts the ELECTION algorithm.
+
+The pre-voting phase of `ELECTION2` adds a delay before starting the leader election algorithm, which is the downside of `ELECTION2`.
+In addition to the algorithm based on view changes, `ELECTION2` should provide a robust and stable leader election algorithm, covering the issue of `ELECTION`.
+
+== References
+
+* [[[cloudflare-outage,1]]] Lianza, T., &amp; Snook, C. (2020, November 27). A Byzantine failure in the real world. The Cloudflare Blog. https://blog.cloudflare.com/a-byzantine-failure-in-the-real-world/
+* [[[ongaro-dissertation,2]]] Ongaro, D. (2014). Consensus: Bridging theory and practice. Stanford University.
diff --git a/doc/manual/blocks.adoc b/doc/manual/blocks.adoc
@@ -40,8 +40,19 @@ protected void start(String raft_id) throws Exception {
 
 There's a demo `ReplicatedStateMachineDemo` which can be used to interactively use `ReplicatedStateMachine`.
 
+==== Reads and consensus
+
+`ReplicatedStateMachine` has a configurable behavior for reads, using quorum reads by default. This means that all read
+requests are sent through `RAFT` and require a majority to complete. This provides a linearizable read, but, on the other
+hand, it takes longer to complete.
 
+If an application does not require linearizable reads, it can change the behavior and only read the value locally,
+possibly stale, but having a faster response since the request does not go through `RAFT`.
 
+The behavior is changed by calling `ReplicatedStateMachine.allowDirtyReads(boolean)`.
+
+NOTE: A quorum read creates an entry in the persistent log. See
+https://github.com/belaban/jgroups-raft/issues/18[Issue 18] for details.
 
 [[CounterService]]
 === CounterService

diff --git a/doc/manual/manual.adoc b/doc/manual/manual.adoc
@@ -9,7 +9,7 @@ Distributed consensus with jgroups-raft
 :source-highlighter: pygments
 :iconsdir: ./images/icons
 
-Copyright Red Hat 2014 - 2020
+Copyright Red Hat 2014 - 2023
 
 This document is licensed under the
 http://creativecommons.org/licenses/by-sa/3.0/us/legalcode["Creative Commons Attribution-ShareAlike (CC-BY-SA) 3.0"]

diff --git a/doc/manual/overview.adoc b/doc/manual/overview.adoc
@@ -13,7 +13,8 @@ machines receive the same ordered stream of updates and thus have the exact same
 
 Raft favors _consistency_ over _availability_; in terms of the http://en.wikipedia.org/wiki/CAP_theorem[Cap theorem],
 jgroups-raft is a CP system. This means jgroups-raft is highly consistent, and the data replicated to nodes will never
-diverge, even in the face of network partitions (split brains).
+diverge, even in the face of network partitions (split brains), or restarts. Or, on an extended version, jgroups-raft
+provides the means to build PC/EC systems concerning the https://en.wikipedia.org/wiki/PACELC_theorem[PACELC theorem].
 
 In case of a network partition, in a cluster of `N` nodes, at least `N/2+1` nodes have to be running for the
 system to be available.

diff --git a/doc/manual/protocols.adoc b/doc/manual/protocols.adoc
@@ -69,6 +69,21 @@ Its attributes define the election timeout and the heartbeat interval (see Raft
 ${ELECTION}
 
 
+[[ELECTION2]]
+=== ELECTION2
+
+`ELECTION2` is an alternative election algorithm.
+It builds on top of <<ELECTION>> to include a pre-vote mechanism.
+The pre-vote runs before delegating to the algorithm of <<ELECTION>>.
+
+By design, <<ELECTION>> uses view changes to start election rounds and should be stable without interruptions.
+`ELECTION2` is an alternative in networks with recurrent partitions that could lead to more disruptions with unnecessary election rounds.
+More information about how it works is available in the design documents.
+
+
+${ELECTION2}
+
+
 [[RAFT]]
 === RAFT
 

diff --git a/doc/manual/using.adoc b/doc/manual/using.adoc
@@ -1,6 +1,20 @@
 
 == Using jgroups-raft
 
+Since jgroups-raft build on JGroups, we follow similar http://www.jgroups.org/manual5/index.html#Requirements[requirements].
+Currently, jgroups-raft requires JDK 11. jgroups-raft can be included in the POM:
+
+[source,xml]
+----
+<dependency>
+    <groupId>org.jgroups</groupId>
+    <artifactId>jgroups-raft</artifactId>
+    <version>X.Y.Z.Final</version>
+</dependency>
+----
+
+Where `X`=major, `Y`=minor, and `Z`=patch.
+The tags are available on https://github.com/belaban/jgroups-raft/tags[GitHub releases].
 
 === Cluster members and identity
 
@@ -133,6 +147,12 @@ grow to until a snapshot is taken.
 
 Other methods in `RaftHandle` include:
 
+addServer(String server):: Asynchronously adds a new member to the cluster with id `server` and returns a
+`CompletableFuture<byte[]>`. See more about membership change in <<DynamicMembership>>.
+
+removeServer(String server):: Asynchronously removes a member from the cluster with id `server` and returns a
+`CompletableFuture<byte[]>`. See more about membership change in <<DynamicMembership>>.
+
 leader():: Returns the address of the current Raft leader, or null if there is no leader (e.g. in case there was no
            majority to elect a leader)
 isLeader():: Whether or not the current member is the leader
@@ -237,7 +257,8 @@ The steps to add a member are as follows (say we have `RAFT.members="A,B,C"` and
 * If not, members will read the correct membership when getting initialized by their logs
 * A new member `D` can now be started (its XML config needs to have the correct `members` attribute !)
 
-
+Notice that membership changes survive through restarts. If a node must be removed or added, an operation must be
+submitted, only restarting does not affect membership.
 
 
 
diff --git a/src/org/jgroups/protocols/raft/RAFT.java b/src/org/jgroups/protocols/raft/RAFT.java
@@ -319,7 +319,8 @@ public RAFT members(Collection<String> list) {
         return this;
     }
 
-    @Property public List<String> members() {
+    @ManagedAttribute(description = "The current list of members")
+    public List<String> members() {
         return internal_state.getMembers();
     }