Skip to content

Commit

Permalink
Updating manual and design
Browse files Browse the repository at this point in the history
* Updates `PropsToAsciidoc`, closes #120.
* Add entry for ELECTION2 in manual.
* Create design document for ELECTION2.
* Synchronize documentation with version from site.
  • Loading branch information
jabolina committed Nov 22, 2023
1 parent 1a1ed70 commit 596488c
Show file tree
Hide file tree
Showing 9 changed files with 359 additions and 41 deletions.
2 changes: 1 addition & 1 deletion build.xml
Original file line number Diff line number Diff line change
Expand Up @@ -353,7 +353,7 @@
<mkdir dir="@{dir}/build"/>
<echo message="Generating index.html in @{dir}/build"/>
<exec executable="${asciidoc.executable}" dir="@{dir}">
<arg line="-n -a source-highlighter=highlightjs -a stylesheet=${asciidoctor-style} -a icons -o @{dir}/build/index.html @{input}"/>
<arg line="-r asciidoctor-diagram -n -a source-highlighter=highlightjs -a stylesheet=${asciidoctor-style} -a icons -o @{dir}/build/index.html @{input}"/>
</exec>
</sequential>
</macrodef>
Expand Down
128 changes: 128 additions & 0 deletions doc/design/Election2.adoc
Original file line number Diff line number Diff line change
@@ -0,0 +1,128 @@
= ELECTION2 Design
José Bolina <jbolina@redhat.com>
:description: The design of the ELECTION2 algorithm for leader election.
:homepage: http://belaban.github.io/jgroups-raft

`ELECTION2` is an alternative election algorithm.
It builds on the base `ELECTION` algorithm to include a pre-voting mechanism that verifies the need to start an election round.
The `ELECTION2` creates an enhanced verification with the view changes approach of the base algorithm, resulting in a more robust and stable leadership decision.

This document is a follow-up to the implementation.
More details are available on the GitHub issue https://github.com/jgroups-extras/jgroups-raft/issues/211[#211].

== Context

First, to give some context to the problem and the proposed solution.
In the following section, we delve into implementation details.


=== The reason for another election algorithm

The base leader election algorithm uses the view changes from JGroups to implement a more stable leader election.
Without relying solely on heartbeats, the cluster has fewer disruptions with unnecessary election rounds.
However, there are corner cases in which the cluster does not recover after a partition, leading to a liveness issue.

On a more concrete example, playing with partially connected servers triggers the original issue with `ELECTION`.
Initially, a fully connected cluster with the following settings:

[ditaa]
----
[A](5)[A,B,C,D,E]
+----------------------------------------------+
| |
| +---------------------------------+ |
| | | |
| | +--------------------+ | |
| | | | | |
v v v v v v
+-----+ +-----+ +-----+ +-----+ +-----+
|L | | | | | | | | |
| A |<--->| B |<-->| C |<-->| D |<-->| E |
| | | | | | | | | |
+-----+ +-----+ +-----+ +-----+ +-----+
^ ^ ^ ^ ^ ^
| | | | | |
| | +----------+---------+ |
| | | |
| +-------------------+ |
| |
+------------------------------------+
----

The current leader is node A, and the whole cluster has the same view __V~1~__=`[A](5)[A,B,C,D,E]`.
After a network partition, we have a partially connected cluster with the following format:

[ditaa]
----
+----------------------+
| |
v v
+-----+ +-----+ +-----+ +-----+ +-----+
| | | | | | | | | |
| A | | B |<-->| C |<-->| D | | E |
| | | | | | | | | |
+-----+ +-----+ +-----+ +-----+ +-----+
^ ^
| |
+---------------------+
----

In the worst case, some nodes can receive a view that is not including itself.
For the sake of the example, let's assume this happens with node E.
Node E receives a view update __V~2~__=`[A](2)[A,C]` from C and continues with __V~1~__ instead of updating.
Therefore, it would still see node A as the leader.
After the network recovers, if node E is the coordinator, it won't identify any change, leading to the election never starting.
For example:

----
[DEBUG] ELECTION: E: existing view: [A|4] (5) [A, B, C, D, E], new view: MergeView::[E|10] (5) [E, B, D, C, A], 3 subgroups: [A|8] (3) [A, C, B], [A|9] (2) [A, C], [A|4] (5) [A, B, C, D, E], result: no_change
----

The partial connectivity could also disrupt the cluster, where nodes could trigger the ELECTION algorithm, rapidly incrementing the term even though it can't acquire the necessary votes.
The partial connectivity is an actual production problem <<cloudflare-outage>>.
The liveness problem is a problem specific to our implementation.

=== The proposal

Implementing the Raft election by the book <<ongaro-dissertation>> involves timeouts as a simplified failure-detector.
It is easy to see how partial connectivity could lead to nodes repeatedly disrupting the cluster with election rounds.
To solve our liveness issue, we need to relax the restriction to start an election round without creating something unbounded and too disruptive.

Ongaro's dissertation <<ongaro-dissertation>> contains the pre-voting mechanism to strengthen the election and reduce disruptions.
The pre-voting is an initial verification a node executes before starting an election.
The node sends a message to verify if the other members would vote for it in case an election happens.
We adopt the pre-voting mechanism to solve the liveness issue.

== Implementation

The pre-vote is implemented in the `ELECTION2` class, leaving the `ELECTION` unchanged.
This separation should not be a problem, as `ELECTION2` only covers corner cases from the `ELECTION`, which users should not find easily.
As a rule of thumb, stick with `ELECTION`, but if deploying in an environment facing many prolonged network partitions, consider `ELECTION2`.

In our implementation, only the view coordinator executes the pre-voting.
Only a single pre-voting phase is necessary to run as many election rounds as needed.
After the pre-voting, everything follows the base ELECTION algorithm.

The coordinator running the pre-vote sends a _PreVoteRequest_ to all members of the current view.
After receiving a _PreVoteResponse(Leader)_ from all the nodes, the coordinator executes the verification to start the ELECTION algorithm.
Each node replies with its current leader, and the coordinator parses the response and runs the ELECTION algorithm when:

* The majority does not have a leader. Or;
* The majority has the same suspected leader as the coordinator.

In case there is a majority of nodes is seeing a leader __L__:

* __L__ is not in the coordinator's view;
* __L__ is in the view but replied it does not see itself as the leader.

Oversimplifying, the majority has a suspected leader or sees a leader that stepped down.
Meeting any of these requirements, the coordinator proceeds and starts the ELECTION algorithm.

The pre-voting phase of `ELECTION2` adds a delay before starting the leader election algorithm, which is the downside of `ELECTION2`.
In addition to the algorithm based on view changes, `ELECTION2` should provide a robust and stable leader election algorithm, covering the issue of `ELECTION`.

== References

* [[[cloudflare-outage,1]]] Lianza, T., &amp; Snook, C. (2020, November 27). A Byzantine failure in the real world. The Cloudflare Blog. https://blog.cloudflare.com/a-byzantine-failure-in-the-real-world/
* [[[ongaro-dissertation,2]]] Ongaro, D. (2014). Consensus: Bridging theory and practice. Stanford University.
11 changes: 11 additions & 0 deletions doc/manual/blocks.adoc
Original file line number Diff line number Diff line change
Expand Up @@ -40,8 +40,19 @@ protected void start(String raft_id) throws Exception {

There's a demo `ReplicatedStateMachineDemo` which can be used to interactively use `ReplicatedStateMachine`.

==== Reads and consensus

`ReplicatedStateMachine` has a configurable behavior for reads, using quorum reads by default. This means that all read
requests are sent through `RAFT` and require a majority to complete. This provides a linearizable read, but, on the other
hand, it takes longer to complete.

If an application does not require linearizable reads, it can change the behavior and only read the value locally,
possibly stale, but having a faster response since the request does not go through `RAFT`.

The behavior is changed by calling `ReplicatedStateMachine.allowDirtyReads(boolean)`.

NOTE: A quorum read creates an entry in the persistent log. See
https://github.com/belaban/jgroups-raft/issues/18[Issue 18] for details.

[[CounterService]]
=== CounterService
Expand Down
2 changes: 1 addition & 1 deletion doc/manual/manual.adoc
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,7 @@ Distributed consensus with jgroups-raft
:source-highlighter: pygments
:iconsdir: ./images/icons

Copyright Red Hat 2014 - 2020
Copyright Red Hat 2014 - 2023

This document is licensed under the
http://creativecommons.org/licenses/by-sa/3.0/us/legalcode["Creative Commons Attribution-ShareAlike (CC-BY-SA) 3.0"]
Expand Down
3 changes: 2 additions & 1 deletion doc/manual/overview.adoc
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,8 @@ machines receive the same ordered stream of updates and thus have the exact same

Raft favors _consistency_ over _availability_; in terms of the http://en.wikipedia.org/wiki/CAP_theorem[Cap theorem],
jgroups-raft is a CP system. This means jgroups-raft is highly consistent, and the data replicated to nodes will never
diverge, even in the face of network partitions (split brains).
diverge, even in the face of network partitions (split brains), or restarts. Or, on an extended version, jgroups-raft
provides the means to build PC/EC systems concerning the https://en.wikipedia.org/wiki/PACELC_theorem[PACELC theorem].

In case of a network partition, in a cluster of `N` nodes, at least `N/2+1` nodes have to be running for the
system to be available.
Expand Down
15 changes: 15 additions & 0 deletions doc/manual/protocols.adoc
Original file line number Diff line number Diff line change
Expand Up @@ -69,6 +69,21 @@ Its attributes define the election timeout and the heartbeat interval (see Raft
${ELECTION}


[[ELECTION2]]
=== ELECTION2

`ELECTION2` is an alternative election algorithm.
It builds on top of <<ELECTION>> to include a pre-vote mechanism.
The pre-vote runs before delegating to the algorithm of <<ELECTION>>.

By design, <<ELECTION>> uses view changes to start election rounds and should be stable without interruptions.
`ELECTION2` is an alternative in networks with recurrent partitions that could lead to more disruptions with unnecessary election rounds.
More information about how it works is available in the design documents.


${ELECTION2}


[[RAFT]]
=== RAFT

Expand Down
23 changes: 22 additions & 1 deletion doc/manual/using.adoc
Original file line number Diff line number Diff line change
@@ -1,6 +1,20 @@

== Using jgroups-raft

Since jgroups-raft build on JGroups, we follow similar http://www.jgroups.org/manual5/index.html#Requirements[requirements].
Currently, jgroups-raft requires JDK 11. jgroups-raft can be included in the POM:

[source,xml]
----
<dependency>
<groupId>org.jgroups</groupId>
<artifactId>jgroups-raft</artifactId>
<version>X.Y.Z.Final</version>
</dependency>
----

Where `X`=major, `Y`=minor, and `Z`=patch.
The tags are available on https://github.com/belaban/jgroups-raft/tags[GitHub releases].

=== Cluster members and identity

Expand Down Expand Up @@ -133,6 +147,12 @@ grow to until a snapshot is taken.

Other methods in `RaftHandle` include:

addServer(String server):: Asynchronously adds a new member to the cluster with id `server` and returns a
`CompletableFuture<byte[]>`. See more about membership change in <<DynamicMembership>>.

removeServer(String server):: Asynchronously removes a member from the cluster with id `server` and returns a
`CompletableFuture<byte[]>`. See more about membership change in <<DynamicMembership>>.

leader():: Returns the address of the current Raft leader, or null if there is no leader (e.g. in case there was no
majority to elect a leader)
isLeader():: Whether or not the current member is the leader
Expand Down Expand Up @@ -237,7 +257,8 @@ The steps to add a member are as follows (say we have `RAFT.members="A,B,C"` and
* If not, members will read the correct membership when getting initialized by their logs
* A new member `D` can now be started (its XML config needs to have the correct `members` attribute !)


Notice that membership changes survive through restarts. If a node must be removed or added, an operation must be
submitted, only restarting does not affect membership.



3 changes: 2 additions & 1 deletion src/org/jgroups/protocols/raft/RAFT.java
Original file line number Diff line number Diff line change
Expand Up @@ -319,7 +319,8 @@ public RAFT members(Collection<String> list) {
return this;
}

@Property public List<String> members() {
@ManagedAttribute(description = "The current list of members")
public List<String> members() {
return internal_state.getMembers();
}

Expand Down
Loading

0 comments on commit 596488c

Please sign in to comment.