Added documentation.
brunoffranca committed Oct 26, 2024
1 parent ee3b4bb commit 8529cdf
Showing 4 changed files with 41 additions and 7 deletions.
35 changes: 33 additions & 2 deletions spec/README.md
@@ -1,3 +1,34 @@
# ChonkyBFT's Specification
# ChonkyBFT

This is a formal specification of ChonkyBFT consensus protocol in Quint.
This folder contains the specification of ChonkyBFT, a new consensus protocol created by Bruno França and Grzegorz Prusak at Matter Labs. It includes both the pseudo-code specification that served as the basis for the Rust implementation in the rest of this repo and the Quint specification that was used to formally verify the protocol.
ChonkyBFT is a consensus protocol inspired by [FaB Paxos](https://www.cs.cornell.edu/lorenzo/papers/Martin06Fast.pdf), [Fast-HotStuff](https://arxiv.org/abs/2010.11454) and [HotStuff-2](https://eprint.iacr.org/2023/397).
It is committee-based and has a single round of voting, single-slot finality, quadratic communication and _n=5f+1_ fault tolerance. Let's discuss what our objectives were when designing ChonkyBFT.

## Design goals in practice vs. theory

We find that most recent research on consensus algorithms unfortunately has become somewhat detached from the realities of running those same consensus algorithms in practice. This has led to researchers optimizing algorithms along the wrong dimensions. Many times we see tables in papers comparing different algorithms along metrics that genuinely don’t matter when those algorithms are implemented.

### What doesn’t matter

- Authenticator complexity: This is probably the worst one. Optimizing for fewer signatures made sense decades ago, when crypto operations were expensive. Today, digital signatures are fast and small. However, many papers (for example HotStuff) still report this measure and even go as far as suggesting threshold signatures over multisignatures, which introduces the much more complex step of distributed key generation just to avoid spending a few more milliseconds verifying signatures.
- Message complexity: This also tends to be a red herring. In theory, the fewer messages passed around the network, the faster the algorithm will be. In practice, it depends on where the bottleneck is. If your algorithm has linear communication, but the leader still has to send and receive N messages, then you are not gaining any meaningful performance. This metric also has the unfortunate effect of treating every message the same, while in practice a block proposal can be megabytes long and a block commit is a few kilobytes at most.
- Block latency: This is the wrong latency to consider. It doesn’t matter if our block time is 0.1s if we then have to wait 100 blocks to finalize. All that matters is how long it takes for a user to see their transaction finalized. This has led to algorithms like Narwhal and Tusk, which claim to have just one round of voting but have another round “hidden” in the block broadcast mechanism. This actually leads to a worse latency for the user, even though the block times are shorter.

### What does matter

- Systemic complexity: This relates to the [systemic vs. encapsulated complexity](https://vitalik.eth.limo/general/2022/02/28/complexity.html) topic. Our consensus algorithms are not run in isolation; they are meant to support other applications. An example of this problem is probabilistic vs. provable finality. Algorithms that finalize probabilistically impose complexity on the applications: exchanges must determine how many confirmations to wait for on each chain they accept, and the same goes for multi-chain dapps, hybrid dapps, block explorers, wallets, etc. Algorithms that finalize provably give every application a clear signal that it can rely on. This is important enough that even Ethereum is planning to move to [single-slot finality](https://ethereum.org/en/roadmap/single-slot-finality/#why-aim-for-quicker-finality), because anything short of finalizing every block is not enough.
- Simplicity: This means simplicity of both modeling and implementing the algorithm. Your algorithm might be able to save one round-trip in an optimistic scenario, but is it worth it if it’s too complex to build a formal model of, and if the implementation then takes 4 engineers and 3 audits? Simple algorithms that can be formally proven and are straightforward to implement are more secure algorithms. A bug that causes downtime (or even worse, safety violations) is much worse for the UX than slightly slower block times.
- Transaction latency: As discussed above, the only latency that matters is the one experienced by the user.

## Lessons learned

For our particular use case, there are a few lessons that we learned from researching and implementing previous consensus algorithms:

- Chained consensus is not worth it. It doesn’t improve throughput or latency, while it does increase systemic complexity. We always finalize every block.
- Lower fault tolerance to reduce voting rounds. We learned this from FaB Paxos. Decreasing our fault tolerance from *3f+1* to *5f+1* allows us to finalize in just one voting round (see the quorum arithmetic sketch below).
- Linear communication is not worth it. Quadratic communication for replicas simplifies security (there are fewer cases where we need to consider the effect of a malicious leader), implementation (you can fully separate the leader component) and view changes (constant timeouts are enough, [Jolteon/Ditto](https://arxiv.org/abs/2106.10362) ended up going in that direction after trying to implement HotStuff). Further, the performance drop is likely not significant (see [ParBFT](https://eprint.iacr.org/2023/679.pdf)).
- Re-proposals as a way of guaranteeing that there are no “rogue” blocks. This is a problem that hasn’t received any attention so far (as far as we know), and is probably somewhat unique to public blockchains. The issue is that in all committee-based consensus algorithms it is possible for a commit QC (to use HotStuff’s terminology) to be formed without enough replicas receiving it. This will cause a timeout and another block to be proposed. Most algorithms just solve this by saying that the old block is no longer valid. All honest replicas will be in agreement about which block is canonical, but someone who just receives that single block and is not aware of the timeout will think that that particular block was finalized. This breaks the very desirable property of being able to verify that a given block is part of the chain just from seeing the block, without being required to have the entire chain. The way we solve this is to require that block proposals after a timeout (where a commit QC might have been formed) re-propose the previous block. This guarantees that if we see a block with a valid commit QC, then that block is part of the chain (maybe it wasn’t finalized in that particular view, but it was certainly finalized).
- Always justify messages to remove time dependencies. That’s something we got from Fast-HotStuff. Messages should carry enough information by themselves that any replica is capable of verifying their validity without any other information (with the exception of having previous blocks, but that’s external to the consensus algorithm anyway). If we don’t, then we introduce subtle timing dependencies. For example, Tendermint had a bug that was only discovered years later, where the solution was that the leader had to wait for the maximum network delay at the end of every round. If that wait doesn’t happen, the protocol can deadlock. Funnily enough, HotStuff-2 reintroduces this timing dependency in order to get rid of one round-trip, which significantly worsens the difficulty of modelling and implementing such a system.
- Make garbage collection and reconfiguration part of the algorithm. These are parts of the algorithm that will certainly be implemented. If we don’t specify and model them before, we will be left with awkwardly implementing them later on.

FaB Paxos satisfies the first 4 points and Fast-HotStuff satisfies the 5th. ChonkyBFT is basically FaB Paxos with some ideas from Fast-HotStuff/HotStuff-2.
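
As a back-of-the-envelope illustration of the *5f+1* point above, here is a small Rust sketch of the quorum intersection arithmetic. It assumes a FaB-Paxos-style quorum of *n-f* votes; the exact thresholds ChonkyBFT uses are the ones defined in the specification, not this sketch.

```rust
// Back-of-the-envelope quorum arithmetic for the n = 5f + 1 bound discussed
// above. We assume a FaB-Paxos-style quorum of n - f votes purely for
// illustration; the thresholds ChonkyBFT actually uses are the ones defined
// in the specification itself.
fn main() {
    for f in 0u64..=10 {
        let n = 5 * f + 1; // total replicas, of which at most f are faulty
        let quorum = n - f; // votes that can always be collected: 4f + 1

        // Any two quorums overlap in at least 2 * quorum - n replicas...
        let overlap = 2 * quorum - n;
        assert_eq!(overlap, 3 * f + 1);

        // ...of which at least overlap - f are honest.
        let honest_overlap = overlap - f;
        assert_eq!(honest_overlap, 2 * f + 1);

        // Having more honest vote-holders than faulty replicas in every
        // overlap is what lets a later leader detect a possibly-committed
        // block and re-propose it, so one voting round is enough.
        assert!(honest_overlap > f);
    }
    println!("quorum intersection holds for f = 0..=10");
}
```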
8 changes: 5 additions & 3 deletions spec/informal-spec/README.md
@@ -1,11 +1,13 @@
# ChonkyBFT Specification
# ChonkyBFT Informal Specification

This is a ChonkyBFT specification in pseudocode.
This is the ChonkyBFT specification in pseudocode.

We’ll assume there’s a static set of nodes. Each node has 3 components: replica, proposer and fetcher. They are modeled as concurrent tasks or actors. Proposer and fetcher can read the replica state, but can’t write to it.
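
To make this layout concrete, here is a minimal, hypothetical Rust sketch of one node: three concurrent tasks sharing a single replica state, with only the replica task allowed to write. The `RwLock`-based sharing and the single `view` field are illustrative, not the actual implementation in this repository.

```rust
// A minimal sketch of the node layout described above: three concurrent
// tasks sharing one replica state, where only the replica task may write.
use std::sync::{Arc, RwLock};
use std::thread;

#[derive(Default)]
struct ReplicaState {
    view: u64, // the view this replica is currently in
}

fn main() {
    let state = Arc::new(RwLock::new(ReplicaState::default()));
    let proposer_state = Arc::clone(&state); // read-only users of the state
    let fetcher_state = Arc::clone(&state);

    let replica = thread::spawn(move || {
        // Only the replica task mutates the state (e.g. advancing the view).
        state.write().unwrap().view += 1;
    });
    let proposer = thread::spawn(move || {
        let _view = proposer_state.read().unwrap().view; // read, never write
    });
    let fetcher = thread::spawn(move || {
        let _view = fetcher_state.read().unwrap().view; // read, never write
    });

    replica.join().unwrap();
    proposer.join().unwrap();
    fetcher.join().unwrap();
}
```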

There are a couple of considerations that are not described in the pseudo-code:

- **Network model**. Messages might be delivered out of order, but we don’t guarantee eventual delivery for *all* messages. Actually, our network only guarantees eventual delivery of the most recent message for each type. That’s because each replica only stores the last outgoing message of each type in memory, and always tries to deliver those messages whenever it reconnects with another replica.
- **Garbage collection**. We can’t store all messages; the goal here is to bound the number of messages that each replica stores, in order to avoid DoS attacks. We handle messages like this:
- `NewView` messages are never stored, so no garbage collection is necessary.
- We keep all `Proposal` messages until the proposal (or a proposal with the same block number) is finalized (which means any honest replica having both the `Proposal` and the corresponding `CommitQC`, we assume that any honest replica in that situation will immediately broadcast the block on the gossip network).
- We keep all `Proposal` messages until the proposal (or a proposal with the same block number) is finalized (which means some honest replica has both the `Proposal` and the corresponding `CommitQC`; we assume that any honest replica in that situation will immediately broadcast the block on the p2p network).
- We only store the newest `CommitVote` **and** `TimeoutVote` for each replica. Honest replicas only change views on QCs, so if they send a newer message, they must also have sent a `NewView` on the transition, which means we can just get the QC from that replica. Even if the other replicas don’t receive the QC, it will just trigger a reproposal.
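
As an illustration of the last rule, here is a hypothetical Rust sketch of a vote store that keeps only the newest `CommitVote` per replica (the `TimeoutVote` case is analogous). The types and fields are illustrative, not the ones used in this repository.

```rust
// Keep only the newest CommitVote per replica, so storage stays bounded by
// the committee size. Types and fields are illustrative.
use std::collections::HashMap;

type ReplicaId = u64;
type ViewNumber = u64;

struct CommitVote {
    view: ViewNumber,
    // ...block hash, signature, etc. omitted for brevity
}

#[derive(Default)]
struct VoteStore {
    newest: HashMap<ReplicaId, CommitVote>,
}

impl VoteStore {
    // Insert a vote, keeping only the newest one per sender.
    fn insert(&mut self, from: ReplicaId, vote: CommitVote) {
        let is_newer = self
            .newest
            .get(&from)
            .map_or(true, |old| vote.view > old.view);
        if is_newer {
            self.newest.insert(from, vote);
        }
    }
}

fn main() {
    let mut store = VoteStore::default();
    store.insert(1, CommitVote { view: 3 });
    store.insert(1, CommitVote { view: 2 }); // stale vote is dropped
    assert_eq!(store.newest[&1].view, 3);
}
```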
3 changes: 2 additions & 1 deletion spec/informal-spec/replica.rs
@@ -1,5 +1,6 @@
// Replica
//! Replica

// This is the state machine that moves the consensus forward.
struct ReplicaState {
// The view this replica is currently in.
view: ViewNumber,
2 changes: 1 addition & 1 deletion spec/protocol-spec/README.md
@@ -1,4 +1,4 @@
# ChonkyBFT
# ChonkyBFT Formal Specification

This page summarizes the scope of the Quint specification and the experiments we
have done with it. This Quint specification was prepared by Igor Konnov and
