
Make the merkle tree implementation perform better. #2273

Merged: 198 commits from merkle-perf into master on Jul 23, 2024
Conversation

@eljobe (Member) commented May 3, 2024

Here are the main high-level ideas that this PR implements:

  1. Stop throwing away the merkle tree during Memory::resize() calls.
  2. Make Merkle have internal mutability. This means using Arc<Mutex<...>> to wrap the mutable internal data structures so that they are thread-safe and only change atomically from the point of view of the package's public API.
  3. Create a way to track which parts of the Merkle tree have been made "dirty" (because of `set` calls). The root hash of the tree, and all dirty branches, are then recalculated on demand just before calculating the proof.
  4. Hard-code constant values for the empty hashes at various layers based on the MerkleType of the tree. This makes constructing a tree faster because we don't have to calculate and populate the empty hashes for each layer.
  5. Introduce some benchmarks, a binary for benchmarking the merkle tree's behavior, and some tests.

Put it all together, and we're starting to see real performance improvements. Those are largely documented in the Linear issue linked below.

Related to https://linear.app/offchain-labs/issue/NIT-2411/arbitrator-optimizations
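
To make ideas 2 and 3 concrete, here is a minimal sketch of the shape of the API, with hypothetical names and simplified types rather than the actual nitro code:

```
use std::sync::{Arc, Mutex};

type Bytes32 = [u8; 32];

struct Layers {
    data: Vec<Vec<Bytes32>>,       // layer 0 = leaves, last layer = root
    dirty_leaf_parents: Vec<bool>, // which leaf parents need rehashing
}

#[derive(Clone)]
pub struct Merkle {
    layers: Arc<Mutex<Layers>>, // interior mutability behind the public API
}

impl Merkle {
    // `set` is cheap: store the new leaf and mark its parent dirty.
    pub fn set(&self, idx: usize, hash: Bytes32) {
        let mut layers = self.layers.lock().unwrap();
        layers.data[0][idx] = hash;
        layers.dirty_leaf_parents[idx >> 1] = true; // rehashing is deferred
    }

    // Reading the root (e.g. while building a proof) pays off the debt.
    pub fn root(&self) -> Bytes32 {
        let mut layers = self.layers.lock().unwrap();
        // ... recompute hashes along dirty paths and clear the dirt here ...
        *layers.data.last().unwrap().last().unwrap()
    }
}
```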

rauljordan and others added 30 commits April 21, 2024 04:19
Cherry-Picked from OffchainLabs/nitro@flatmerkleapril16
It can be easy to look at small average step and hash times and miss
that the total time is what we're really trying to reduce.
I definitely had some incorrect assumptions about this data structure
which made it more difficult to learn. So, I'm documenting how it
works and adding some tests.

The simple_merkle test is currently failing because the `set` method
doesn't allow setting an index larger than the largest currently
set leaf's index.

There is some debate as to whether or not this is the correct
behavior. To run the test, use:

```
$> cargo test -- --include-ignored
```
At this point, the new root hash is eagerly calculated after each call to `extend`.
If this happened frequently, it should really improve the performance
of the machine.  However, it looks like it doesn't happen at all with
the benchmark inputs.
This change introduces internal mutability to the merkle tree
implementation in the form of a dirty_indices HashSet<usize> which
keeps track of the indices of the parent nodes for leaves
which have changed since the last time the hashes for all of
the non-leaf nodes in the tree were calculated.

The `set` method was changed to only modify the leaf's value and
add the parent node to the set of dirty indices.

The hashes of all the non-leaf nodes which are dirty and all their
ancestors up to the root node are recalculated just before returning
the value of the root.

This should allow us to reuse previously allocated merkle trees
rather than having to throw them out and recalculate the hashes
of all the nodes every time a leaf's value is changed.
Previously, it could hit an index out of bounds if the new leaves
caused any parent layer to grow beyond its current size.
Hopefully, this will allow us to compare this branch's implementation
of a merkle tree to the one on merkle-perf-a.
This gives us ordered traversal of the dirty indices.

The performance benchmarks show that this severely regresses over the
implementation on merkle-perf.
The previous implementation was growing the same layers and
dirty_indices arrays because the clone isn't deep (I guess).
This was just for troubleshooting.
Specifically, this pre-allocates the dirty HashSets for each layer
instead of having to allocate new ones each time rehash is called.
There are a few different things going on in this commit.

1. I've added some counters for when methods get called on the Merkle
tree.

2. I've added integration with gperftools for profiling specific areas
of the code.
This allows me to profile CPU and Heap independently, and to enable
and disable the call counters independently.
This part of the code is obviously slow.
Let's see if we can improve it.
This nets a pretty variable 4%-7% savings in the micro benchmark.
This saves every instance of a Merkle Tree from having to initialize
its own vector of empty hashes.
This part of the code is obviously slow.
Let's see if we can improve it.
This allows the most efficient lookup possible for empty layer hashes.
Yay! Tests caught typos.
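
The hard-coded empty hashes these commits describe amount to a table like the following sketch. The PR bakes the values in as constants per MerkleType (whose type-specific node hashing and depth bound are elided here); this sketch instead computes the table once per process with OnceLock, which achieves the same "initialize once, share everywhere" effect:

```
use std::sync::OnceLock;

use sha3::{Digest, Keccak256};

type Bytes32 = [u8; 32];

// Assumed depth bound, just for this sketch.
const MAX_LAYERS: usize = 32;

// Hash of an all-empty subtree at each layer, shared by every tree
// instead of being recalculated per Merkle instance. Layer 0 is an
// empty leaf (all zeroes here).
fn empty_hash_at(layer: usize) -> Bytes32 {
    static TABLE: OnceLock<[Bytes32; MAX_LAYERS]> = OnceLock::new();
    TABLE.get_or_init(|| {
        let mut table = [[0u8; 32]; MAX_LAYERS];
        for i in 1..MAX_LAYERS {
            // An empty node is the hash of two empty children.
            let mut h = Keccak256::new();
            h.update(table[i - 1]);
            h.update(table[i - 1]);
            table[i] = h.finalize().into();
        }
        table
    })[layer]
}
```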
eljobe added 7 commits July 18, 2024 13:58
This should always be the case because the dirt is cleaned up in
order. But this assertion can catch it if we get it wrong.
Keeping track of the layers_i wasn't really making the code easier to
read.
This makes the code a little cleaner, and it performs the same or slightly better.

Before:
```
Running benchmark with always merkleize feature on
avg hash time   226.487µs, avg step time        226ns, step size         1, num_iters 200, total time  45.356958ms
avg hash time   285.669µs, avg step time      3.454µs, step size      1024, num_iters 200, total time  57.836958ms
avg hash time   673.086µs, avg step time    141.763µs, step size     32768, num_iters 200, total time    162.983ms
avg hash time   800.159µs, avg step time   3.333175ms, step size   1048576, num_iters 200, total time 826.676667ms
avg hash time  2.394079ms, avg step time  54.744735ms, step size  16777216, num_iters 134, total time 7.711356917s
avg hash time  6.981576ms, avg step time 221.284475ms, step size  67108864, num_iters  33, total time    7.754069s
avg hash time 23.845249ms, avg step time 826.907254ms, step size 268435456, num_iters   8, total time 7.632934667s
```

After:
```
Running benchmark with always merkleize feature on
avg hash time   223.077µs, avg step time        196ns, step size         1, num_iters 200, total time  44.670042ms
avg hash time   283.239µs, avg step time      4.477µs, step size      1024, num_iters 200, total time  57.556875ms
avg hash time   631.034µs, avg step time    139.475µs, step size     32768, num_iters 200, total time 154.115083ms
avg hash time   813.914µs, avg step time   3.357342ms, step size   1048576, num_iters 200, total time    834.265ms
avg hash time   2.39359ms, avg step time  55.253016ms, step size  16777216, num_iters 134, total time 7.779911875s
avg hash time  6.768607ms, avg step time 222.297451ms, step size  67108864, num_iters  33, total time 7.781483917s
avg hash time 25.057057ms, avg step time 840.610754ms, step size 268435456, num_iters   8, total time 7.765957833s
```
This commit also caught an error in the logic of the resize()
implementation, and fixes it.
@eljobe (Member, Author) left a comment:
Addressing a slew of review comments. Hot-potato back to @PlasmaPower.

Review threads (all since resolved) covered arbitrator/arbutil/src/types.rs, arbitrator/prover/src/memory.rs, arbitrator/prover/src/machine.rs, and arbitrator/prover/src/merkle.rs.
@PlasmaPower (Collaborator) previously approved these changes Jul 19, 2024:

LGTM! I have a couple of follow-up optimizations but these can be done in a new PR.

On the following snippet in `rehash`:

```
// Replace the dirty leaf parents with clean ones.
let mut dirt = std::mem::replace(
    &mut layers.dirty_leaf_parents,
    bitvec![0; (layers.data[0].len() + 1) >> 1],
);
```
@PlasmaPower (Collaborator) commented:
This requires reallocating a new bitvec. Instead, I think you can just iterate over `layers.dirty_leaf_parents.iter_ones()` and then, at the end of the function, call `layers.dirty_leaf_parents.fill(false)`.

@eljobe (Member, Author) replied:
I'm probably just too much of a Rust newbie, but I cannot figure out how to do this without angering the borrow checker.

The problem is that I want a local variable that I can use in the inner for loop (i.e., `dirt.iter_ones()`) and to which I can assign the value of `new_dirt` at the end of that inner loop, without affecting `layers.dirty_leaf_parents`.

But, if I simply:

`let mut dirt = &mut layers.dirty_leaf_parents`

then, when I try to:

`dirt = new_dirt`, the compiler tells me I've got mismatched types because `dirt` is a `&mut BitVec` and `new_dirt` is just a `BitVec`.
But, if I try to dereference `dirt` for the assignment (`*dirt = new_dirt`), then it overwrites the value of `layers.dirty_leaf_parents`, which means that when I go to `fill(false)` at the end of the outer for loop, the `BitVec` is not big enough.
If I try to take a mutable reference to `new_dirt` (`dirt = &mut new_dirt`), the compiler is unhappy because `new_dirt` is about to drop out of scope (or maybe it's because of duplicate mutable borrows; I don't remember).

I suppose I could capture the size of `layers.dirty_leaf_parents` before the outer for loop, and then call `layers.dirty_leaf_parents.fill(false)` followed by `layers.dirty_leaf_parents.resize(orig_len, false)` at the end of the outer loop.

Another idea I had, but discarded because the code was really ugly, was to conditionally iterate over either `layers.dirty_leaf_parents.iter_ones()` or `dirt.iter_ones()` depending on whether or not `layer_i == 1`, but that feels sloppy.

For now, I'm just using `let mut dirt = layers.dirty_leaf_parents.clone()`, but doesn't that also require allocating a new bitvec?
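
(For illustration, here is a sketch of the loop shape under discussion, using the clone workaround from the last paragraph; the field names are hypothetical and the hashing itself is elided.)

```
use bitvec::prelude::*;

type Bytes32 = [u8; 32];

struct Layers {
    data: Vec<Vec<Bytes32>>, // layer 0 = leaves
    dirty_leaf_parents: BitVec,
}

fn rehash(layers: &mut Layers) {
    // The clone sidesteps the borrow-checker conflicts described above,
    // at the cost of one allocation per call.
    let mut dirt = layers.dirty_leaf_parents.clone();
    for layer_i in 1..layers.data.len() {
        // One dirty bit per parent of this layer's nodes.
        let mut new_dirt = bitvec![0; (layers.data[layer_i].len() + 1) >> 1];
        for i in dirt.iter_ones() {
            // ... recompute layers.data[layer_i][i] from its children ...
            new_dirt.set(i >> 1, true);
        }
        dirt = new_dirt; // both are owned BitVecs, so this reassigns cleanly
    }
    layers.dirty_leaf_parents.fill(false); // still full length: clear in place
}
```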

Further review threads (all since resolved) covered arbitrator/prover/src/memory.rs and arbitrator/prover/src/merkle.rs.
@PlasmaPower (Collaborator) commented:
Looks like the clippy linter has some concerns https://github.com/OffchainLabs/nitro/actions/runs/10022178302/job/27701676765?pr=2273#step:20:462

I don't like the last lint; let's either disable it (as we disable others in lib.rs) or just make it into an else branch.

@eljobe (Member, Author) commented Jul 23, 2024:

> Looks like the clippy linter has some concerns https://github.com/OffchainLabs/nitro/actions/runs/10022178302/job/27701676765?pr=2273#step:20:462
>
> I don't like the last lint; let's either disable it (as we disable others in lib.rs) or just make it into an else branch.

Cleaned up the lint.

@eljobe (Member, Author) commented Jul 23, 2024:

@PlasmaPower, I'm going to need one more review pass from you.
I'm too new to Rust (or just having a brain fart) and cannot figure out how to avoid allocating a new bitvec inside `rehash`.
Take a look at my longish reply above for what I've tried and why it failed.

Also, I don't quite follow whether you are suggesting changing `layers.data` from a `Vec<Vec<Bytes32>>` to a `Vec<u8>` or a `Vec<Vec<u8>>`, but I reckon you probably meant the latter. I also don't fully understand how that could make hashing the layers more efficient, but I can explore it on my own in a follow-up PR if you are pressed for time and don't want to explain it in more detail here.

@eljobe requested a review from PlasmaPower July 23, 2024 14:39
@PlasmaPower (Collaborator) left a comment:

LGTM

@eljobe enabled auto-merge July 23, 2024 15:58
@eljobe merged commit 4c74ed8 into master Jul 23, 2024
14 checks passed
@eljobe deleted the merkle-perf branch July 23, 2024 16:01
Labels: design-approved, s (automatically added by the CLA bot if the creator of a PR is registered as having signed the CLA)

4 participants