Add BENCHMARKING.md file and more comments
Add links to the original paper, and an explanation of the overall
design to the implementation file.
kov committed Nov 2, 2024
1 parent 7485daf commit 21e7a07
Showing 2 changed files with 101 additions and 0 deletions.
63 changes: 63 additions & 0 deletions BENCHMARKING.md
@@ -0,0 +1,63 @@
# Benchmarking diff

The engine used by our diff tool tries to balance execution time with patch
quality. It implements the Myers algorithm with a few heuristics which are also
used by GNU diff to avoid pathological cases.

The original paper can be found here:
- https://link.springer.com/article/10.1007/BF01840446
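
To give a feel for the core idea, here is a minimal sketch of the forward-only greedy search described in the paper. It is an illustration, not the code in `src/engine.rs`: the function name `shortest_edit_len` is made up for this example, it only computes the length of the shortest edit script, and it has neither the double-ended search nor the heuristics mentioned above.

```rust
/// Length of the shortest edit script (insertions plus deletions) that turns
/// `a` into `b`. Hypothetical helper for illustration only.
fn shortest_edit_len<T: PartialEq>(a: &[T], b: &[T]) -> usize {
    let (n, m) = (a.len() as isize, b.len() as isize);
    let max = n + m;
    // Diagonals are numbered k = x - y, from -max to max; `offset` maps them
    // to non-negative indices with one spare slot on each side.
    let offset = max + 1;
    // v[k + offset] is the furthest x reached so far on diagonal k.
    let mut v = vec![0isize; (2 * max + 3) as usize];

    for d in 0..=max {
        let mut k = -d;
        while k <= d {
            let idx = (k + offset) as usize;
            // Either step down from diagonal k + 1 (an insertion) or step
            // right from diagonal k - 1 (a deletion), whichever got further.
            let mut x = if k == -d || (k != d && v[idx - 1] < v[idx + 1]) {
                v[idx + 1]
            } else {
                v[idx - 1] + 1
            };
            let mut y = x - k;
            // Follow the run of equal items (the 'snake') for free.
            while x < n && y < m && a[x as usize] == b[y as usize] {
                x += 1;
                y += 1;
            }
            v[idx] = x;
            if x >= n && y >= m {
                return d as usize;
            }
            k += 2;
        }
    }
    max as usize
}
```

Note that only one value per diagonal is stored, rather than a full matrix, and that the number of outer iterations grows with how different the inputs are. That is why very different inputs are the expensive case.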

Currently, not all tricks used by GNU diff are adopted by our implementation.
For instance, GNU diff will isolate lines that exist in only one of the files
and exclude them from the diffing process. It also post-processes the edits to
produce more cohesive hunks. Combined, these should make it produce better
patches for large files which are very different.
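
As a rough illustration of the first trick, the sketch below keeps only the lines that occur in both inputs, so that a diff engine would have less to search through. This is a simplified take on the idea, not GNU diff's actual logic and not something our implementation does today; `shared_line_indices` is a hypothetical name.

```rust
use std::collections::HashSet;

/// For each input, returns the indices of the lines that also occur somewhere
/// in the other input. Lines not listed exist in only one file, so they can be
/// reported as pure insertions or deletions without being searched.
fn shared_line_indices(a: &[&str], b: &[&str]) -> (Vec<usize>, Vec<usize>) {
    let in_a: HashSet<&str> = a.iter().copied().collect();
    let in_b: HashSet<&str> = b.iter().copied().collect();

    let keep_a = a
        .iter()
        .enumerate()
        .filter(|(_, line)| in_b.contains(*line))
        .map(|(i, _)| i)
        .collect();
    let keep_b = b
        .iter()
        .enumerate()
        .filter(|(_, line)| in_a.contains(*line))
        .map(|(i, _)| i)
        .collect();

    (keep_a, keep_b)
}
```

The returned indices let the caller map results on the reduced inputs back to the original line numbers.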

Run `cargo build --release` before benchmarking after you make a change!

## How to benchmark

It is recommended that you use the 'hyperfine' tool to run your benchmarks. This
is an example of how to run a comparison with GNU diff:

```
> hyperfine -N -i --warmup 2 --output=pipe 'diff t/huge t/huge.3' \
    './target/release/diffutils diff t/huge t/huge.3'
Benchmark 1: diff t/huge t/huge.3
Time (mean ± σ): 136.3 ms ± 3.0 ms [User: 88.5 ms, System: 17.9 ms]
Range (min … max): 131.8 ms … 144.4 ms 21 runs
Warning: Ignoring non-zero exit code.
Benchmark 2: ./target/release/diffutils diff t/huge t/huge.3
Time (mean ± σ): 74.4 ms ± 1.0 ms [User: 47.6 ms, System: 24.9 ms]
Range (min … max): 72.9 ms … 77.1 ms 41 runs
Warning: Ignoring non-zero exit code.
Summary
./target/release/diffutils diff t/huge t/huge.3 ran
1.83 ± 0.05 times faster than diff t/huge t/huge.3
>
```

As you can see, you should provide both commands you want to compare in a single
invocation of 'hyperfine', each as a single argument, so use quotes. These are
the relevant parameters:

- -N: avoids using a shell as intermediary to run the command
- -i: ignores non-zero exit code, which diff uses to mean files differ
- --warmup 2: 2 runs before measuring, warms up I/O cache for large files
- --output=pipe: disables any potential optimizations based on output destination

## Inputs

Performance will vary based on several factors, the main ones being:

- how large the files being compared are
- how different the files being compared are
- how large and far between sequences of equal lines are

When looking at performance improvements, it is important to test both small
and large (tens of MBs) files that have few differences, many differences, or
are completely different, to cover all of the potential pathological cases.
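
One way to get such inputs is to generate them. The sketch below is only a suggestion: the names `Lcg`, `base_lines` and `mutate` are invented for this example, and the tiny pseudo-random generator is there just to avoid pulling in an external crate. It builds a list of lines and a copy in which a chosen percentage of lines has been rewritten.

```rust
/// Small deterministic pseudo-random generator (a linear congruential
/// generator), good enough for producing test data.
struct Lcg(u64);

impl Lcg {
    fn next_u64(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0 >> 33
    }
}

/// Produces `count` lines, each different from the others.
fn base_lines(count: usize) -> Vec<String> {
    (0..count)
        .map(|i| format!("line {i}: some payload text"))
        .collect()
}

/// Returns a copy of `lines` in which each line is rewritten with a
/// probability of roughly `percent` in 100. Use 0 for identical files and
/// 100 for completely different ones.
fn mutate(lines: &[String], percent: u64, seed: u64) -> Vec<String> {
    let mut rng = Lcg(seed);
    lines
        .iter()
        .map(|line| {
            if rng.next_u64() % 100 < percent {
                format!("{line} (changed {})", rng.next_u64())
            } else {
                line.clone()
            }
        })
        .collect()
}
```

Joining each result with newlines and writing it out with `std::fs::write` gives pairs of files to pass to the two commands being compared. Varying the line count and `percent` covers the small/large and few/many/all differences combinations.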
38 changes: 38 additions & 0 deletions src/engine.rs
@@ -3,6 +3,44 @@
// For the full copyright and license information, please view the LICENSE-*
// files that was distributed with this source code.

// This engine implements the Myers diff algorithm, which uses a double-ended
// diagonal search to identify the longest common subsequence (LCS) between two
// collections. The original paper can be found here:
//
// https://link.springer.com/article/10.1007/BF01840446
//
// Unlike a naive LCS implementation, which covers all possible combinations,
// the Myers algorithm gradually expands the search space, and only encodes
// the furthest progress made by each diagonal rather than storing each step
// of the search on a matrix.
//
// This makes it a lot more memory-efficient, as it only needs 2 * (m + n)
// positions to represent the state of the search, where m and n are the number
// of items in the collections being compared, whereas the naive LCS requires
// m * n positions.
//
// The downside is it is more compute-intensive than the naive method when
// searching through very different files. This may lead to unacceptable run
// time in pathological cases (large, completely different files), so heuristics
// are often used to bail on the search if it gets too costly and/or a good enough
// subsequence has been found.
//
// We implement 3 main heuristics that are also used by GNU diff:
//
// 1. if we found a large enough common subsequence (also known as a 'snake')
// and have searched for a while, we return that one
//
// 2. if we have searched for a significant chunk of the collections (with a
// minimum of 4096 iterations, so we cover easy cases fully) and have not found
// one, we use whatever we have, even if it is a small snake or no snake at all
//
// 3. we keep track of the overall cost of the various searches that are done
// over the course of the divide and conquer strategy, and if that becomes too
// large we give up on trying to find long similarities altogether
//
// This last heuristic could be improved significantly in the future if we
// implement an optimization that identifies items that appear in only one
// collection and removes them from the diffing process, like GNU diff does.
use std::fmt::Debug;
use std::ops::{Index, IndexMut, RangeInclusive};
