From 21e7a0748d1fe92c0caf851f42ef700e9112d1e2 Mon Sep 17 00:00:00 2001
From: Gustavo Noronha Silva
Date: Fri, 1 Nov 2024 10:18:51 -0300
Subject: [PATCH] Add BENCHMARKING.md file and more comments

Add links to the original paper, and an explanation of the overall
design to the implementation file.
---
 BENCHMARKING.md | 63 +++++++++++++++++++++++++++++++++++++++++++++++++
 src/engine.rs   | 38 +++++++++++++++++++++++++++++
 2 files changed, 101 insertions(+)
 create mode 100644 BENCHMARKING.md

diff --git a/BENCHMARKING.md b/BENCHMARKING.md
new file mode 100644
index 0000000..01a9736
--- /dev/null
+++ b/BENCHMARKING.md
@@ -0,0 +1,63 @@
+# Benchmarking diff
+
+The engine used by our diff tool tries to balance execution time with patch
+quality. It implements the Myers algorithm with a few heuristics which are also
+used by GNU diff to avoid pathological cases.
+
+The original paper can be found here:
+- https://link.springer.com/article/10.1007/BF01840446
+
+Currently, not all tricks used by GNU diff are adopted by our implementation.
+For instance, GNU diff isolates lines that exist in only one of the files and
+excludes them from the diffing process. It also post-processes the edits to
+produce more cohesive hunks. Both of these combined should make it produce
+better patches for large files which are very different.
+
+Run `cargo build --release` before benchmarking whenever you make a change!
+
+## How to benchmark
+
+It is recommended that you use the 'hyperfine' tool to run your benchmarks. This
+is an example of how to run a comparison with GNU diff:
+
+```
+> hyperfine -N -i --warmup 2 --output=pipe 'diff t/huge t/huge.3' \
+'./target/release/diffutils diff t/huge t/huge.3'
+Benchmark 1: diff t/huge t/huge.3
+  Time (mean ± σ):     136.3 ms ±   3.0 ms    [User: 88.5 ms, System: 17.9 ms]
+  Range (min … max):   131.8 ms … 144.4 ms    21 runs
+
+  Warning: Ignoring non-zero exit code.
+
+Benchmark 2: ./target/release/diffutils diff t/huge t/huge.3
+  Time (mean ± σ):      74.4 ms ±   1.0 ms    [User: 47.6 ms, System: 24.9 ms]
+  Range (min … max):    72.9 ms …  77.1 ms    41 runs
+
+  Warning: Ignoring non-zero exit code.
+
+Summary
+  ./target/release/diffutils diff t/huge t/huge.3 ran
+    1.83 ± 0.05 times faster than diff t/huge t/huge.3
+>
+```
+
+As you can see, you should provide both commands you want to compare in a single
+invocation of 'hyperfine', each as a single argument, so use quotes. These are
+the relevant parameters:
+
+- -N: avoids using a shell as intermediary to run the command
+- -i: ignores non-zero exit codes, which diff uses to signal that the files differ
+- --warmup 2: performs 2 runs before measuring, which warms up the I/O cache
+- --output=pipe: disables any potential optimizations based on output destination
+
+## Inputs
+
+Performance will vary based on several factors, the main ones being:
+
+- how large the files being compared are
+- how different the files being compared are
+- how long the sequences of equal lines are and how far apart they occur
+
+When looking at performance improvements, it is important to test small and large
+(tens of MBs) files that have few differences, many differences, or are completely
+different, in order to cover all of the potential pathological cases.
diff --git a/src/engine.rs b/src/engine.rs
index a6ef0da..56d9d14 100644
--- a/src/engine.rs
+++ b/src/engine.rs
@@ -3,6 +3,44 @@
 // For the full copyright and license information, please view the LICENSE-*
 // files that was distributed with this source code.
 
+// This engine implements the Myers diff algorithm, which uses a double-ended
+// diagonal search to identify the longest common subsequence (LCS) between two
+// collections. The original paper can be found here:
+//
+// https://link.springer.com/article/10.1007/BF01840446
+//
+// Unlike a naive LCS implementation, which covers all possible combinations,
+// the Myers algorithm gradually expands the search space, and only encodes
+// the furthest progress made by each diagonal rather than storing each step
+// of the search in a matrix.
+//
+// This makes it a lot more memory-efficient, as it only needs 2 * (m + n)
+// positions to represent the state of the search, where m and n are the number
+// of items in the collections being compared, whereas the naive LCS requires
+// m * n positions.
+//
+// The downside is that it is more compute-intensive than the naive method when
+// searching through very different files. This may lead to unacceptable run
+// time in pathological cases (large, completely different files), so heuristics
+// are often used to bail on the search if it gets too costly and/or a good enough
+// subsequence has been found.
+//
+// We implement 3 main heuristics that are also used by GNU diff:
+//
+// 1. if we have found a large enough common subsequence (also known as a 'snake')
+//    and have searched for a while, we return that one
+//
+// 2. if we have searched for a significant chunk of the collections (with a
+//    minimum of 4096 iterations, so we cover easy cases fully) and have not found
+//    one, we use whatever we have, even if it is a small snake or no snake at all
+//
+// 3. we keep track of the overall cost of the various searches that are done
+//    over the course of the divide and conquer strategy, and if that becomes too
+//    large we give up on trying to find long similarities altogether
+//
+// This last heuristic could be improved significantly in the future if we
+// implement an optimization that separates items that appear in only one of the
+// collections and removes them from the diffing process, like GNU diff does.
 use std::fmt::Debug;
 use std::ops::{Index, IndexMut, RangeInclusive};