
Add cmp utility #88

Merged: 4 commits, merged into uutils:main on Oct 1, 2024

Conversation


@kov kov commented Sep 25, 2024

The utility should support all the functionality of GNU cmp and perform quite a bit better. I split the commits for the actual utility into 3, with the base one being a full implementation of the features of GNU cmp, integration tests included.

The commits on top of that add optimizations which remove some, and then all, of the Rust std::fmt usage in the potentially long-running loop that runs when --verbose is passed in. I wanted to separate them, first because it gives us a good base for debugging potential issues with the optimized version, and secondly because I would understand the project preferring the slower version with more readable code.

That said, I do think it makes sense to adopt the optimized version, as the gains are massive: on the order of 100% on my M1 Max when comparing ~36M files that are completely different. The following tests were run after warming up the I/O cache:

Baseline - GNU cmp

     > hyperfine --warmup 1 -i --output=pipe \
         'cmp -l huge huge.3'
     Benchmark 1: cmp -l huge huge.3
       Time (mean ± σ):      3.237 s ±  0.014 s    [User: 2.891 s, System: 0.341 s]
       Range (min … max):    3.221 s …  3.271 s    10 runs
    
       Warning: Ignoring non-zero exit code.

Unoptimized diffutils - ~74% of the time

     > hyperfine --warmup 1 -i --output=pipe \
         '../target/release/diffutils cmp -l huge huge.3'
     Benchmark 1: ../target/release/diffutils cmp -l huge huge.3
       Time (mean ± σ):      2.392 s ±  0.009 s    [User: 1.978 s, System: 0.406 s]
       Range (min … max):    2.378 s …  2.406 s    10 runs
    
       Warning: Ignoring non-zero exit code.

Optimized diffutils - ~26% of the time

     > hyperfine --warmup 1 -i --output=pipe \
         '../target/release/diffutils cmp -l huge huge.3'
     Benchmark 1: ../target/release/diffutils cmp -l huge huge.3
       Time (mean ± σ):     849.5 ms ±   6.2 ms    [User: 538.3 ms, System: 306.8 ms]
       Range (min … max):   839.4 ms … 857.7 ms    10 runs
    
       Warning: Ignoring non-zero exit code.


codecov bot commented Sep 25, 2024

Codecov Report

Attention: Patch coverage is 96.01050% with 76 lines in your changes missing coverage. Please review.

Project coverage is 84.95%. Comparing base (9103365) to head (fac8dab).
Report is 11 commits behind head on main.

Files with missing lines   Patch %   Lines
src/cmp.rs                  93.13%   66 Missing ⚠️
tests/integration.rs        99.15%    6 Missing ⚠️
src/main.rs                 94.28%    2 Missing ⚠️
src/utils.rs                92.59%    2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #88      +/-   ##
==========================================
+ Coverage   81.01%   84.95%   +3.94%     
==========================================
  Files          10       12       +2     
  Lines        4245     5824    +1579     
  Branches      397      480      +83     
==========================================
+ Hits         3439     4948    +1509     
- Misses        806      856      +50     
- Partials        0       20      +20     
Flag             Coverage Δ
macos_latest     85.02% <96.25%> (+3.89%) ⬆️
ubuntu_latest    85.27% <96.41%> (+3.93%) ⬆️
windows_latest   22.93% <27.48%> (+1.71%) ⬆️

Flags with carried forward coverage won't be shown.


@sylvestre
Collaborator

Could you please run your benchmarks with hyperfine? The time command is too limited.

@kov
Author

kov commented Sep 25, 2024

Could you please run your benchmarks with hyperfine? The time command is too limited.

Will do! I have begun working on fuzzing support, fwiw; it needs some refactoring, as I currently use process::exit() as a shortcut in a lot of places.

@cakebaker cakebaker linked an issue Sep 26, 2024 that may be closed by this pull request
@kov
Author

kov commented Sep 26, 2024

Added the fuzz implementation to the first utility commit and ran it overnight with no issues spotted. Replaced my hand-crafted benchmarks with hyperfine output (and made sure the system was free of memory pressure / CPU interference).

I have tried to run the GNU tests, but the script fails here. I only changed it to also symlink diffutils to cmp and to run the cmp tests. Can anyone spot anything obvious before I start diving in to investigate?

kov@jabuticaba ~/P/diffutils (cmp)> ./tests/run-upstream-testsuite.sh
Fetching upstream test suite from https://git.savannah.gnu.org/git/diffutils.git
Running 31 tests
./tests/run-upstream-testsuite.sh: line 96: cd: gt-basic.*: No such file or directory
  basic                                    FAIL
./tests/run-upstream-testsuite.sh: line 96: cd: gt-bignum.*: No such file or directory
  bignum                                   FAIL
./tests/run-upstream-testsuite.sh: line 96: cd: gt-binary.*: No such file or directory
  binary                                   FAIL
./tests/run-upstream-testsuite.sh: line 96: cd: gt-brief-vs-stat-zero-kernel-lies.*: No such file or directory
  brief-vs-stat-zero-kernel-lies           FAIL
...
diff --git a/tests/run-upstream-testsuite.sh b/tests/run-upstream-testsuite.sh
index cb59834..f75b0b3 100755
--- a/tests/run-upstream-testsuite.sh
+++ b/tests/run-upstream-testsuite.sh
@@ -21,7 +21,7 @@
 # (e.g. 'dev' or 'test').
 # Unless overridden by the $TESTS environment variable, all tests in the test
 # suite will be run. Tests targeting a command that is not yet implemented
-# (e.g. cmp, diff3 or sdiff) are skipped.
+# (e.g. diff3 or sdiff) are skipped.
 
 scriptpath=$(dirname "$(readlink -f "$0")")
 rev=$(git rev-parse HEAD)
@@ -57,6 +57,7 @@ upstreamrev=$(git rev-parse HEAD)
 mkdir src
 cd src
 ln -s "$binary" diff
+ln -s "$binary" cmp
 cd ../tests
 
 if [[ -n "$TESTS" ]]
@@ -82,9 +83,9 @@ for test in $tests
 do
   result="FAIL"
   url="$urlroot$test?id=$upstreamrev"
-  # Run only the tests that invoke `diff`,
+  # Run only the tests that invoke `diff` or `cmp`,
   # because other binaries aren't implemented yet
-  if ! grep -E -s -q "(cmp|diff3|sdiff)" "$test"
+  if ! grep -E -s -q "(diff3|sdiff)" "$test"
   then
     sh "$test" 1> stdout.txt 2> stderr.txt && result="PASS" || exitcode=1
     json+="{\"test\":\"$test\",\"result\":\"$result\","

@oSoMoN
Collaborator

oSoMoN commented Sep 26, 2024

Clippy is reporting 9 trivial errors; could you address them to get the CI results green?

@oSoMoN
Collaborator

oSoMoN commented Sep 26, 2024

I have tried to run the GNU tests, but the script fails here. I only changed it to also symlink diffutils to cmp and to run the cmp tests. Can anyone spot anything obvious before I start diving in to investigate?

This is a regression that also affects main, so not introduced by your changes. I've filed #90 and am investigating it.

@oSoMoN
Collaborator

oSoMoN commented Sep 26, 2024

I have tried to run the GNU tests, but the script fails here. I only changed it to also symlink diffutils to cmp and to run the cmp tests. Can anyone spot anything obvious before I start diving in to investigate?

This is a regression that also affects main, so not introduced by your changes. I've filed #90 and am investigating it.

With the fix in #91, I ran the test suite on your branch, and I'm seeing the following differences:

@@ -3,9 +3,9 @@
   basic                                    PASS
   bignum                                   PASS
   binary                                   FAIL
-  brief-vs-stat-zero-kernel-lies           SKIP
+  brief-vs-stat-zero-kernel-lies           FAIL
   bug-64316                                PASS
-  cmp                                      SKIP
+  cmp                                      FAIL
   colliding-file-names                     FAIL
   diff3                                    SKIP
   excess-slash                             FAIL
@@ -30,9 +30,9 @@
   strip-trailing-cr                        FAIL
   timezone                                 PASS
   colors                                   FAIL
-  y2038-vs-32bit                           SKIP
+  y2038-vs-32bit                           PASS
 
-Summary: TOTAL: 31 / PASS: 6 / FAIL: 20 / SKIP: 5
+Summary: TOTAL: 31 / PASS: 7 / FAIL: 22 / SKIP: 2
 
 Results written to /home/osomon/build/uutils/diffutils/tests/test-results.json

The failure in the cmp test needs to be investigated for sure, but this looks promising.

This is in preparation for adding the other diffutils commands, cmp,
diff3, sdiff.

We use a similar strategy to uutils/coreutils, with the single binary
acting as one of the supported tools if called through a symlink with
the appropriate name. When using the multi-tool binary directly, the
utility needs to be the first parameter.
@kov
Author

kov commented Sep 27, 2024

Fixed the clippy complaints, and I'm quite close to getting the cmp test to pass locally. I am a bit confused, though, as my system's GNU cmp also doesn't pass the tests. I think the test infrastructure may not be sanitizing the locale environment properly, or there is something else on my Fedora system it doesn't like.

I got the failures in our tool down to a minimum, but I still hit these, which the GNU cmp built from git also hits. I wonder if there is a weird corner case on my system messing things up:

kov@jabuticaba ~/P/d/t/d/tests (master)> which cmp
/home/kov/.local/bin/cmp
kov@jabuticaba ~/P/d/t/d/tests (master)> ls -lh /home/kov/.local/bin/cmp
lrwxrwxrwx. 1 kov kov 51 set 26 22:11 /home/kov/.local/bin/cmp -> /home/kov/Projects/diffutils/target/debug/diffutils*
kov@jabuticaba ~/P/d/t/d/tests (master) [1]> cargo build; and env LANG=C ./cmp 
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.02s
LC_ALL=C cmp -b bad bug
bad bug differ: byte 2, line 1 is 141 a 165 u
cmp: invalid --ignore-initial value '99999999999999999999999999999999999999999999999999999999999'
cmp: invalid --bytes value '99999999999999999999999999999999999999999999999999999999999'
kov@jabuticaba ~/P/d/t/d/tests (master)> ln -fs ~/Projects/diffutils/t/cmp ~/.local/bin/cmp # this is GNU cmp built from the tree
kov@jabuticaba ~/P/d/t/d/tests (master)> env LC_ALL=C ./cmp
LC_ALL=C cmp -b bad bug
bad bug differ: byte 2, line 1 is 141 a 165 u
cmp: invalid --ignore-initial value '99999999999999999999999999999999999999999999999999999999999'
cmp: Try 'cmp --help' for more information.
cmp: invalid --ignore-initial value '1000'
cmp: Try 'cmp --help' for more information.

I checked with strace that the cmp being run was the right one, fwiw. The last error message from GNU cmp is a real wth xD

@oSoMoN
Collaborator

oSoMoN commented Sep 27, 2024

I checked with strace that the cmp being run was appropriate fwiw. The last error message from GNU cmp is a real wth xD

I ran LANG=C ./cmp from the tests directory in a local checkout of the upstream repository, and I'm not seeing the "invalid --ignore-initial value" error that you're observing. I do see it when using cmp built from your branch.

Note that the cmp test script prepends ../src to $PATH. Could it be that you had symlinked the wrong executable in there?
I'm seeing a different behaviour when I run that same test script against my system-wide /usr/bin/cmp installed from distro packages (using Ubuntu 24.04).

The following upstream commits seem relevant: 4ee8300 and 9c5fcbd (and they were added after the last upstream release).

@oSoMoN
Collaborator

oSoMoN commented Sep 27, 2024

Comparing further the output of the cmp test script, it looks like your implementation conforms to the released version of GNU cmp (version 3.10), i.e. it chokes on large numbers passed to -i/--ignore-initial.

We have two options here:

  • align your implementation with the latest unreleased changes in the GNU diffutils repository
  • change tests/run-upstream-testsuite.sh to check out a released version of the test suite, not the latest HEAD

I think that both approaches are equally interesting and valid (one favours the bleeding edge, the other compatibility with released versions), so feel free to pick whichever you prefer (I can help with updating the test script if you decide to go that way).

@kov
Author

kov commented Sep 27, 2024

Getting there! I went with updating the implementation to allow huge numbers; being forward-looking probably makes more sense =)

I just need to investigate this one now:

  • brief-vs-stat-zero-kernel-lies SKIP
  • brief-vs-stat-zero-kernel-lies FAIL

Will hopefully get to it this evening. Thanks for the help so far!

@oSoMoN
Collaborator

oSoMoN commented Sep 27, 2024

I just need to investigate this one now:

* brief-vs-stat-zero-kernel-lies           SKIP
* brief-vs-stat-zero-kernel-lies           FAIL

I looked into it, and it turns out it's an issue in the test suite runner, not in your code. I filed #92 to track it.

@kov
Author

kov commented Sep 28, 2024

I created another fuzz target for parameter parsing and left it running overnight with no issues; do you think having this target is also useful? I think the hand-crafted argument parsing is more likely to go wrong, now that I think about it.

cat fuzz/fuzz_targets/fuzz_cmp_args.rs
#![no_main]
#[macro_use]
extern crate libfuzzer_sys;
use diffutilslib::cmp;

use std::ffi::OsString;

fuzz_target!(|x: Vec<OsString>| {
    let _ = cmp::parse_params(x.into_iter().peekable());
});

Any thoughts on how we could reject corpus entries to make it more useful? I thought about rejecting inputs of certain sizes if they do not have any -x or --whatever, or at least one of the known parameters. Like, if x.len() > 4 it would need to have some known parameters, but that would quite often exclude inputs with over 4 positional parameters, I suppose. Could still be worth the tradeoff, though.

@oSoMoN
Collaborator

oSoMoN commented Sep 28, 2024

I created another fuzz target for parameter parsing and left it running overnight with no issues, do you think having this target is also useful? I think the hand-crafted argument parsing is more likely to go wrong, now that I think about it.

Yes, I think it is definitely useful. For the record, we implement argument parsing by hand because ready-made argument parsers like clap do not offer the flexibility we need to replicate the GNU diffutils arguments. We'd lose in compatibility what we would gain in code simplicity.

Any thoughts on how we could deny corpus entries to make it more useful? I thought about denying certain sizes if they do not have any -x or --whatever, or at least one of the known parameters. Like, if x.len() > 4 it needs to have some known parameters, but that excludes having over 4 positional parameters quite often, I suppose. Could be worth the tradeoff, still.

I don't have prior experience with this, but perhaps we could use a dictionary?

One additional thought: I see that you wrote integration tests for the new cmp command. Those are good and they provide decent code coverage, but I would also suggest writing unit tests for parse_params(…), trying to bump the coverage as much as possible. With the combination of unit tests, integration tests and targeted fuzzing we should have our backs covered.

@kov
Author

kov commented Sep 29, 2024

Added a dictionary, but ended up going with some corpus rejection as well, as using the dictionary requires passing extra parameters to cargo fuzz, which may be forgotten. The GitHub action looks like it would need to be split per target if we need to pass dictionaries there - which makes me realize I never added the fuzz target there; will fix that right now.

In addition to the param parsing I added yesterday, I also added a few more integration tests to cover more of the fast paths that had no coverage (some of them were covered by the GNU cmp tests, but good to have our own tests as well I suppose).

@kov
Author

kov commented Sep 29, 2024

I may need to set up a build on a Windows VM xD

Collaborator

@oSoMoN oSoMoN left a comment


This is looking really good, thanks for such a high-quality first contribution!
I have a handful of minor suggestions/questions, see inline.

The utility should support all the arguments supported by GNU cmp and
perform slightly better.

In a "bad" scenario, ~36M files which are completely different, our
version runs in ~72% of the time of the original on my M1 Max:

 > hyperfine --warmup 1 -i --output=pipe \
     'cmp -l huge huge.3'
 Benchmark 1: cmp -l huge huge.3
   Time (mean ± σ):      3.237 s ±  0.014 s    [User: 2.891 s, System: 0.341 s]
   Range (min … max):    3.221 s …  3.271 s    10 runs

   Warning: Ignoring non-zero exit code.

 > hyperfine --warmup 1 -i --output=pipe \
     '../target/release/diffutils cmp -l huge huge.3'
 Benchmark 1: ../target/release/diffutils cmp -l huge huge.3
   Time (mean ± σ):      2.392 s ±  0.009 s    [User: 1.978 s, System: 0.406 s]
   Range (min … max):    2.378 s …  2.406 s    10 runs

   Warning: Ignoring non-zero exit code.

Our cmp runs in ~116% of the time when comparing libxul.so to the
chromium-browser binary with -l and -b. In a best case scenario of
comparing 2 files which are the same except for the last byte, our
tool is slightly faster.
Octal conversion and simple integer-to-string conversion both show up in
profiling. This change improves comparing ~36M completely different files
with both -l and -b by ~11-13%.
This makes the code less readable, but gets us a massive improvement
to performance. Comparing ~36M completely different files now takes
~40% of the time. Compared to GNU cmp, we now run the same comparison
in ~26% of the time.

This also improves comparing binary files. A comparison of chromium
and libxul now takes ~60% of the time. We also beat GNU cmp by about
the same margin.

Before:

 > hyperfine --warmup 1 -i --output=pipe \
     '../target/release/diffutils cmp -l huge huge.3'
 Benchmark 1: ../target/release/diffutils cmp -l huge huge.3
   Time (mean ± σ):      2.000 s ±  0.016 s    [User: 1.603 s, System: 0.392 s]
   Range (min … max):    1.989 s …  2.043 s    10 runs

   Warning: Ignoring non-zero exit code.

 > hyperfine --warmup 1 -i --output=pipe \
     '../target/release/diffutils cmp -l -b \
     /usr/lib64/chromium-browser/chromium-browser \
     /usr/lib64/firefox/libxul.so'
 Benchmark 1: ../target/release/diffutils cmp -l -b /usr/lib64/chromium-browser/chromium-browser /usr/lib64/firefox/libxul.so
   Time (mean ± σ):     24.704 s ±  0.162 s    [User: 21.948 s, System: 2.700 s]
   Range (min … max):   24.359 s … 24.889 s    10 runs

   Warning: Ignoring non-zero exit code.

After:

 > hyperfine --warmup 1 -i --output=pipe \
     '../target/release/diffutils cmp -l huge huge.3'
 Benchmark 1: ../target/release/diffutils cmp -l huge huge.3
   Time (mean ± σ):     849.5 ms ±   6.2 ms    [User: 538.3 ms, System: 306.8 ms]
   Range (min … max):   839.4 ms … 857.7 ms    10 runs

   Warning: Ignoring non-zero exit code.

 > hyperfine --warmup 1 -i --output=pipe \
     '../target/release/diffutils cmp -l -b \
     /usr/lib64/chromium-browser/chromium-browser \
     /usr/lib64/firefox/libxul.so'
 Benchmark 1: ../target/release/diffutils cmp -l -b /usr/lib64/chromium-browser/chromium-browser /usr/lib64/firefox/libxul.so
   Time (mean ± σ):     14.646 s ±  0.040 s    [User: 12.328 s, System: 2.286 s]
   Range (min … max):   14.585 s … 14.702 s    10 runs

   Warning: Ignoring non-zero exit code.
Collaborator

@oSoMoN oSoMoN left a comment


This looks great now, thanks!

@oSoMoN oSoMoN merged commit 763074a into uutils:main Oct 1, 2024
27 checks passed
Development

Successfully merging this pull request may close these issues.

cmp command is not implemented