
fix deflate performance #18

Open · folkertdev opened this issue Feb 1, 2024 · 7 comments

@folkertdev (Collaborator)

We are consistently ~10% slower than zlib-ng on deflate. This fluctuates with the different compression levels, but currently none of them are on par with zlib-ng.

There is so far no obvious reason for this slowdown, so it's likely a "death by a thousand papercuts" sort of thing.

@folkertdev (Collaborator, Author)

As a data point, at commit 8048180:

Benchmark 1 (29 runs): cargo run --release --example compress 1 ng silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           175ms ± 8.58ms     168ms …  217ms          1 ( 3%)        0%
  peak_rss           26.8MB ± 87.3KB    26.6MB … 27.0MB          0 ( 0%)        0%
  cpu_cycles          515M  ± 6.88M      505M  …  529M           0 ( 0%)        0%
  instructions        744M  ± 35.3K      744M  …  744M           1 ( 3%)        0%
  cache_references   11.3M  ± 1.18M     9.08M  … 14.1M           0 ( 0%)        0%
  cache_misses       2.25M  ± 66.4K     2.12M  … 2.40M           1 ( 3%)        0%
  branch_misses      4.13M  ± 15.1K     4.10M  … 4.17M           0 ( 0%)        0%
Benchmark 2 (27 runs): cargo run --release --example compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           191ms ± 4.14ms     187ms …  207ms          1 ( 4%)        💩+  9.2% ±  2.1%
  peak_rss           26.7MB ± 93.5KB    26.6MB … 26.9MB          0 ( 0%)          -  0.3% ±  0.2%
  cpu_cycles          580M  ± 10.7M      570M  …  619M           1 ( 4%)        💩+ 12.7% ±  0.9%
  instructions        873M  ± 31.4K      873M  …  873M           0 ( 0%)        💩+ 17.3% ±  0.0%
  cache_references   11.4M  ± 1.16M     9.31M  … 13.6M           0 ( 0%)          +  0.9% ±  5.6%
  cache_misses       2.33M  ± 74.4K     2.22M  … 2.49M           0 ( 0%)        💩+  3.6% ±  1.7%
  branch_misses      4.39M  ± 42.1K     4.34M  … 4.52M           1 ( 4%)        💩+  6.2% ±  0.4%

@bjorn3 (Collaborator) commented Oct 14, 2024

#223 should have helped a fair bit.

@folkertdev (Collaborator, Author)

Yes, but there is still a significant gap (note that the earlier benchmark ran through cargo run, which is why its numbers are higher):

Benchmark 1 (68 runs): target/release/examples/blogpost-compress 1 ng silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          73.9ms ± 1.02ms    72.7ms … 80.2ms          3 ( 4%)        0%
  peak_rss           26.6MB ± 42.5KB    26.5MB … 26.6MB          8 (12%)        0%
  cpu_cycles          264M  ± 1.79M      261M  …  270M           2 ( 3%)        0%
  instructions        460M  ±  266       460M  …  460M           1 ( 1%)        0%
  cache_references   18.9M  ±  233K     18.5M  … 19.7M           5 ( 7%)        0%
  cache_misses        507K  ± 85.4K      408K  …  949K           2 ( 3%)        0%
  branch_misses      3.32M  ± 5.96K     3.30M  … 3.33M           0 ( 0%)        0%
Benchmark 2 (64 runs): target/release/examples/blogpost-compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          78.6ms ±  990us    77.4ms … 83.3ms          5 ( 8%)        💩+  6.5% ±  0.5%
  peak_rss           26.7MB ± 65.8KB    26.6MB … 26.7MB          0 ( 0%)          +  0.3% ±  0.1%
  cpu_cycles          287M  ± 4.11M      284M  …  310M           9 (14%)        💩+  8.8% ±  0.4%
  instructions        633M  ±  276       633M  …  633M           0 ( 0%)        💩+ 37.7% ±  0.0%
  cache_references   19.9M  ±  200K     19.7M  … 20.6M           6 ( 9%)        💩+  5.5% ±  0.4%
  cache_misses        440K  ± 95.8K      327K  …  778K           1 ( 2%)        ⚡- 13.3% ±  6.1%
  branch_misses      3.08M  ± 10.8K     3.06M  … 3.11M           9 (14%)        ⚡-  7.3% ±  0.1%

@brian-pane

In case it's helpful for anyone else, here's a CPU profile of the current main. I used https://github.com/mstange/samply to visualize it, since that tool shows the cost of inlined functions. (Sorry for sharing this in the form of a screenshot. I haven't yet found a way to export the output of the tool as plain text.)

Methodology:

perf record -F max -e instructions -o perf-rs.data target/release/examples/compress 1 rs silesia-small.tar
samply import perf-rs.data

Notes on the profile results:

  • A surprisingly large amount of CPU time is spent turning the WeakSlices into temporary slices (all the WeakSliceMut functions, plus Window::filled). But I looked at the generated assembly for these functions, and there's nothing surprising there: it's just doing a pair of 64-bit read operations to fetch the pointer and length fields from the appropriate structs (a sketch of the pattern follows below).
  • BitWriter::emit_dist takes a lot of time, but that seems to be because of all the table lookups it needs to do. I haven't found any obvious ways to speed that up.
  • Crc32HashCalc::quick_insert_string is a hotspot. From a source line level profile, though, almost all the "self" time of that function seems to be spent in these two lines, which may be tricky to improve because they're just doing a straightforward array read and an unavoidable conditional branch:
        let head = state.head.as_slice()[hm];
        if head != string as u16 {

[screenshot: samply profile of the compress run]
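
For reference, the conversion mentioned in the first point is essentially of this shape (a minimal sketch only; the field names and the u8 element type are assumptions, not the actual zlib-rs WeakSliceMut definition):

use core::ptr::NonNull;

// Hypothetical stand-in for a weak (non-owning) slice: a raw pointer plus a length.
struct WeakSliceMut {
    ptr: NonNull<u8>,
    len: usize,
}

impl WeakSliceMut {
    // Rebuilding the &mut [u8] is just two 64-bit loads (ptr and len) followed by
    // constructing the slice metadata, which matches what the assembly shows.
    unsafe fn as_mut_slice(&mut self) -> &mut [u8] {
        // SAFETY: the caller guarantees ptr/len describe a live, exclusive allocation.
        unsafe { core::slice::from_raw_parts_mut(self.ptr.as_ptr(), self.len) }
    }
}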

@folkertdev (Collaborator, Author)

That corresponds well with what we've been seeing.

> A surprisingly large amount of CPU time is spent turning the WeakSlices into temporary slices

You're measuring instructions, right, not CPU time? Or maybe I'm misinterpreting. Anyway, if it is just instructions, this might be a bit misleading because of instruction-level parallelism.

> BitWriter::emit_dist takes a lot of time

Yeah, I've tried turning those tables into integer constants and using bit shifting/masking before, but that did not help. I think I only tried it with SIMD registers though; maybe using standard registers would work?

It really seems like something should be possible here, but I've not had much luck so far.
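
As a rough illustration of what that could look like with standard registers, the distance code can in principle be computed with leading_zeros instead of the two-level table (a sketch only, following the standard deflate distance-code layout; this is not the zlib-rs implementation and hasn't been benchmarked here):

// Hypothetical sketch: distance code from a match distance (1..=32768) via bit math.
// Codes 0..=3 cover distances 1..=4; after that there are two codes per power-of-two range.
fn dist_code_bits(dist: u16) -> u8 {
    let d = (dist - 1) as u32;
    if d < 4 {
        d as u8
    } else {
        let log2 = 31 - d.leading_zeros();          // index of the highest set bit of d
        (2 * log2 + ((d >> (log2 - 1)) & 1)) as u8  // two codes per range, picked by the next bit down
    }
}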

> Crc32HashCalc::quick_insert_string is a hotspot

Besides what you wrote, that read is basically a hashmap lookup, so it is quite likely that the memory you want is not in (L1) cache.
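
To make that concrete, the access pattern is roughly the following (names and sizes are made up for the sketch, not the zlib-rs internals):

// The head table maps a hash of the next few input bytes to the most recent
// window position that hashed to the same value. The hash is effectively random,
// so every lookup touches an unpredictable cache line in a table (plus window)
// that is larger than a typical L1d, and a miss is the common case.
const HASH_SIZE: usize = 1 << 15; // 32 Ki entries of u16 = 64 KiB

fn insert_and_probe(head: &mut [u16], hash: usize, pos: u16) -> u16 {
    let idx = hash & (HASH_SIZE - 1);
    let prev = head[idx]; // likely cache miss: the index is effectively random
    head[idx] = pos;      // record the new most-recent position for this hash
    prev                  // the caller compares prev against the current position
}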

@brian-pane

When I collect the profile as CPU cycles instead of instructions, the WeakSlice methods and Window::filled are a smaller percentage, but still collectively over 10% of the total.

@folkertdev (Collaborator, Author)

Just want to document this for the future.

I tried replacing this table:

#[rustfmt::skip]
pub const LENGTH_CODE: [u8; STD_MAX_MATCH-STD_MIN_MATCH+1] = [
     0,  1,  2,  3,  4,  5,  6,  7,  8,  8,  9,  9, 10, 10, 11, 11, 12, 12, 12, 12,
    13, 13, 13, 13, 14, 14, 14, 14, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16, 16, 16,
    17, 17, 17, 17, 17, 17, 17, 17, 18, 18, 18, 18, 18, 18, 18, 18, 19, 19, 19, 19,
    19, 19, 19, 19, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20,
    21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 22, 22, 22, 22,
    22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 23, 23, 23, 23, 23, 23, 23, 23,
    23, 23, 23, 23, 23, 23, 23, 23, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
    24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
    25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25,
    25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 26, 26, 26, 26, 26, 26, 26, 26,
    26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26,
    26, 26, 26, 26, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,
    27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 28
];

with a function:

#[inline(always)]
pub const fn length_code(i: u8) -> u8 {
    match i {
        0..8 => i,
        255 => 28,
        _ => {
            // let log2 = i.ilog(2) // this is `7 - i.leading_zeros()`
            // ((log2 as u8 - 1) << 2) | ((i >> (log2 - 2)) & 0b11)
            let a = (6 - i.leading_zeros() as u8) << 2;
            let b = (i >> (5u8.saturating_sub(i.leading_zeros() as u8))) & 0b11;

            a | b
        }
    }
}

Yes, that is inscrutable and cursed, and it took forever to figure out. But it does replicate the table exactly. Sadly, it still generates too many instructions to be worth it.
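
For anyone revisiting this, the equivalence is easy to check exhaustively (a sketch; it assumes LENGTH_CODE and length_code are both in scope):

#[test]
fn length_code_matches_table() {
    // LENGTH_CODE is indexed by (match length - STD_MIN_MATCH), i.e. 0..=255.
    for i in 0..=u8::MAX {
        assert_eq!(length_code(i), LENGTH_CODE[i as usize], "mismatch at index {i}");
    }
}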
