fix deflate performance #18
Comments
As a data point, commit 8048180
#223 should have helped a fair bit.
Yes, but there is still a significant gap (note that that benchmark runs
In case it's helpful for anyone else, here's a CPU profile of the current implementation. Methodology:
Notes on the profile results:
That corresponds well with what we've been seeing.
You're measuring instructions, right, not CPU time? Or maybe I'm misinterpreting? Anyway, if it is just instructions, this might be a bit misleading because of instruction-level parallelism.
Yeah, I've tried to turn those tables into integer constants and use bit shifting/masking before, but that did not help. I think I only tried it with SIMD registers though; maybe using standard registers would work? It really seems like something should be possible here, but I've not had much luck so far.
Besides what you wrote, the read is basically a hashmap lookup, so it is quite likely that the memory you want is not in (L1) cache.
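To illustrate: the match finder's candidate read is effectively a probe into a large table indexed by a hash of the next few bytes, so the access pattern is close to random. A schematic sketch (constants, field names, and the hash function are illustrative, not zlib-rs's actual code):
// Schematic sketch of deflate-style match finding; everything here is
// illustrative, not the actual zlib-rs implementation.
const HASH_BITS: usize = 15;
const HASH_SIZE: usize = 1 << HASH_BITS;
const WINDOW_SIZE: usize = 1 << 15;

struct MatchFinder {
    window: Vec<u8>, // sliding window of recent input
    head: Vec<u16>,  // hash -> most recent window position with that hash
    prev: Vec<u16>,  // window position -> previous position with the same hash
}

impl MatchFinder {
    fn insert_and_lookup(&mut self, pos: usize) -> u16 {
        let w = &self.window;
        // Hash the next three bytes (the minimum match length).
        let h = (((w[pos] as usize) << 10) ^ ((w[pos + 1] as usize) << 5) ^ (w[pos + 2] as usize))
            & (HASH_SIZE - 1);
        // This load is the "hashmap lookup": `head` is tens of KiB and `h` is
        // essentially random, so it frequently misses the L1 cache.
        let candidate = self.head[h];
        self.prev[pos & (WINDOW_SIZE - 1)] = candidate;
        self.head[h] = pos as u16;
        candidate
    }
}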
When I collect the profile as CPU cycles instead of instructions, the
Just want to document this for the future. I tried replacing this table:
#[rustfmt::skip]
pub const LENGTH_CODE: [u8; STD_MAX_MATCH-STD_MIN_MATCH+1] = [
0, 1, 2, 3, 4, 5, 6, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 12, 12,
13, 13, 13, 13, 14, 14, 14, 14, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16, 16, 16,
17, 17, 17, 17, 17, 17, 17, 17, 18, 18, 18, 18, 18, 18, 18, 18, 19, 19, 19, 19,
19, 19, 19, 19, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20,
21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 22, 22, 22, 22,
22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 23, 23, 23, 23, 23, 23, 23, 23,
23, 23, 23, 23, 23, 23, 23, 23, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25,
25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 26, 26, 26, 26, 26, 26, 26, 26,
26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26,
26, 26, 26, 26, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,
27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 28
];
with a function:
#[inline(always)]
pub const fn length_code(i: u8) -> u8 {
    match i {
        0..8 => i,
        255 => 28,
        _ => {
            // let log2 = i.ilog(2) // this is `7 - i.leading_zeros()`
            // ((log2 as u8 - 1) << 2) | ((i >> (log2 - 2)) & 0b11)
            let a = (6 - i.leading_zeros() as u8) << 2;
            let b = (i >> (5u8.saturating_sub(i.leading_zeros() as u8))) & 0b11;
            a | b
        }
    }
}
Yes, that is inscrutable and cursed, and took forever to figure out. But it actually replicates the table. Sadly it still generates too many instructions to be worth it.
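For reference, a quick way to validate such a replacement is an exhaustive comparison against the table. A minimal sketch, assuming LENGTH_CODE and length_code from the snippets above are in scope:
// Exhaustive check: the function must agree with the table for every index.
#[test]
fn length_code_matches_table() {
    for i in 0..=u8::MAX {
        assert_eq!(length_code(i), LENGTH_CODE[i as usize], "mismatch at {i}");
    }
}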
We are consistently ~10% slower than zlib-ng on deflate. This fluctuates with the different compression levels, but currently none of them are on par with zlib-ng.
There is so far no obvious reason for this slowdown, so it's likely a "death by a thousand papercuts" sort of thing.
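For context, a minimal sketch of how such a throughput comparison can be timed from Rust via the flate2 crate, which can select different zlib backends through its cargo features. This is only an illustration under assumptions (the input path is hypothetical), not the project's actual benchmark harness:
// Illustrative timing harness, not the project's actual benchmark.
use std::io::Write;
use std::time::{Duration, Instant};

use flate2::write::DeflateEncoder;
use flate2::Compression;

fn time_deflate(data: &[u8], level: u32) -> std::io::Result<Duration> {
    let start = Instant::now();
    let mut encoder = DeflateEncoder::new(Vec::new(), Compression::new(level));
    encoder.write_all(data)?;
    let _compressed = encoder.finish()?;
    Ok(start.elapsed())
}

fn main() -> std::io::Result<()> {
    // Hypothetical input file; substitute whatever corpus the benchmarks use.
    let data = std::fs::read("benches/input/silesia-small.tar")?;
    for level in [1, 6, 9] {
        println!("level {level}: {:?}", time_deflate(&data, level)?);
    }
    Ok(())
}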