checksum: add FxHash and GxHash based checksum #58

Merged: 10 commits merged into parcio:main from the feat/hashing_fun branch on Apr 16, 2024


@fia0 fia0 commented Apr 11, 2024

This is just to test performance against the xxhash we've been using until now. Early measurements with 4M blobs suggest that it could be worth experimenting with fxhash (https://lib.rs/crates/rustc-hash), which is used in the Rust compiler. Another candidate is ahash (https://lib.rs/crates/ahash), which is used by hashbrown; however, ahash turned out to be too unstable.

@fia0 fia0 changed the title from "checksum:" to "checksum: Optimizing hashing functions" Apr 11, 2024
Johannes Wünsche added 2 commits April 12, 2024 13:53
This is just to test out the performance compared to the xxhash we've been
using until now.  Early measurements with 4M blobs have shown that it could be
worth experimenting with fxhash which is used in the rust compiler.
@fia0 fia0 force-pushed the feat/hashing_fun branch from 5672d8b to eef9c7d Compare April 12, 2024 11:54

fia0 commented Apr 12, 2024

Due to stability concerns, I've tested some of the hash implementations with a number of different compiler versions on different CPUs. Indeed, ahash is quite unstable, but fxhash remained remarkably stable. The tests covered ants epyc7000 (CentOS Stream 8), epyc8000 (CentOS Stream 8), epyc9000 (Rocky 9), a Xeon Gold 5220R (Ubuntu 20.04), a virtual CPU backed by a Xeon Skylake (Fedora 39), a virtual CPU backed by a Neoverse-N1 (Rocky 9), and my own workstation, an i7 13000 (Fedora 39). With fxhash, all of them produce the same hash for a given input under multiple stable and nightly compiler versions.
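
For anyone who wants to repeat this kind of stability check, here is a minimal sketch of the idea: hash a fixed input and compare the digest against a pinned reference value, then run the same binary on every CPU and toolchain of interest. It is not the test used for this PR. To stay free of external crates it uses FNV-1a (64-bit) as a stand-in for the hash under test, and the names `fnv1a_64` and `is_stable` are made up for illustration.

```rust
/// FNV-1a (64-bit), used here only as a stand-in with a fixed, published
/// definition. Swap in the hash under test (e.g. fxhash) for a real check.
fn fnv1a_64(bytes: &[u8]) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;
    bytes
        .iter()
        .fold(OFFSET_BASIS, |hash, &b| (hash ^ u64::from(b)).wrapping_mul(PRIME))
}

/// Hypothetical helper: true if the digest of `input` equals the pinned value.
fn is_stable(input: &[u8], pinned: u64) -> bool {
    fnv1a_64(input) == pinned
}

fn main() {
    // Pinned reference digests, recorded once and then compared on every
    // machine and compiler version.
    let references = [
        ("".as_bytes(), 0xcbf2_9ce4_8422_2325_u64),
        ("a".as_bytes(), 0xaf63_dc4c_8601_ec8c_u64),
    ];
    for (input, pinned) in references {
        assert!(is_stable(input, pinned), "digest drifted for {:?}", input);
    }
    println!("all digests match their pinned references");
}
```

A hash whose output depends on CPU features or on the crate version would fail such a check on some machines, which matters once checksums are persisted on disk.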


fia0 commented Apr 12, 2024

For a comprehensive overview refer to https://github.com/rurban/smhasher.

Johannes Wünsche added 3 commits April 12, 2024 16:56
Noticed this while grepping for XxHash, seemed to have evaded the cleaning
process some generations ago.
@fia0 fia0 marked this pull request as ready for review April 12, 2024 15:45
@fia0 fia0 changed the title from "checksum: Optimizing hashing functions" to "checksum: add FxHash-based checksum" Apr 12, 2024
@fia0 fia0 changed the title from "checksum: add FxHash-based checksum" to "checksum: add FxHash and GxHash based checksum" Apr 15, 2024

fia0 commented Apr 15, 2024

Numbers from one of the EPYC 9000 servers (hashing a randomly filled 4MiB slice):

GxHash:
	Minimum: 45441 ns
	50%: 46400 ns
	95%: 47980 ns
	Maximum: 53160 ns
ahash:
	Minimum: 123610 ns
	50%: 126410 ns
	95%: 128561 ns
	Maximum: 135941 ns
XxHash:
	Minimum: 217851 ns
	50%: 221981 ns
	95%: 226151 ns
	Maximum: 246701 ns
Highway:
	Minimum: 320102 ns
	50%: 324492 ns
	95%: 327932 ns
	Maximum: 344701 ns
FxHash:
	Minimum: 708773 ns
	50%: 710374 ns
	95%: 711574 ns
	Maximum: 727214 ns
ZwoHash:
	Minimum: 708794 ns
	50%: 714243 ns
	95%: 717874 ns
	Maximum: 728874 ns
Murmur3:
	Minimum: 3144406 ns
	50%: 3149066 ns
	95%: 3153416 ns
	Maximum: 3623948 ns
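
For reference, numbers of this shape (minimum, median, 95th percentile, maximum) can be collected with a small harness like the sketch below. This is not the benchmark that produced the figures above: it hashes with the standard library's `DefaultHasher` as a placeholder, fills the buffer with a simple xorshift generator, and the helper names `percentile`, `measure` and `throughput_gib_per_s` are made up for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;
use std::time::Instant;

/// Nearest-rank percentile over an ascending-sorted, non-empty slice.
fn percentile(sorted: &[u128], p: f64) -> u128 {
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    sorted[rank.clamp(1, sorted.len()) - 1]
}

/// Times `runs` calls of `hash_fn` over `data`; returns sorted nanoseconds.
fn measure<F: FnMut(&[u8]) -> u64>(data: &[u8], runs: usize, mut hash_fn: F) -> Vec<u128> {
    let mut samples = Vec::with_capacity(runs);
    for _ in 0..runs {
        let start = Instant::now();
        let digest = hash_fn(data);
        let elapsed = start.elapsed().as_nanos();
        // Keep the optimizer from discarding the hash computation.
        std::hint::black_box(digest);
        samples.push(elapsed);
    }
    samples.sort_unstable();
    samples
}

/// Converts a byte count and a duration in nanoseconds to GiB/s.
fn throughput_gib_per_s(bytes: usize, nanos: u128) -> f64 {
    (bytes as f64 / (1u64 << 30) as f64) / (nanos as f64 * 1e-9)
}

fn main() {
    // Pseudo-randomly filled 4 MiB buffer (xorshift64, fixed seed).
    let mut state: u64 = 0x9E37_79B9_7F4A_7C15;
    let data: Vec<u8> = (0..4 * 1024 * 1024)
        .map(|_| {
            state ^= state << 13;
            state ^= state >> 7;
            state ^= state << 17;
            (state >> 32) as u8
        })
        .collect();

    // Placeholder hash; replace the closure body with the hasher under test.
    let samples = measure(&data, 50, |buf| {
        let mut hasher = DefaultHasher::new();
        hasher.write(buf);
        hasher.finish()
    });

    println!("Minimum: {} ns", samples[0]);
    println!("50%: {} ns", percentile(&samples, 50.0));
    println!("95%: {} ns", percentile(&samples, 95.0));
    println!("Maximum: {} ns", samples[samples.len() - 1]);
    println!(
        "Median throughput: {:.2} GiB/s",
        throughput_gib_per_s(data.len(), percentile(&samples, 50.0))
    );
}
```

Applied to the figures above, a 46400 ns median for 4 MiB corresponds to roughly 84 GiB/s for GxHash, and 221981 ns to roughly 17.6 GiB/s for XxHash.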

This commit required modifying the build context to allow for the AES
optimizations of GxHash. It should not prove to be an issue on the system we
use (x86-64 and maybe ARM64) which I've tested before this commit.
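
To check whether a given build and machine can use the AES path at all, a sketch like the following may help. It only uses standard-library feature detection. The exact flags this commit adds to the build context are not shown here, and the function names are made up for illustration.

```rust
/// True if the binary was compiled with the `aes` target feature enabled,
/// e.g. via `-C target-feature=+aes` or `-C target-cpu=native`.
fn aes_compiled_in() -> bool {
    cfg!(target_feature = "aes")
}

/// True if the CPU we are running on reports AES support (x86-64).
#[cfg(target_arch = "x86_64")]
fn aes_detected() -> bool {
    std::arch::is_x86_feature_detected!("aes")
}

/// True if the CPU we are running on reports AES support (ARM64).
#[cfg(target_arch = "aarch64")]
fn aes_detected() -> bool {
    std::arch::is_aarch64_feature_detected!("aes")
}

/// Other architectures: report no AES support in this sketch.
#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
fn aes_detected() -> bool {
    false
}

fn main() {
    println!("AES enabled at compile time: {}", aes_compiled_in());
    println!("AES detected at run time:    {}", aes_detected());
}
```

If the first line prints `false`, the build flags did not enable AES; if only the second prints `false`, the binary was built for a CPU that the current machine does not match.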

fia0 commented Apr 15, 2024

The PR should be mostly finished now; I'll run some tests and then merge it. Of the hashing implementations now in the stack, the performance order seems to be: GxHash > XxHash ~ FxHash. The latter two really are system-dependent: on some systems in the testbed XxHash outperformed FxHash by 2x, and on others FxHash was somewhat faster than XxHash. Overall, though, GxHash was the fastest on every one of these machines.


fia0 commented Apr 15, 2024

The picture is similar for Highway, whose performance is mixed overall and depends on the system.


fia0 commented Apr 16, 2024

After some benchmarking I can confirm the performance difference in Haura as well. The differences are most notable when cache pressure is highest, for example on point queries. The figure below shows the performance scaling with XxHash in blue and GxHash in orange.

[Figure: ycsb_c_comparison]


fia0 commented Apr 16, 2024

Based on the benchmark results, and since no existing Haura deployment stores anything beyond ephemeral development data, I took the liberty of making gxhash the default checksum. The gxhash version is pinned to v3.1.1.

@fia0 fia0 merged commit 585b5fa into parcio:main Apr 16, 2024
8 checks passed