Feature Request: Add more hashing algorithms to mergerfs.dedup #147

Open
donmor opened this issue Apr 23, 2024 · 4 comments

donmor commented Apr 23, 2024

Add an option, -H, --hashing-algorithm=, to be used along with -i, --ignore. This would allow faster algorithms like CRC32, safer ones like SHA256, or several algorithms in turn (skipping the later ones if an earlier one already differs).
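
To illustrate the "multiple algorithms in turn" idea, here is a minimal Python sketch. It is not taken from #148; the function names `file_digest` and `same_content` and the default algorithm order are assumptions made for the example.

```python
import hashlib
import zlib


def file_digest(path, algorithm, blocksize=65536):
    """Return a hex digest of the whole file for the given algorithm name."""
    if algorithm == "crc32":
        # zlib.crc32 accepts a running value, so the file can be read in blocks.
        crc = 0
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(blocksize), b""):
                crc = zlib.crc32(block, crc)
        return format(crc & 0xFFFFFFFF, "08x")
    hasher = hashlib.new(algorithm)  # e.g. "md5", "sha1", "sha256"
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            hasher.update(block)
    return hasher.hexdigest()


def same_content(path_a, path_b, algorithms=("crc32", "sha256")):
    """Compare two files with each algorithm in turn, stopping at the first mismatch."""
    for algorithm in algorithms:
        if file_digest(path_a, algorithm) != file_digest(path_b, algorithm):
            return False  # an earlier hash already differs, so skip the later ones
    return True
```

Note that each algorithm in this sketch re-reads both files, so on an IO-bound workload the cascade only pays off when most pairs are rejected by the first, cheaper hash.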

donmor commented Apr 23, 2024

#148 is an implementation.

trapexit (Owner) commented

The speed of a hash function is rarely an issue. The tool is IO bound most of the time. Have you done any benchmarking?

donmor commented Apr 23, 2024

I'll do that later.

donmor commented Apr 24, 2024

Made some modifications to #148 that make same-hash much faster by calling short_hashes_all before hashing each file.

Before:

$ time mergerfs.dedup -v --ignore=same-hash /tmp/C
rm -vf /tmp/B/2
rm -vf /tmp/B/4
rm -vf /tmp/B/5
rm -vf /tmp/B/6
rm -vf /tmp/B/7
rm -vf /tmp/B/8
rm -vf /tmp/B/A
rm -vf /tmp/B/C
# Total savings: 2.6GB

real    0m14.265s
user    0m13.363s
sys     0m0.900s

After:

$ time mergerfs.dedup -v --ignore=same-hash /tmp/C
rm -vf /tmp/B/2
rm -vf /tmp/B/4
rm -vf /tmp/B/5
rm -vf /tmp/B/6
rm -vf /tmp/B/7
rm -vf /tmp/B/8
rm -vf /tmp/B/A
rm -vf /tmp/B/C
# Total savings: 2.6GB

real    0m6.724s
user    0m6.286s
sys     0m0.432s
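
The speedup comes from using a cheap sampled hash as a pre-filter, so that a file is read in full only when its sample collides with another file's. The following Python sketch shows that general technique. It is not the code from #148 or from mergerfs.dedup: `short_hash`, `full_hash`, `duplicate_groups`, the block counts, and the size-seeded sampling are all assumptions made for illustration.

```python
import hashlib
import os
import random
from collections import defaultdict


def short_hash(path, blocksize=65536, blocks=16):
    """Hash a handful of blocks instead of the whole file (hypothetical sampler)."""
    hasher = hashlib.md5()
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        if size <= blocksize * blocks:
            hasher.update(f.read())  # small file: hash everything
        else:
            # Seed with the size so equally sized files are sampled at the same offsets.
            rng = random.Random(size)
            for _ in range(blocks):
                f.seek(rng.randrange(size - blocksize + 1))
                hasher.update(f.read(blocksize))
    return hasher.hexdigest()


def full_hash(path, blocksize=65536):
    """Hash the entire file."""
    hasher = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            hasher.update(block)
    return hasher.hexdigest()


def duplicate_groups(paths):
    """Group identical files, reading a file fully only if its short hash collides."""
    by_short = defaultdict(list)
    for path in paths:
        by_short[(os.path.getsize(path), short_hash(path))].append(path)

    groups = []
    for candidates in by_short.values():
        if len(candidates) < 2:
            continue  # unique size and short hash: no full read needed
        by_full = defaultdict(list)
        for path in candidates:
            by_full[full_hash(path)].append(path)
        groups.extend(g for g in by_full.values() if len(g) > 1)
    return groups
```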

MD5 and SHA1 are considered unsafe, so SHA256 can be used instead (slower):

$ time mergerfs.dedup -v --ignore=same-hash --hash=sha256 /tmp/C
rm -vf /tmp/B/2
rm -vf /tmp/B/4
rm -vf /tmp/B/5
rm -vf /tmp/B/6
rm -vf /tmp/B/7
rm -vf /tmp/B/8
rm -vf /tmp/B/A
rm -vf /tmp/B/C
# Total savings: 2.6GB

real    0m16.079s
user    0m15.569s
sys     0m0.500s

Sometimes only a few bits in a file are corrupted, so the corruption can slip past the random sampling done by short_hash_file. In that case --hash=crc32 can be specified before --hash=sha256 to speed things up.
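
A small in-memory Python sketch of that failure mode: a single flipped bit that falls outside every sampled block leaves a sampled hash unchanged, while a full-file CRC32 still detects it. The sampler here (`sampled_offsets`, `sampled_hash`, the block size and count) is a hypothetical stand-in, not the actual short_hash_file.

```python
import hashlib
import random
import zlib

BLOCK = 4096   # bytes per sampled block (assumed)
BLOCKS = 4     # number of sampled blocks (assumed)


def sampled_offsets(size):
    """Pick sample offsets, seeded by size so both copies sample the same places."""
    rng = random.Random(size)
    return [rng.randrange(size - BLOCK + 1) for _ in range(BLOCKS)]


def sampled_hash(data):
    """Hash only the sampled blocks of the data."""
    hasher = hashlib.md5()
    for offset in sampled_offsets(len(data)):
        hasher.update(data[offset:offset + BLOCK])
    return hasher.hexdigest()


original = bytes(1024 * 1024)  # 1 MiB of zeros

# Find a byte position that no sampled block covers.
covered = set()
for offset in sampled_offsets(len(original)):
    covered.update(range(offset, offset + BLOCK))
victim = next(i for i in range(len(original)) if i not in covered)

corrupted = bytearray(original)
corrupted[victim] ^= 0x01  # flip a single bit outside every sampled block
corrupted = bytes(corrupted)

print(sampled_hash(original) == sampled_hash(corrupted))  # True: sampling misses the flip
print(zlib.crc32(original) == zlib.crc32(corrupted))      # False: CRC32 over the whole data catches it
```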
