Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generalize zarr_checksum into folder_checksum with a set of format-specific (e.g. zarr) adapters? #50

Open
yarikoptic opened this issue Jan 3, 2024 · 2 comments

Comments

@yarikoptic
Copy link
Member

Inspired by the discussion/analysis in https://github.com/dandi/dandi-cli/pull/1371/files#r1420705289.

zarr_checksum seems to be not zarr specific at all! It just estimates a checksum over a hierarchy of directories/files given a "walker" generator thus could be generalized for any hierarchy checksumming given a "walker". Moreover it is already implemented that way pretty much given the FileGenerator interface and yield_files_s3 and yield_files_local implementations. There is even now

Then it would become relevant to fscache where we also do folders checksumming , and somehow threaded implementation ended up slowing things down: https://github.com/con/fscacher/pull/67/files .

@jwodder
Copy link
Member

jwodder commented Jan 4, 2024

What would be the point of generalizing the code? What do you want to use it for that requires generalization?

If you want to use this for fscacher, then (assuming the code is generalized to the point that we can reproduce fscacher's current directory-fingerprinting algorithm with it), what would we gain from doing so?

This just seems like a big YAGNI to me.

@yarikoptic
Copy link
Member Author

You might as well be right @jwodder. As you know I dislike code duplication (duplication of effort/bugs/...), and this feels like the right target for the reason that we need directories fingerprinting now in multiple cases (at least for zarr and fscacher ATM), and all of them could benefit from a clean and efficient "traverse + fingerprint" implementation.

Initial motivation was also simply realization that there is really not much (if anything) of zarr specific (hence #41) in this library, just our custom way to get checksum over a hierarchy which is reminiscent of AWS Digest.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants