Commit
Refactor of MinHash to work with a single class and fix the shelve backend (#937)

* Initial work for minhash
* Add minhash step redirect
* Add first version of minhash and minhashlsh
* Add unit tests for minhash dedup
* Add pipeline testing deduplication
* Add tests to run with disk backend
* Add tests for the disk and ensure unload
* Add private _datasketch module to include a custom storage configuration for the minhash index
* Add docstrings to the internal classes/functions
* Add docstrings for the user facing classes
* Update src/distilabel/steps/filtering/minhash.py
  Co-authored-by: Gabriel Martín Blázquez <[email protected]>
* Update src/distilabel/steps/filtering/minhash.py
  Co-authored-by: Gabriel Martín Blázquez <[email protected]>
* Update tests/integration/test_deduplication.py
  Co-authored-by: Gabriel Martín Blázquez <[email protected]>
* Update src/distilabel/steps/filtering/minhash.py
  Co-authored-by: Gabriel Martín Blázquez <[email protected]>
* Update src/distilabel/steps/filtering/minhash.py
  Co-authored-by: Gabriel Martín Blázquez <[email protected]>
* Add installation dependencies
* Apply comments from code review
* Add nltk as a dependency for the tests
* Update tests and interpretation of keep rows vs duplicates
* Remove disk backend from tests temporarily
* Add note in the docs related to minhash storage on disk
* Update tests to run on dict instead of disk as it never ends on CI
* Fix integration test
* Hide import inside of function to avoid installing it on docs building
* Update command to download nltk
* Allow for a name in the shelve based backend to avoid overwrites
* Refactor MinHash to use a single MinHashDedup class that controls all the process
* Refactor tests to use the new class
* Redirect import to steps level
* Create new disk based storage using diskcache
* Add docstrings to clarify the difference between dict/disk
* Refactor to use diskcache
* Fix docstring example
* Update src/distilabel/steps/filtering/minhash.py
  Co-authored-by: Gabriel Martín Blázquez <[email protected]>
* Update definition of the step

---------

Co-authored-by: Gabriel Martín Blázquez <[email protected]>
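
For context on what the step in this commit does, here is a minimal, standard-library-only sketch of MinHash near-duplicate detection. It is not the MinHashDedup implementation from this commit: the function names (`shingles`, `minhash_signature`, `estimated_jaccard`, `keep_flags`), the 128 hash functions, the word 3-gram tokenizer, and the 0.9 threshold are all assumptions chosen for illustration. It also swaps the LSH index for a plain linear scan over kept signatures, and an in-memory list stands in for the dict/disk storage mentioned above.

```python
import hashlib
import re

NUM_PERM = 128  # assumed number of hash functions per signature


def shingles(text, n=3):
    """Return the set of word n-grams of a text (illustrative tokenizer)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash(token, seed):
    """64-bit hash of a token, salted with the index of the hash function."""
    digest = hashlib.blake2b(
        token.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(tokens, num_perm=NUM_PERM):
    """For each salted hash function, keep the minimum hash over all tokens."""
    return [min(_hash(tok, seed) for tok in tokens) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions, which estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def keep_flags(texts, threshold=0.9):
    """Return one boolean per text: True to keep the row, False if it is a
    near-duplicate of an earlier kept row (linear scan, no LSH index)."""
    kept_signatures = []
    flags = []
    for text in texts:
        sig = minhash_signature(shingles(text))
        is_dup = any(
            estimated_jaccard(sig, kept) >= threshold for kept in kept_signatures
        )
        flags.append(not is_dup)
        if not is_dup:
            kept_signatures.append(sig)
    return flags


if __name__ == "__main__":
    rows = [
        "the quick brown fox jumps over the lazy dog",
        "the quick brown fox jumps over the lazy dog",
        "completely different sentence about storage backends and caches",
    ]
    print(keep_flags(rows))
```

The linear scan compares each new signature against every kept one, so it only suits small inputs; an LSH index avoids that by bucketing signatures so that only likely matches are compared.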