-
Notifications
You must be signed in to change notification settings - Fork 115
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change bloom_filter implementation of hash #26
Comments
Tagging @dirkgr since bloom filter is his baby. |
I did not know this about ahash. This bloom filter is the first program I ever wrote in Rust, and it probably shows. I would love to move to a stable hash. |
If that is the case, please take a look into my previous PRs (#24 and especially #23) and take your time integrating them in a manner that you see fit. Most of my commits from #23 comes from the outputs of In my experience the results of Going back to the subject, please review the changes in the PR #23. If they are found to be acceptable, |
#24 was pulled but I feel like a bit more is necessary so I did #31. In adding tests into To share specific ideas on the implementation itself, I plan to integrate either Some more decisions need to be made regarding scalableBloomFilters but this isn't something I can just hackaway alone since it would require architectural decisions, but lets cross that bridge when we arrive there. |
A bunch of these have been merged, but this one is still open. What's next for this PR? |
All of these improvements should also be merged into the standalone tool at https://github.com/allenai/bff. |
Another block is #33 I might be wrong about the potential race condition, or I am right but allenai considers current behavior as "good enough" considering the already fuzzy nature of bloom. So i have been trying to replicate some pipeline work with CC release slices so as to test and approach the pipeline on my own, but it has been stalled for a while.
This is the first time I heard of this and will look into that codebase as well |
I am okay with merging in #39, but yeah the goal is to eventually merge back into the BFF repo and release that as a crate. @dirkgr given your schedule, do you have thoughts on letting implementations (dolma vs bff) diverge for a while? @chris-ha458 are you blocked on #39 not being merged and/or other issues with the rust code? |
I am not blocked on #39. That can wait. i've given a cursory look into bff but it seems to follow a very similar implementation that is found here. If this is known and accepted (I disagree, but there are cases such as Guava where they accept the possibility of race) I would want to see testing suite (either built by allenai or me) to validate when it happens and by how much that is the only way to tell if it is "Acceptable". This goes back to my attempts at trying to replicate the pipleline on my device. |
If you don't mind fixing the racing condition, I'll be happy to merge a PR. I don't think we have strong opinions about which solution to use, so I'm happy for you to call it as you see fit :) For an end to end example, there are two guides I can point you to at the moment. They use an older, internal codebase (allenai/LLM) that we are slowly partitioning out into more user-friendly public code bases. The CLI and configuration syntax has some major differences, but they are a good starting point to figure out how to get around.
I'm happy to chat on discord if you have questions. The plan is to refresh them for dolma over the weekend or next week. |
This is now an old thread, but I just noticed this, and for @soldni 's benefit, I think I have to address the race condition: It is by design, and in fact key to the way BFF achieves its performance. The result of the race condition is not a crash or data corruption, it is just non-deterministic when it decides which duplicate to keep and which duplicate to throw away. In very rare cases it might keep one duplicate that should have been thrown out. Not a big deal for the performance gain you get. |
Currently, bloom_filter.rs implements ahash for the internal hasher.
This is problematic since ahash has an unstable representation:
I would love to learn if the dolma developers have found a way to serialize it in a way that maintains some kind of portability, but that is not a supported use case and I feel there is benefit in moving to a stable hash.
Recommendations
The text was updated successfully, but these errors were encountered: