
Bloom refactor #39

Closed
wants to merge 8 commits

Conversation

chris-ha458
Contributor

I have further refactored some Bloom code and tests.

This would help with reimplementing, or adding alternatives to, the current Bloom filter.
It isolates Bloom-related functions and math that could be shared.

I recognize that this is a significant refactor that might not align with the goals of the developers or the project, but I do feel that it improves the readability, maintainability, and extensibility of the code.

Although I have more changes in mind, I am bringing this forward now so we can discuss how (or whether) to proceed further.

One thing I did notice during the refactor was a peculiarity of pub fn suggest_size_in_bytes.
For clarity, I extracted the value being compared into let max_size: usize = usize::MAX / 2; // ~9.2e18 bytes, i.e. 8 EiB

As per my comments there, that seems like an unrealistic upper bound.
I would think something closer to 1 TiB would be more appropriate, or for it to be configurable.
But I do not know the specific setup that allenai is using internally, so instead of trying to change it myself, I am bringing it to your attention.
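For reference, a minimal sketch of the extraction and a possible configurable cap (the function name, the 1 TiB figure, and the clamp itself are my assumptions, not the repo's actual code):

```rust
// Hypothetical sketch of clamping a suggested filter size; not the repo's code.
fn clamp_suggested_size(suggested: usize) -> usize {
    // usize::MAX / 2 is ~9.2e18 bytes (8 EiB) on 64-bit targets,
    // which is the unrealistic upper bound noted above.
    let max_size: usize = usize::MAX / 2;
    // A configurable cap closer to real hardware, e.g. 1 TiB:
    let one_tib: usize = 1 << 40;
    suggested.min(max_size).min(one_tib)
}

fn main() {
    // A huge request gets capped at 1 TiB; realistic requests pass through.
    println!("{}", clamp_suggested_size(usize::MAX));
    println!("{}", clamp_suggested_size(1024));
}
```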

Also, I am now thinking we could just replace pub fn suggest_size_in_bytes
with a function that I originally built for testing, pub fn simplified_suggest_size.
That function looks simpler (subjectively) and does not rely on loops, mutable variables, or outside functions, since it can be derived from first principles of Bloom filter math.
We could still keep the original function (perhaps with the roles swapped) for testing and comparison purposes.
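A closed-form function of this kind might look like the sketch below, derived from the standard Bloom bound m = -n ln(p) / (ln 2)^2; the name and the power-of-two rounding are my assumptions, not necessarily what the PR's simplified_suggest_size does:

```rust
// Hypothetical closed-form size suggestion from first-principles Bloom math;
// not necessarily the PR's simplified_suggest_size.
fn suggest_size_bytes(expected_elements: usize, false_positive_rate: f64) -> usize {
    // Optimal bit count for n elements at false positive rate p:
    //   m = -n * ln(p) / (ln 2)^2
    let n = expected_elements as f64;
    let ln2 = std::f64::consts::LN_2;
    let bits = -n * false_positive_rate.ln() / (ln2 * ln2);
    let bytes = (bits / 8.0).ceil() as usize;
    // Round up to a power of two, matching the
    // `theoretical_optimum.next_power_of_two()` assertion in the tests above.
    bytes.next_power_of_two()
}

fn main() {
    println!("{}", suggest_size_bytes(1_000_000, 0.01));
}
```

No loops, no mutable state, no calls into the rest of the filter code.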

These are the kind of discoveries and (hopefully) changes that I hoped to make through this refactor.

We are rarely interested in the theoretical optimum value, just the value returned by the suggesting function.
These helpers don't have to be part of the main code, and they contain quite a bit of complexity and interdependency, so I factored them out.
@2015aroras 2015aroras left a comment


If the eventual aim is to have different bloom filter implementations, then it may be good to first decide what functionality will be common between the filters (if we had a bloom filter 'interface' or trait, what methods would be in it?). Once that is decided, it would probably be easier to make the code readable and extensible.

If we don't want to have different bloom filters, then maybe the maths functions can just stay where they are.
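One possible shape for such a trait, shown with a toy stand-in implementation so the sketch is self-contained (all names and method signatures here are illustrative guesses, not an agreed interface):

```rust
use std::collections::HashSet;
use std::sync::Mutex;

// Illustrative guess at a shared bloom filter trait; not an agreed interface.
trait BloomLike {
    /// Record an item, given its precomputed hashes.
    fn insert(&self, hashes: &[u64]);
    /// True if the item is possibly present (false positives allowed).
    fn contains(&self, hashes: &[u64]) -> bool;
}

// Toy stand-in implementation (an exact set, not a real bloom filter),
// only here to show the trait in use.
struct ToyFilter {
    set: Mutex<HashSet<u64>>,
}

impl ToyFilter {
    fn new() -> Self {
        ToyFilter { set: Mutex::new(HashSet::new()) }
    }
}

impl BloomLike for ToyFilter {
    fn insert(&self, hashes: &[u64]) {
        let mut set = self.set.lock().unwrap();
        for &h in hashes {
            set.insert(h);
        }
    }
    fn contains(&self, hashes: &[u64]) -> bool {
        let set = self.set.lock().unwrap();
        hashes.iter().all(|h| set.contains(h))
    }
}

fn main() {
    let f = ToyFilter::new();
    f.insert(&[1, 2, 3]);
    println!("{}", f.contains(&[1, 2, 3]));
}
```

Deciding on a trait like this first would pin down what the shared math helpers need to serve.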

@@ -1,56 +1,97 @@
#[cfg(test)]
mod tests {
use super::super::BloomFilter;


Using mod tests seems to be the Rust way of doing unit tests (https://doc.rust-lang.org/rust-by-example/testing/unit_testing.html). Is there a reason you are changing this? With your changes I am not sure that all the functions are covered by #[cfg(test)]

Contributor Author


I think that was part of the refactor into a separate file.
Unit tests are not required to be in the same file to be valid, but I had forgotten about those specific changes.
I think it was because refactoring the tests out into their own file makes mod tests and the import redundant, but I'll have to check.

// Add more test cases here as needed
];

for test_case in test_cases {


Doing a loop over test cases means that if the first test case fails, none of the others will run. I think a better approach would be to create a separate method per test, and each such test would create a TestCase and pass it to a base method with the test case logic.
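The suggestion above could look like this sketch (the TestCase fields and the stand-in function under test are hypothetical; in the real suite each per-case fn would carry #[test], left off here only to keep the sketch runnable as a plain binary):

```rust
// Hypothetical layout: one function per case, all delegating to a shared
// `check`, so one failing case does not prevent the others from running.
struct TestCase {
    expected_elements: usize,
    false_positive_rate: f64,
    expected_size: usize,
}

// Stand-in for the function under test, so the sketch is self-contained.
fn suggest_size(expected_elements: usize, _fp_rate: f64) -> usize {
    expected_elements.next_power_of_two()
}

// Base method holding the shared test-case logic.
fn check(case: TestCase) {
    assert_eq!(
        suggest_size(case.expected_elements, case.false_positive_rate),
        case.expected_size
    );
}

fn small_filter() {
    check(TestCase { expected_elements: 1_000, false_positive_rate: 0.01, expected_size: 1_024 });
}

fn large_filter() {
    check(TestCase { expected_elements: 1_000_000, false_positive_rate: 0.01, expected_size: 1_048_576 });
}

fn main() {
    small_filter();
    large_filter();
    println!("all cases passed");
}
```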

Contributor Author


I think it was a rough draft to illustrate potential test cases.
If I understand correctly, the more idiomatic approach would be parametrized tests, but that might require importing a test harness such as criterion.


Parametrized tests sound good too.

BloomFilter::suggest_size_in_bytes(expected_elements, desired_false_positive_rate);
assert_eq!(suggested_size, 4_194_304);
assert_eq!(suggested_size, theoretical_optimum.next_power_of_two())
tested_size, simplified_size,


If simplified_suggest_size is an improved form of suggest_size_in_bytes, then it is better not to have it in the unit tests like this. Instead, the implementation of suggest_size_in_bytes should be replaced with the implementation of simplified_suggest_size (probably in a separate change).

Contributor Author


I generally agree.

@@ -0,0 +1,53 @@
use super::BloomFilter;

impl BloomFilter {


How does moving the maths functions here help with sharing them to other possible implementations in the future? The only way this maths can be reused is if the other bloom filter implementations keep a copy of BloomFilter internally. I don't think it's a good idea to do that.

Maybe you can put the maths logic in a new struct. Then it would make sense for multiple filter implementations to consume the maths logic.
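A sketch of that suggestion: a free-standing struct holding only the math, consumable by any filter implementation without holding a BloomFilter internally (the struct name, methods, and formulas are my assumptions, using standard Bloom filter identities):

```rust
// Hypothetical free-standing container for shared Bloom math.
struct BloomMath;

impl BloomMath {
    /// Optimal number of hash functions: k = (m / n) * ln 2.
    fn optimal_num_hashers(size_in_bits: usize, expected_elements: usize) -> usize {
        let k = (size_in_bits as f64 / expected_elements as f64) * std::f64::consts::LN_2;
        k.round().max(1.0) as usize
    }

    /// Expected false positive rate: (1 - e^(-k*n/m))^k.
    fn false_positive_rate(size_in_bits: usize, expected_elements: usize, num_hashers: usize) -> f64 {
        let m = size_in_bits as f64;
        let n = expected_elements as f64;
        let k = num_hashers as f64;
        (1.0 - (-k * n / m).exp()).powf(k)
    }
}

fn main() {
    // 1 MiB of bits for one million elements.
    let k = BloomMath::optimal_num_hashers(8 * 1024 * 1024, 1_000_000);
    let fp = BloomMath::false_positive_rate(8 * 1024 * 1024, 1_000_000, k);
    println!("k = {k}, fp = {fp:.4}");
}
```

Any future filter implementation could call these as plain functions, with no shared state.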

Contributor Author


That makes sense as well.

@chris-ha458
Contributor Author

As for a potential other implementation of Bloom, it would be the more conventional lock-based bloom filter.
In many concurrent bloom implementations, all reads AND writes are conducted behind a mutex.
Since, from an item's standpoint, a reader needs to become a writer as soon as it doesn't find the item, I think it would be hard to correctly avoid that kind of locking. In other words, even an RwLock would be difficult.

Anyway, it would be mostly the same except for using a Mutex to wrap the bloom filter.
Most other implementations would be either the same or require minimal changes.
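The wrapper could be as thin as the sketch below (the inner filter and all names here are hypothetical stand-ins, not the repo's BloomFilter):

```rust
use std::sync::Mutex;

// Hypothetical single-threaded inner filter, standing in for the real one.
struct InnerBloom {
    bits: Vec<u64>,
}

impl InnerBloom {
    fn set_bit(&mut self, idx: usize) {
        self.bits[idx / 64] |= 1 << (idx % 64);
    }
    fn get_bit(&self, idx: usize) -> bool {
        self.bits[idx / 64] & (1 << (idx % 64)) != 0
    }
}

// Lock-based concurrent wrapper: every read and write goes through one Mutex.
// Correct, but serializes all threads, so contention is expected to be high.
struct LockedBloom {
    inner: Mutex<InnerBloom>,
}

impl LockedBloom {
    fn new(num_bits: usize) -> Self {
        LockedBloom { inner: Mutex::new(InnerBloom { bits: vec![0; (num_bits + 63) / 64] }) }
    }

    // Check and insert under the same lock, so the
    // "reader becomes writer on a miss" pattern cannot race.
    fn check_and_insert(&self, idx: usize) -> bool {
        let mut guard = self.inner.lock().unwrap();
        let present = guard.get_bit(idx);
        if !present {
            guard.set_bit(idx);
        }
        present
    }
}

fn main() {
    let bloom = LockedBloom::new(1024);
    println!("{}", bloom.check_and_insert(42)); // first sighting
    println!("{}", bloom.check_and_insert(42)); // already present
}
```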

@chris-ha458
Contributor Author

I am not sure yet whether bff and dolma will be parallel implementations, be independent, or whether one will replace the other. Can anybody shed some light here?

@2015aroras

I am not sure yet whether bff and dolma will be parallel implementations, be independent, or whether one will replace the other. Can anybody shed some light here?

@dirkgr Do you have any context on this, or more broadly any idea on the future of the filter in dolma and bff repos?

@dirkgr
Member

dirkgr commented Sep 22, 2023

BFF and Dolma will be separate for the time being, mainly because we don't have the time to focus on unifying them. In my ideal scenario BFF would be a crate that can run standalone, but also be consumed by Dolma as a library. But unless you, @chris-ha458, want to step up to make that happen, I think we have to keep them separate.

I asked @2015aroras to back-port some of your fixes to BFF because they seem quite important, especially the one about the stable hash functions.

I am not super excited about switching to mutex based locking, unless you can show some numbers that show that it is conclusively faster, even with >256 cores running in parallel. I know the atomic int stuff looks like a lot of code, but it should compile down to a single instruction, no?
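For context, the atomic path being discussed boils down to a fetch_or per bit, as in this simplified sketch (not the repo's code); on x86-64 a relaxed fetch_or typically does compile to a single `lock or` instruction:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Simplified sketch of lock-free bit setting over a slice of atomic words.
// Returns true if the bit was already set before this call.
fn set_bit(bits: &[AtomicU32], idx: usize) -> bool {
    let mask = 1u32 << (idx % 32);
    // fetch_or atomically sets the bit and returns the previous word value.
    let prev = bits[idx / 32].fetch_or(mask, Ordering::Relaxed);
    prev & mask != 0
}

fn main() {
    let bits: Vec<AtomicU32> = (0..4).map(|_| AtomicU32::new(0)).collect();
    println!("{}", set_bit(&bits, 5)); // bit was not set before
    println!("{}", set_bit(&bits, 5)); // bit was already set
}
```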

@chris-ha458
Contributor Author

I am not super excited about switching to mutex based locking, unless you can show some numbers that show that it is conclusively faster, even with >256 cores running in parallel. I know the atomic int stuff looks like a lot of code, but it should compile down to a single instruction, no?

A mutex-based approach would definitely not be faster. Depending on lock contention (which would be expected to be high), using it with multiple threads could even be slower than using a single thread.
It would, however, avoid the race mentioned in #33. (It is also the more common approach to implementing a concurrent bloom filter in other Rust crates.)
The reason to try to implement it here was to provide a "correct" implementation whose behavior is influenced only by thread-related effects. The hope is that it could be done by simply adding a mutex; if it turns out to be anything more involved than that, it would not be worth implementing, considering the development and maintenance costs.

Hopefully such an implementation could be (easily) done when BFF becomes a standalone crate that is consumed by dolma, and tested either as part of dolma or BFF integration tests.

@2015aroras

2015aroras commented Sep 25, 2023

Hopefully such an implementation could be (easily) done when BFF becomes a standalone crate that is consumed by dolma, and tested either as part of dolma or BFF integration tests.

In line with this comment, I think that making BFF a standalone crate that is consumed by dolma should be prioritized over making the implementation thread safe (and maybe even over using a stable hashing implementation).

@soldni soldni closed this May 2, 2024