
Bloom refactor #39

Closed
wants to merge 8 commits

Conversation

chris-ha458
Contributor

I have further refactored some Bloom code and tests.

This would help with reimplementing, or adding alternatives to, the current Bloom filter.
It isolates Bloom-related functions and math that could be shared.

I recognize that this is a significant refactor that might not align with the goals of the developers or the project, but I do feel that it improves the readability, maintainability, and extensibility of the code.

Although I have more changes in mind, I am bringing this forward now so we can discuss how (or whether) to proceed further.

One thing I did notice during the refactor was a peculiarity of pub fn suggest_size_in_bytes.
For clarity, I extracted the value being compared into let max_size: usize = usize::MAX / 2; // ~9.2e18 bytes, i.e. 8 EiB

As per my comments there, that seems like an unrealistic upper bound.
I would think something closer to 1 TiB would be more appropriate, or for it to be configurable.
But I do not know the specific setup that allenai is using internally, so instead of trying to change it myself, I am bringing it to your attention.
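For reference, a minimal sketch of the extraction and a possible configurable cap (the function name, the 1 TiB figure, and the clamp itself are my assumptions, not the repo's actual code):

```rust
// Hypothetical sketch of clamping a suggested filter size; not the repo's code.
fn clamp_suggested_size(suggested: usize) -> usize {
    // usize::MAX / 2 is ~9.2e18 bytes (8 EiB) on 64-bit targets,
    // which is the unrealistic upper bound noted above.
    let max_size: usize = usize::MAX / 2;
    // A configurable cap closer to real hardware, e.g. 1 TiB:
    let one_tib: usize = 1 << 40;
    suggested.min(max_size).min(one_tib)
}

fn main() {
    // A huge request gets capped at 1 TiB; realistic requests pass through.
    println!("{}", clamp_suggested_size(usize::MAX));
    println!("{}", clamp_suggested_size(1024));
}
```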

Also, I am now thinking we could just replace pub fn suggest_size_in_bytes
with a function that I originally built for testing, pub fn simplified_suggest_size.
That function looks simpler (subjectively) and does not rely on loops, mutable variables, or outside functions, since it can be derived from first principles of Bloom filter math.
We could still keep the original function (perhaps with the roles swapped) for testing and comparison purposes.
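A closed-form function of this kind might look like the sketch below, derived from the standard Bloom bound m = -n ln(p) / (ln 2)^2; the name and the power-of-two rounding are my assumptions, not necessarily what the PR's simplified_suggest_size does:

```rust
// Hypothetical closed-form size suggestion from first-principles Bloom math;
// not necessarily the PR's simplified_suggest_size.
fn suggest_size_bytes(expected_elements: usize, false_positive_rate: f64) -> usize {
    // Optimal bit count for n elements at false positive rate p:
    //   m = -n * ln(p) / (ln 2)^2
    let n = expected_elements as f64;
    let ln2 = std::f64::consts::LN_2;
    let bits = -n * false_positive_rate.ln() / (ln2 * ln2);
    let bytes = (bits / 8.0).ceil() as usize;
    // Round up to a power of two, matching the
    // `theoretical_optimum.next_power_of_two()` assertion in the tests above.
    bytes.next_power_of_two()
}

fn main() {
    println!("{}", suggest_size_bytes(1_000_000, 0.01));
}
```

No loops, no mutable state, no calls into the rest of the filter code.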

These are the kind of discoveries and (hopefully) changes that I hoped to make through this refactor.

We are rarely interested in the theoretical optimum value, just the value returned by the suggesting function.
These helpers don't have to be part of the main code, and they contain quite a bit of complexity and interdependency, so I factored them out.
@2015aroras 2015aroras left a comment


If the eventual aim is to have different bloom filter implementations, then it may be good to first decide what functionality will be common between the filters (if we had a bloom filter 'interface' or trait, what methods would be in it?). Once that is decided, it would probably be easier to make the code readable and extensible.

If we don't want to have different bloom filters, then maybe the maths functions can just stay where they are.
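One possible shape for such a trait, shown with a toy stand-in implementation so the sketch is self-contained (all names and method signatures here are illustrative guesses, not an agreed interface):

```rust
use std::collections::HashSet;
use std::sync::Mutex;

// Illustrative guess at a shared bloom filter trait; not an agreed interface.
trait BloomLike {
    /// Record an item, given its precomputed hashes.
    fn insert(&self, hashes: &[u64]);
    /// True if the item is possibly present (false positives allowed).
    fn contains(&self, hashes: &[u64]) -> bool;
}

// Toy stand-in implementation (an exact set, not a real bloom filter),
// only here to show the trait in use.
struct ToyFilter {
    set: Mutex<HashSet<u64>>,
}

impl ToyFilter {
    fn new() -> Self {
        ToyFilter { set: Mutex::new(HashSet::new()) }
    }
}

impl BloomLike for ToyFilter {
    fn insert(&self, hashes: &[u64]) {
        let mut set = self.set.lock().unwrap();
        for &h in hashes {
            set.insert(h);
        }
    }
    fn contains(&self, hashes: &[u64]) -> bool {
        let set = self.set.lock().unwrap();
        hashes.iter().all(|h| set.contains(h))
    }
}

fn main() {
    let f = ToyFilter::new();
    f.insert(&[1, 2, 3]);
    println!("{}", f.contains(&[1, 2, 3]));
}
```

Deciding on a trait like this first would pin down what the shared math helpers need to serve.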

@@ -1,56 +1,97 @@
#[cfg(test)]
mod tests {
use super::super::BloomFilter;


Using mod tests seems to be the Rust way of doing unit tests (https://doc.rust-lang.org/rust-by-example/testing/unit_testing.html). Is there a reason you are changing this? With your changes I am not sure that all the functions are covered by #[cfg(test)]

Contributor Author


I think that was part of the refactor into a separate file.
Unit tests are not required to be in the same file to be valid, but I had forgotten about those specific changes.
I think it was because refactoring the tests out into their own file makes mod tests and the import redundant, but I'll have to check.

// Add more test cases here as needed
];

for test_case in test_cases {


Doing a loop over test cases means that if the first test case fails, none of the others will run. I think a better approach would be to create a separate method per test, and each such test would create a TestCase and pass it to a base method with the test case logic.
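The suggestion above could look like this sketch (the TestCase fields and the stand-in function under test are hypothetical; in the real suite each per-case fn would carry #[test], left off here only to keep the sketch runnable as a plain binary):

```rust
// Hypothetical layout: one function per case, all delegating to a shared
// `check`, so one failing case does not prevent the others from running.
struct TestCase {
    expected_elements: usize,
    false_positive_rate: f64,
    expected_size: usize,
}

// Stand-in for the function under test, so the sketch is self-contained.
fn suggest_size(expected_elements: usize, _fp_rate: f64) -> usize {
    expected_elements.next_power_of_two()
}

// Base method holding the shared test-case logic.
fn check(case: TestCase) {
    assert_eq!(
        suggest_size(case.expected_elements, case.false_positive_rate),
        case.expected_size
    );
}

fn small_filter() {
    check(TestCase { expected_elements: 1_000, false_positive_rate: 0.01, expected_size: 1_024 });
}

fn large_filter() {
    check(TestCase { expected_elements: 1_000_000, false_positive_rate: 0.01, expected_size: 1_048_576 });
}

fn main() {
    small_filter();
    large_filter();
    println!("all cases passed");
}
```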

Contributor Author


I think it was a rough draft to illustrate potential test cases.
If I understand correctly, the more idiomatic approach would be parametrized tests, but that might require importing a test harness such as criterion.


Parametrized tests sound good too.

BloomFilter::suggest_size_in_bytes(expected_elements, desired_false_positive_rate);
assert_eq!(suggested_size, 4_194_304);
assert_eq!(suggested_size, theoretical_optimum.next_power_of_two())
tested_size, simplified_size,


If simplified_suggest_size is an improved form of suggest_size_in_bytes, then it is better not to have it in the unit tests like this. Instead, the implementation of suggest_size_in_bytes should be replaced with the implementation of simplified_suggest_size (probably in a separate change).

Contributor Author


I generally agree.

@@ -0,0 +1,53 @@
use super::BloomFilter;

impl BloomFilter {


How does moving the maths functions here help with sharing them to other possible implementations in the future? The only way this maths can be reused is if the other bloom filter implementations keep a copy of BloomFilter internally. I don't think it's a good idea to do that.

Maybe you can put the maths logic in a new struct. Then it would make sense for multiple filter implementations to consume the maths logic.
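A sketch of that suggestion: a free-standing struct holding only the math, consumable by any filter implementation without holding a BloomFilter internally (the struct name, methods, and formulas are my assumptions, using standard Bloom filter identities):

```rust
// Hypothetical free-standing container for shared Bloom math.
struct BloomMath;

impl BloomMath {
    /// Optimal number of hash functions: k = (m / n) * ln 2.
    fn optimal_num_hashers(size_in_bits: usize, expected_elements: usize) -> usize {
        let k = (size_in_bits as f64 / expected_elements as f64) * std::f64::consts::LN_2;
        k.round().max(1.0) as usize
    }

    /// Expected false positive rate: (1 - e^(-k*n/m))^k.
    fn false_positive_rate(size_in_bits: usize, expected_elements: usize, num_hashers: usize) -> f64 {
        let m = size_in_bits as f64;
        let n = expected_elements as f64;
        let k = num_hashers as f64;
        (1.0 - (-k * n / m).exp()).powf(k)
    }
}

fn main() {
    // 1 MiB of bits for one million elements.
    let k = BloomMath::optimal_num_hashers(8 * 1024 * 1024, 1_000_000);
    let fp = BloomMath::false_positive_rate(8 * 1024 * 1024, 1_000_000, k);
    println!("k = {k}, fp = {fp:.4}");
}
```

Any future filter implementation could call these as plain functions, with no shared state.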

Contributor Author


That makes sense as well.

@chris-ha458
Contributor Author

As for a potential other implementation of Bloom, it would be the more conventional lock-based bloom filter.
In many concurrent bloom implementations, all reads AND writes are conducted behind a mutex.
Since, from an item's standpoint, a reader needs to become a writer as soon as it doesn't find the item, I think it would be hard to correctly avoid that kind of locking. In other words, even an RwLock would be difficult.

Anyway, it would be mostly the same except for using a Mutex to wrap the bloom filter.
Most other implementations would be either the same or require minimal changes.
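The wrapper could be as thin as the sketch below (the inner filter and all names here are hypothetical stand-ins, not the repo's BloomFilter):

```rust
use std::sync::Mutex;

// Hypothetical single-threaded inner filter, standing in for the real one.
struct InnerBloom {
    bits: Vec<u64>,
}

impl InnerBloom {
    fn set_bit(&mut self, idx: usize) {
        self.bits[idx / 64] |= 1 << (idx % 64);
    }
    fn get_bit(&self, idx: usize) -> bool {
        self.bits[idx / 64] & (1 << (idx % 64)) != 0
    }
}

// Lock-based concurrent wrapper: every read and write goes through one Mutex.
// Correct, but serializes all threads, so contention is expected to be high.
struct LockedBloom {
    inner: Mutex<InnerBloom>,
}

impl LockedBloom {
    fn new(num_bits: usize) -> Self {
        LockedBloom { inner: Mutex::new(InnerBloom { bits: vec![0; (num_bits + 63) / 64] }) }
    }

    // Check and insert under the same lock, so the
    // "reader becomes writer on a miss" pattern cannot race.
    fn check_and_insert(&self, idx: usize) -> bool {
        let mut guard = self.inner.lock().unwrap();
        let present = guard.get_bit(idx);
        if !present {
            guard.set_bit(idx);
        }
        present
    }
}

fn main() {
    let bloom = LockedBloom::new(1024);
    println!("{}", bloom.check_and_insert(42)); // first sighting
    println!("{}", bloom.check_and_insert(42)); // already present
}
```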

@chris-ha458
Contributor Author

I am not sure yet whether bff and dolma will be parallel implementations, be independent, or whether one will replace the other. Can anybody shed some light here?

@2015aroras

I am not sure yet whether bff and dolma will be parallel implementations, be independent, or whether one will replace the other. Can anybody shed some light here?

@dirkgr Do you have any context on this, or more broadly any idea on the future of the filter in dolma and bff repos?

@dirkgr
Member

dirkgr commented Sep 22, 2023

BFF and Dolma will be separate for the time being, mainly because we don't have the time to focus on unifying them. In my ideal scenario BFF would be a crate that can run standalone, but also be consumed by Dolma as a library. But unless you, @chris-ha458, want to step up to make that happen, I think we have to keep them separate.

I asked @2015aroras to back-port some of your fixes to BFF because they seem quite important, especially the one about the stable hash functions.

I am not super excited about switching to mutex based locking, unless you can show some numbers that show that it is conclusively faster, even with >256 cores running in parallel. I know the atomic int stuff looks like a lot of code, but it should compile down to a single instruction, no?
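For context, the atomic path being discussed boils down to a fetch_or per bit, as in this simplified sketch (not the repo's code); on x86-64 a relaxed fetch_or typically does compile to a single `lock or` instruction:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Simplified sketch of lock-free bit setting over a slice of atomic words.
// Returns true if the bit was already set before this call.
fn set_bit(bits: &[AtomicU32], idx: usize) -> bool {
    let mask = 1u32 << (idx % 32);
    // fetch_or atomically sets the bit and returns the previous word value.
    let prev = bits[idx / 32].fetch_or(mask, Ordering::Relaxed);
    prev & mask != 0
}

fn main() {
    let bits: Vec<AtomicU32> = (0..4).map(|_| AtomicU32::new(0)).collect();
    println!("{}", set_bit(&bits, 5)); // bit was not set before
    println!("{}", set_bit(&bits, 5)); // bit was already set
}
```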

@chris-ha458
Contributor Author

I am not super excited about switching to mutex based locking, unless you can show some numbers that show that it is conclusively faster, even with >256 cores running in parallel. I know the atomic int stuff looks like a lot of code, but it should compile down to a single instruction, no?

A mutex-based approach would definitely not be faster. Depending on lock contention (which would be expected to be high), using it with multiple threads could even be slower than using a single thread.
It would, however, avoid the race mentioned in #33. (It is also the more common approach to implementing a concurrent bloom filter in other Rust crates.)
The reason to try to implement it here was to provide a "correct" implementation whose behavior is influenced only by thread-related effects. The hope is that it could be done by simply adding a mutex; if it turns out to be anything more involved than that, it would not be worth implementing, considering the development and maintenance costs.

Hopefully such an implementation could be (easily) done when BFF becomes a standalone crate that is consumed by dolma, and tested either as part of dolma or BFF integration tests.

@2015aroras

2015aroras commented Sep 25, 2023

Hopefully such an implementation could be (easily) done when BFF becomes a standalone crate that is consumed by dolma, and tested either as part of dolma or BFF integration tests.

In line with this comment, I think that making BFF a standalone crate that is consumed by dolma should be prioritized over making the implementation thread safe (and maybe even over using a stable hashing implementation).

@soldni soldni closed this May 2, 2024