Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add 64 bit implementation of roaring.Bitmap #245

Merged
merged 23 commits into from
Jul 16, 2020
Merged

Add 64 bit implementation of roaring.Bitmap #245

merged 23 commits into from
Jul 16, 2020

Conversation

jacobmarble
Copy link

@jacobmarble jacobmarble commented Mar 24, 2020

Fixes #136

This PR is very much a work-in-progress. I've tried to stay with the original 32 bit method signatures as much as possible. Serialization is explicitly not part of this PR.

  • fix failing tests
  • augment tests with 64 bit values
  • verify copy-on-write

Jacob Marble added 2 commits March 23, 2020 18:45
Fixes #136

- many tests fail
- tests have not been augmented with 64 bit values
- serialization is explicitly not part of this PR
- no special attention has been given to copy-on-write concerns
@jacobmarble
Copy link
Author

@lemire do you mind giving this an early review? I want to make sure I'm on the right track.

@lemire
Copy link
Member

lemire commented Mar 29, 2020

@jacobmarble This looks like solid code to me. It ressembles the C++ and Java counterparts of the same design.

Obviously, I can only encourage the addition of more testing, and if you are able, I'd throw in fuzz testing as well.

@coveralls
Copy link

coveralls commented Mar 29, 2020

Coverage Status

Coverage increased (+0.2%) to 82.45% when pulling bc493cc on jacobmarble:jgm-64-bit into 4d53b29 on RoaringBitmap:master.

@kevinconaway
Copy link
Contributor

Any update here?

@lemire
Copy link
Member

lemire commented May 22, 2020

@kevinconaway There are some tests and it looks good, but the tests are not as thorough as one would like. I am nervous about merging this and having people's database blow up. We also could use performance and memory usage measures.

Would you be willing to add new tests and take it out on a spin?

@lemire
Copy link
Member

lemire commented May 29, 2020

This issue #252 might reference this PR.

@guymolinari
Copy link
Contributor

This work is really needed. Can I help in some way?

@lemire
Copy link
Member

lemire commented Jun 16, 2020

@guymolinari Yes. Can you review and test this PR?

There are tests, but I concerned that they may not be thorough enough. We don't want this code to blow up when it goes into production.

@guymolinari
Copy link
Contributor

Will do @lemire. Have a large project with 100s of terabytes of data that will use this.

@jacobmarble
Copy link
Author

Thanks for the chatter here. I wrote what is here so far and then got busy with other work.

@guymolinari
Copy link
Contributor

How do you wan't to handle this @jacobmarble? I will definitely need the serialization stuff and am willing to dig in and do it. Also, more tests as needed. I have another outstanding PR with @lemire so I'm not sure working on my fork would make sense. Never forked another owners fork. Should I go this route?

@jacobmarble
Copy link
Author

You could pull down my branch and layer new commits on top, or cherry pick them into your branch. It would be nice to see my name on my commits, but that's all.

@guymolinari
Copy link
Contributor

@jacobmarble Could you grant write permissions to me for this repo?

@jacobmarble
Copy link
Author

@guymolinari I've added you as a collaborator, see if that works

@guymolinari
Copy link
Contributor

guymolinari commented Jun 16, 2020 via email

@guymolinari
Copy link
Contributor

guymolinari commented Jun 16, 2020 via email

@lemire
Copy link
Member

lemire commented Jun 19, 2020

@guymolinari Keep us posted.

@guymolinari
Copy link
Contributor

Will do @lemire @jacobmarble. Should have an update tomorrow. Next step is to increase test coverage and then should be ready to integrate with my project.

@guymolinari
Copy link
Contributor

I'm checking in test cases and coverage percentage is dropping. I'm going to call it good for now and move on to integrating the 64 bit version into my project and see how it works. We should have functional parity between the legacy and 64 bit APIs.

  1. Serialization should be done. Looking at the C/C++ CRoaring libraries I have tried to follow the established patterns (word sizes, endianness, etc) so that wire protocols and cross language polyglots are preserved.

  2. Added some tests that focus on 64 bit values for better coverage. Unable to fool the static analysis being done by Coverall. :-)

Cheers,
Guy

@lemire @jacobmarble

@guymolinari
Copy link
Contributor

@lemire @jacobmarble The roaring64 implementation is missing the ParOr function so I will have to add that. It is required for my project and also the new PR for the bit slice indexing library. @lemire I guess we have a separate 64 bit version of that as well?

@lemire
Copy link
Member

lemire commented Jun 23, 2020

Great work.

The roaring64 implementation is missing the ParOr function so I will have to add that. It is required for my project

Is that blocking for you, or something you can add later?

I guess we have a separate 64 bit version of that as well?

Can your rephrase?

and also the new PR for the bit slice indexing library.

I am not forgetting that PR, but let us keep things separate for now.

@guymolinari
Copy link
Contributor

guymolinari commented Jun 23, 2020 via email

@guymolinari
Copy link
Contributor

@lemire @jacobmarble I am happy to say that application level testing went well.

I upgraded my system to utilize the new 64 bit library and processed about 1TB of data. I tested under 2 scenarios. Both results were compared against the system of record without issue. Another thing of note. I had to create a 64 bit version of the new BSI library as well as it is also heavily utilized (will submit a separate PR when you are ready @lemire).

  1. The first scenario used 32 bit identifiers. Processing latency was on par with the legacy 32 bit version.

  2. The second scenario involved generating 64 bit identifiers by doing a 32 bit shift left and adding a timestamp. Processing time increased as expected (almost 3x). Not really surprised but I'm wondering if it would make sense to move away from an array of slices ([]*roaring.Bitmap) and use a (map[]*roaring.Bitmap) in order to mitigate the overhead of binarySearch().

Either way I think this is a +1 from my perspective. Additional test coverage would be nice. Also, a few features need to be ported to 64 bit as well.

@lemire
Copy link
Member

lemire commented Jun 27, 2020

@guymolinari Note that I merged the BSI code and then applied maintenance on it.

@lemire
Copy link
Member

lemire commented Jun 27, 2020

@guymolinari Did we test the COW functionality at some point?

@guymolinari
Copy link
Contributor

guymolinari commented Jun 27, 2020 via email

@guymolinari
Copy link
Contributor

guymolinari commented Jun 27, 2020 via email

@guymolinari
Copy link
Contributor

guymolinari commented Jun 29, 2020 via email

@guymolinari guymolinari marked this pull request as ready for review June 30, 2020 13:36
@guymolinari guymolinari changed the title [WIP] feat: add 64 bit implementation of roaring.Bitmap Add 64 bit implementation of roaring.Bitmap Jul 3, 2020
@lemire
Copy link
Member

lemire commented Jul 15, 2020

@guymolinari Any update?

@guymolinari
Copy link
Contributor

guymolinari commented Jul 15, 2020 via email

@lemire
Copy link
Member

lemire commented Jul 15, 2020

@guymolinari If you could just let us know whether you recommend that the PR be merged, that would be great. If you do, I will then review it and hopefully merge it soon after.

Missing features that you do not consider essential are ok. They can come later.

@guymolinari
Copy link
Contributor

guymolinari commented Jul 15, 2020 via email

@lemire
Copy link
Member

lemire commented Jul 16, 2020

It looks good to me. I will merge and possibly do some minor edits.

@lemire lemire merged commit c0e59bb into RoaringBitmap:master Jul 16, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support 64-bit indices?
6 participants