optimize modulo calculation #214
base: master
Conversation
changing this order will make it much faster especially for larger arrays
Re: #212, how do you want me to handle xxhash importing?
I'll focus on general MinHash improvements in this PR rather than alternative hashing schemes.
Thanks! Does this change lead to different hash values?
See my comment in #212 (comment). We let users choose their hash function through MinHash's init parameter.
Thanks! I think as a next step, we can think about a good abstraction for users to specify the permutation calculation.
I have been working on this on a private branch (as well as on another implementation with different project goals), but I am having some difficulty making the changes while respecting the needs and goals of this repo. Do you prefer that I still make the commits directly on minhash.py, or should I show it to you separately?
Thanks for your contribution. I think whichever way works for you is best. Please take my comments as suggestions only.
These shouldn't diverge, but they do.
This reverts commit 14e54ba.
Although the intermediate calculations benefit from being done in wider dtypes, the final hashes should be stored and operated on with the same number of bits as the original hash (32 bits).
I don't know why I've operated on that assumption. Anyway, even without it, there are improvements to be made: since we start with sha1_hash32, which has only 32 bits, and we ensure the resulting numbers have 32 bits or fewer (np.bitwise_and with maxhash), there is no need for the numbers to stay np.uint64 after the permutations have been applied. I think I can also get away with keeping the a, b permutation arrays as np.uint64, but I've put those changes into separate commits.
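A minimal numpy sketch of the narrowing argument above (the values and names here are illustrative, not datasketch's internals): once the final mask is applied, every value fits in 32 bits, so the stored array can be narrowed losslessly.

```python
import numpy as np

# the final mask is 2**32 - 1; keeping uint64 only matters for the
# intermediate arithmetic, not for storage of the masked results
max_hash = np.uint64((1 << 32) - 1)

# pretend these are permuted hash values computed in 64-bit arithmetic
vals64 = np.array([0x1_2345_6789, 0xDEAD_BEEF_0123], dtype=np.uint64)

masked = np.bitwise_and(vals64, max_hash)  # every value now fits in 32 bits
vals32 = masked.astype(np.uint32)          # lossless narrowing, half the memory

assert np.array_equal(vals32.astype(np.uint64), masked)
print(vals32.dtype, vals32.nbytes, "bytes vs", masked.nbytes, "bytes")
```

The narrowing is safe only *after* the mask; narrowing before the multiply-add would overflow.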
initial value/mask can definitely be np.uint32 as well
maxhash appropriate sized datatype
As you can see, this is very rudimentary. There are some assumptions that are practical yet imperfect.
Looks good! I am wondering if we can have a "playground" inside test to verify the equivalence of different permutation code, e.g., that the old version vs. the new one result in the same hash values.
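One way such a playground could look: a property-style check pitting a reference implementation against a candidate, using exact Python-int arithmetic so integer overflow cannot mask a divergence. The function names and the candidate body here are made up for illustration; the candidate slot is where a proposed optimization would go.

```python
import random

P = (1 << 61) - 1          # the Mersenne prime used for the permutations
MAX_HASH = (1 << 32) - 1

def perm_reference(h, a, b):
    # the classic formula, exact Python-int arithmetic
    return ((a * h + b) % P) & MAX_HASH

def perm_candidate(h, a, b):
    # slot for the optimized version under review; here it merely
    # restates the same math so the harness itself is demonstrable
    return ((h * a + b) % P) & MAX_HASH

def test_permutations_agree(trials=10_000, seed=0):
    rng = random.Random(seed)
    for _ in range(trials):
        h = rng.randrange(1 << 32)      # a 32-bit input hash
        a = rng.randrange(1, P)
        b = rng.randrange(P)
        assert perm_reference(h, a, b) == perm_candidate(h, a, b)

test_permutations_agree()
print("reference and candidate agree on all sampled inputs")
```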
We can focus on the performance improvement over the existing permutation calculation in this PR. We can open a separate PR for customization and decide there.
Focusing on performance improvements sounds nice. If these changes prove fruitful, we can explore customization options as necessary. As for the playground, that sounds like a good idea.
It might be better to pause performance improvements until we have a good benchmark to compare against. That is a much larger effort, though, and I don't know how you would like to proceed. Do you have any thoughts or pointers on how to implement a benchmark here?
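As a quick, repo-independent starting point, `timeit` over the vectorized update expression already gives a usable microbenchmark. A sketch (the array size and the exact expression being timed are assumptions about what matters here):

```python
import timeit
import numpy as np

P = np.uint64((1 << 61) - 1)
MAX_HASH = np.uint64((1 << 32) - 1)
NUM_PERM = 128

rng = np.random.default_rng(42)
a = rng.integers(1, int(P), size=NUM_PERM, dtype=np.uint64)
b = rng.integers(0, int(P), size=NUM_PERM, dtype=np.uint64)
hv = np.uint64(0x12345678)

def permute_once():
    # a classic-style vectorized update expression
    return np.bitwise_and((a * hv + b) % P, MAX_HASH)

n = 10_000
secs = timeit.timeit(permute_once, number=n)
print(f"{secs / n * 1e6:.2f} us per update")
```

Running the same harness against the old and new expressions, over a few array sizes, would give the comparison this thread is asking for.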
Why use the Mersenne prime if we are not going to use the Mersenne trick?
So, this needs some explanation. The reason Mersenne primes are used in the first place is that they have a unique property: reduction modulo 2^s - 1 can be done with shifts, masks, and adds instead of a division. Thanks to modern vectorized instructions and byte-based primitives, this advantage is less pronounced. If you feel that this code is too ugly, unperformant, unmaintainable, etc., that is 100% fine. Then my argument becomes: there is no reason to use Mersenne primes in that case; we can use larger primes.
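For readers following along, the "Mersenne trick" referred to here is presumably the standard shift-and-add reduction: because 2^61 ≡ 1 (mod 2^61 - 1), the high bits of a product can be folded back onto the low bits without dividing. A scalar sketch with exact Python ints:

```python
P_EXP = 61
P = (1 << P_EXP) - 1  # Mersenne prime 2^61 - 1

def mod_mersenne(x):
    """Reduce x modulo P = 2^61 - 1 using shifts and adds only.

    Works because 2^61 == 1 (mod P), so x's high bits can be folded
    onto its low bits. Valid for 0 <= x < 2^122, which covers a*h + b
    when a, b < P and h < 2^32.
    """
    x = (x & P) + (x >> P_EXP)   # fold once: x is now below 2^62
    x = (x & P) + (x >> P_EXP)   # fold again: x is now at most P + 1
    return x - P if x >= P else x

# spot-check against the plain % operator
for x in (0, P, P + 1, (1 << 93) - 7, (1 << 122) - 1):
    assert mod_mersenne(x) == x % P
print("mod_mersenne matches % on all checks")
```

A vectorized numpy version of the same folding is what a "new trick" code path might look like; whether it beats numpy's `%` in practice is exactly what a benchmark would settle.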
We have a microbenchmark for MinHash in the […].
This is an interesting case to make. I think we can benefit from a simple test module in […]. As for the code change, I think if it helps with performance to use this new trick over […].
I have thought about finding a way to preserve backward compatibility and performance through selections, but it seems very difficult. It is not that the technical side is difficult, but the code becomes very difficult to read and to understand what is happening. I think that is why I return to my original proposal: we keep the backwards-compatible […]. Meanwhile, I will try to implement the […].
Thanks for your input. This is very valuable. I think we can offer a simple alternative option in the MinHash constructor (e.g., […]). This way there are only two code paths: (1) one code path that aggressively adopts the latest permutation trick, without worrying about backward compatibility; and (2) the classic MinHash permutation algorithm, backward compatible but performance optimized. In the documentation we can explain the option as a breaking change for older MinHashes -- "use with care".
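A toy sketch of what such a constructor option could look like. The flag name `permutation`, the class name, and both function bodies are hypothetical, not datasketch's API; the point is only that the dispatch happens once, in `__init__`, so neither code path pays for the other.

```python
import numpy as np

_P = np.uint64((1 << 61) - 1)
_MAX_HASH = np.uint64((1 << 32) - 1)

def _permute_classic(hv, a, b):
    # backward-compatible path: the formula used today
    return np.bitwise_and((a * hv + b) % _P, _MAX_HASH)

def _permute_fast(hv, a, b):
    # placeholder for the new trick; restates the classic math so the
    # dispatch itself is runnable
    return np.bitwise_and((a * hv + b) % _P, _MAX_HASH)

class SketchWithOption:
    """Illustration only: permutation scheme chosen by constructor flag."""

    def __init__(self, num_perm=128, permutation="classic", seed=1):
        rng = np.random.default_rng(seed)
        self.a = rng.integers(1, int(_P), size=num_perm, dtype=np.uint64)
        self.b = rng.integers(0, int(_P), size=num_perm, dtype=np.uint64)
        self._permute = {"classic": _permute_classic,
                         "fast": _permute_fast}[permutation]

    def update(self, hv):
        return self._permute(np.uint64(hv), self.a, self.b)

s = SketchWithOption(num_perm=4, permutation="fast")
print(s.update(0xDEADBEEF))
```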
@chris-ha458, are you still interested in this PR?