performance of avx512 bit ops and popcounts #41
Comments
Since it's integers, the compiler has the liberty to reorder lots of things, while in floating point it can't because addition is not associative, so the usual suspects that slow down reductions do not exist. In conclusion, like you, I think fast popcnt might be the bottleneck. Note on multithreading: Bench on my CPU
Frequencies: scalar code 4.1 GHz, AVX512 3.5 GHz (liquid cooling; for an air-cooled CPU I think you are probably looking at a 1 GHz downclocking). It's important to know if your CPU is a Skylake-X, Xeon-W or Xeon Gold, which have 2 AVX512 units per core, or a Xeon Bronze / Silver, which only have 1. Now regarding popcnt, I did a bit of research and found the following:
I've forked the repo so that on AVX512 it doesn't use software emulation: https://github.com/mratsim/sse-popcount/tree/dont-use-avx512-emulation You can run
Then in terms of implementations, given that there are many variables to track, I expect that for your needs AVX512 has a lot of advantages over AVX2 even with the downclocking, due to 32 AVX512 registers versus only 16 for AVX2. The fastest popcnt on my machine, Harley-Seal, needs 12 registers per popcnt: https://github.com/WojciechMula/sse-popcount/blob/201fadc4580f785e394575a4af7d475135d4f7b1/popcnt-avx512-harley-seal.cpp#L39-L45
Details of the implementations
Assuming Harley-Seal is also the fastest on your machine, here are some details:
The popcnt is done 64 by 64, and the remainder byte by byte: https://github.com/WojciechMula/sse-popcount/blob/201fadc4580f785e394575a4af7d475135d4f7b1/popcnt-avx512-harley-seal.cpp#L90-L91
Here we stay in the m512i domain, including for total; you want to have your for loop here:
This is unrolled by 16.
So you need 4 AVX512 running totals.
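To make the Harley-Seal idea above concrete, here is a minimal scalar sketch. It is not the linked AVX512 code: it works on plain 64-bit words instead of m512i registers, is unrolled by 8 rather than 16 to stay short, and uses `std::bitset` for the per-word popcount so it compiles anywhere. The names `popcount64`, `csa`, `popcount_harley_seal` and `popcount_naive` are made up for illustration.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Portable popcount of one 64-bit word (std::bitset::count is standard C++).
inline std::uint64_t popcount64(std::uint64_t x) {
    return std::bitset<64>(x).count();
}

// Carry-save adder: adds three bit vectors column by column.
// `low` receives the sum bits, `high` the carry bits (each worth twice as much).
inline void csa(std::uint64_t& high, std::uint64_t& low,
                std::uint64_t a, std::uint64_t b, std::uint64_t c) {
    const std::uint64_t u = a ^ b;
    high = (a & b) | (u & c);
    low = u ^ c;
}

// Scalar Harley-Seal popcount, 8 words per iteration.
// Only one real popcount (on `eights`) is needed per 8 input words.
std::uint64_t popcount_harley_seal(const std::vector<std::uint64_t>& d) {
    std::uint64_t total = 0, ones = 0, twos = 0, fours = 0;
    std::uint64_t twos_a = 0, twos_b = 0, fours_a = 0, fours_b = 0, eights = 0;
    std::size_t i = 0;
    for (; i + 8 <= d.size(); i += 8) {
        csa(twos_a, ones, ones, d[i], d[i + 1]);
        csa(twos_b, ones, ones, d[i + 2], d[i + 3]);
        csa(fours_a, twos, twos, twos_a, twos_b);
        csa(twos_a, ones, ones, d[i + 4], d[i + 5]);
        csa(twos_b, ones, ones, d[i + 6], d[i + 7]);
        csa(fours_b, twos, twos, twos_a, twos_b);
        csa(eights, fours, fours, fours_a, fours_b);
        total += popcount64(eights);
    }
    // Fold the partial sums back in with their weights.
    total = 8 * total + 4 * popcount64(fours) + 2 * popcount64(twos) + popcount64(ones);
    // Remainder: fewer than 8 words left, count them one by one.
    for (; i < d.size(); ++i) total += popcount64(d[i]);
    return total;
}

// Reference implementation used to check the result.
std::uint64_t popcount_naive(const std::vector<std::uint64_t>& d) {
    std::uint64_t total = 0;
    for (std::uint64_t w : d) total += popcount64(w);
    return total;
}
```

Calling `popcount_harley_seal` and `popcount_naive` on the same vector should give the same count; the vectorized version follows the same structure, with each `std::uint64_t` replaced by a 512-bit register.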
wow. thanks for taking a look. here is what I see:
As I understand it, Harley-Seal will require a substantial change to the code. And I haven't figured out how to keep an m512i accumulating into an array (I think that's what you mean). I guess because the size is unknown... I'll revisit next week. Thanks again.
You don't need to invent anything; the algorithm already accumulates into one total. You just need to have 4 variables vibs0, vibs2, vnhets, vnN instead of a single total.
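A rough sketch of what "4 running totals instead of one" could look like, in scalar C++ so it runs anywhere. Everything about the data layout here is an assumption made for illustration: the `Genotypes` struct with three bit vectors (`hom_ref`, `het`, `hom_alt`), the `Totals` struct, and the bit formulas for each counter are hypothetical and not taken from somalier. Only the idea of keeping four independent accumulators in a single pass comes from the comment above.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Portable popcount of one 64-bit word.
inline std::uint64_t popcount64(std::uint64_t x) {
    return std::bitset<64>(x).count();
}

// Hypothetical layout: one bit per site in each vector.
struct Genotypes {
    std::vector<std::uint64_t> hom_ref;
    std::vector<std::uint64_t> het;
    std::vector<std::uint64_t> hom_alt;
};

// Four independent running totals (scalar stand-ins for vibs0, vibs2, vnhets, vnN).
struct Totals {
    std::uint64_t ibs0 = 0;         // opposite homozygotes
    std::uint64_t ibs2 = 0;         // identical genotypes
    std::uint64_t shared_hets = 0;  // both heterozygous
    std::uint64_t n = 0;            // both samples called
};

// One pass over the words; each quantity gets its own accumulator.
// The formulas are illustrative assumptions, not somalier's definitions.
Totals count_pair(const Genotypes& a, const Genotypes& b) {
    assert(a.het.size() == b.het.size());
    assert(a.hom_ref.size() == a.het.size() && a.hom_alt.size() == a.het.size());
    assert(b.hom_ref.size() == b.het.size() && b.hom_alt.size() == b.het.size());
    Totals t;
    for (std::size_t k = 0; k < a.het.size(); ++k) {
        const std::uint64_t a_called = a.hom_ref[k] | a.het[k] | a.hom_alt[k];
        const std::uint64_t b_called = b.hom_ref[k] | b.het[k] | b.hom_alt[k];
        const std::uint64_t both_het = a.het[k] & b.het[k];
        t.ibs0 += popcount64((a.hom_ref[k] & b.hom_alt[k]) | (a.hom_alt[k] & b.hom_ref[k]));
        t.ibs2 += popcount64((a.hom_ref[k] & b.hom_ref[k]) | both_het | (a.hom_alt[k] & b.hom_alt[k]));
        t.shared_hets += popcount64(both_het);
        t.n += popcount64(a_called & b_called);
    }
    return t;
}
```

In a vectorized version each field of `Totals` would presumably become its own m512i accumulator, reduced to a scalar only once after the loop.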
Another biologist exploring the popcount rabbit hole: http://www.dalkescientific.com/writings/diary/archive/2008/07/05/bitslice_and_popcount.html
as requested, I am opening an issue.
somalier calculates relatedness between pairs of samples using bitwise operations and popcounts here
where genotypes is effectively:
and currently, those seqs have a len of about 300. so for 10K samples, doing relatedness for all-vs-all, it will do ~50 million calls to the IBS function linked above.
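The ~50 million figure is just the number of unordered pairs, n(n-1)/2. A small sketch of that arithmetic, where `pair_count` and `words_scanned` are hypothetical helper names and "len of about 300" is assumed to mean 300 elements per bit vector:

```cpp
#include <cassert>
#include <cstdint>

// Number of unordered pairs among n samples: n * (n - 1) / 2.
inline std::uint64_t pair_count(std::uint64_t n) {
    return n * (n - 1) / 2;
}

// Elements read from one bit vector across all pairs,
// assuming each vector holds `len` elements.
inline std::uint64_t words_scanned(std::uint64_t n, std::uint64_t len) {
    return pair_count(n) * len;
}
```

For n = 10000 this gives 49,995,000 pairs, and with len = 300 roughly 1.5e10 element reads per bit vector, which is why the per-call cost of the popcount loop matters.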
the "simple" avx512 version is here which is identical in speed to the version currently in somalier
the unrolled version is here. this gives ~10-15% speedup.
this problem is embarrassingly parallel and it's currently single-threaded, so I could also improve speed that way, but I was interested to explore the simd stuff.
I'd be interested to hear your thoughts; maybe the "parallelization" scheme should be changed to load 8 samples at once instead of the current 8 (*64) sites at once.
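One way to picture the "8 samples at once" alternative, as a scalar sketch under assumptions: the same word of 8 samples is stored side by side (one lane per sample), and one query sample is compared against all 8 lanes per step, with a separate total per lane. The names `Lanes`, `interleave` and `shared_bits_8` are hypothetical, and a single AND-plus-popcount stands in for the real per-pair bit operations.

```cpp
#include <array>
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kLanes = 8;  // 8 x 64-bit lanes = one 512-bit register
using Lanes = std::array<std::uint64_t, kLanes>;

// Portable popcount of one 64-bit word.
inline std::uint64_t popcount64(std::uint64_t x) {
    return std::bitset<64>(x).count();
}

// Transpose 8 samples so that block[k][j] is word k of sample j.
std::vector<Lanes> interleave(const std::vector<std::vector<std::uint64_t>>& samples) {
    assert(samples.size() == kLanes);
    const std::size_t len = samples[0].size();
    std::vector<Lanes> block(len);
    for (std::size_t j = 0; j < kLanes; ++j) {
        assert(samples[j].size() == len);
        for (std::size_t k = 0; k < len; ++k) block[k][j] = samples[j][k];
    }
    return block;
}

// Compare one query sample against 8 samples at once.
// totals[j] counts the bits shared between the query and sample j.
Lanes shared_bits_8(const std::vector<std::uint64_t>& query,
                    const std::vector<Lanes>& block) {
    assert(query.size() == block.size());
    Lanes totals{};
    for (std::size_t k = 0; k < query.size(); ++k) {
        // In SIMD terms: broadcast query[k], AND with one register, per-lane popcount.
        for (std::size_t j = 0; j < kLanes; ++j) {
            totals[j] += popcount64(query[k] & block[k][j]);
        }
    }
    return totals;
}
```

With this layout the per-lane totals never need a horizontal reduction, at the price of transposing the data once up front; whether that beats the sites-at-once scheme would have to be measured.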