Consolidate SSE optimized shuffle with Blosc. #11
To emphasize this: bitunshuffle uses an extra temporary buffer between the byte-transpose and the bitwise transpose. I believe this drastically reduces performance as the buffer size increases. I did an alternative implementation which uses a limited number of vectors (8, 16, 32 or 64) for temporary storage. This makes best use of the L1 cache instead of going to the L3 cache, which is usually shared between the cores.
Keep in mind that the whole operation is blocked in small blocks of 8k bytes, which does fit in cache.
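For context, the byte-transpose stage being discussed can be sketched in numpy. This is an illustrative, non-SIMD version (the function name and signature are mine, not the library's) that processes the buffer in independent blocks so each block can stay in L1 cache:

```
import numpy as np

def byte_transpose_blocked(data, elem_size=4, block_size=8192):
    """Illustrative byte-transpose (first stage of bitshuffle).

    Each block of `block_size` bytes is transposed independently,
    gathering byte k of every element into a contiguous run.
    Not the library's SIMD code, just a numpy sketch of the idea.
    """
    buf = np.frombuffer(data, dtype=np.uint8).copy()
    out = np.empty_like(buf)
    for start in range(0, buf.size, block_size):
        block = buf[start:start + block_size]
        n = block.size // elem_size
        # View as (n_elements, elem_size) and transpose, so byte k of
        # every element in the block becomes contiguous in the output.
        out[start:start + n * elem_size] = (
            block[:n * elem_size].reshape(n, elem_size).T.ravel())
        # Any tail bytes that do not form a full element are copied as-is.
        out[start + n * elem_size:start + block.size] = block[n * elem_size:]
    return out
```

Because each block is transposed on its own, the working set per iteration is `block_size` bytes regardless of the total buffer size, which is the point being debated above.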
On Fri, 08 Nov 2019 05:55:13 -0800, Kiyoshi Masui wrote:
Keep in mind that the whole operation is blocked in small blocks of 8k bytes, which does fit in cache.
I understand that this 8 kB limit is indeed imposed by the L1 cache size (~32 kB/3).
My benchmarks may be biased, but apparently with a larger chunk size
(~L3/3) performance reaches a plateau without the drop observed in some cases:
https://user-images.githubusercontent.com/1018880/68398473-15658e80-0175-11ea-9225-257448de9104.png
https://www.hdfgroup.org/wp-content/uploads/2019/09/2019-09-20-Power9HDF5.pdf slide 18.
I have the feeling the SSE2 and AVX implementations could also benefit from this approach.
Bitshuffling may be "negligible" compared to the cost of lz4 on x86
hardware, but the Power9 has hardware compression for gzip, and
shuffling bits is then clearly the limiting factor.
Looking at your slides, are you testing the version of bitshuffle that ships with blosc? Can you post some test code?
Looking at your slides, are you testing the version of bitshuffle that ships with blosc?
yes, blosc2 beta to be precise.
Can you post some test code?
For the blosc code, I have to manually expose the different functions
I want to test and call them with ctypes. For your package, that is
not needed as they are directly exposed in Python via Cython. In both
cases, the overhead due to Python remains unknown; I only know it is
much higher on ppc64le than on x86 (probably 2x).
I had a look at the two SSE2 implementations of bitunshuffle and, to me,
they differ only in a malloc for the temporary buffer.
Here is the benchmarking code I used, taking advantage of the Jupyter
kernel's %timeit magic:
```
# Requires IPython/Jupyter for the %timeit magic.
import numpy
import bitshuffle

def benchmark(mini=10, maxi=25, dtype="uint8"):
    res = {}
    for i in range(mini, maxi):
        size = 1 << i
        data = numpy.random.randint(0, 255, size=size).astype(dtype)
        # The second argument sets the block size for the transform.
        bs = %timeit -o bitshuffle.bitshuffle(data, data.nbytes)
        bu = %timeit -o bitshuffle.bitunshuffle(data, data.nbytes)
        # Throughput in GB/s, based on the best run.
        res[size] = (data.nbytes/bs.best/1e9, data.nbytes/bu.best/1e9)
    return res
```
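For anyone wanting to reproduce this outside a Jupyter kernel, an equivalent harness can be written with the standard timeit module. This is a sketch (the function name and the idea of passing the shuffle/unshuffle callables as parameters are mine, not from the thread), so it does not hard-code either library's API:

```
import timeit
import numpy

def benchmark_timeit(shuffle, unshuffle, mini=10, maxi=25,
                     dtype="uint8", repeat=3):
    """Plain-Python variant of the %timeit benchmark.

    `shuffle` and `unshuffle` are callables taking the data array,
    e.g. lambda d: bitshuffle.bitshuffle(d, d.nbytes).
    """
    res = {}
    for i in range(mini, maxi):
        size = 1 << i
        data = numpy.random.randint(0, 255, size=size).astype(dtype)
        bs = min(timeit.repeat(lambda: shuffle(data),
                               number=1, repeat=repeat))
        bu = min(timeit.repeat(lambda: unshuffle(data),
                               number=1, repeat=repeat))
        # Throughput in GB/s, computed from the best of `repeat` runs.
        res[size] = (data.nbytes / bs / 1e9, data.nbytes / bu / 1e9)
    return res
```

Taking the minimum over several runs, as %timeit's `.best` does, filters out scheduler noise and one-off cache misses.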
Both Bitshuffle and Blosc/c-blosc@b37ca0b implement optimized (SSE2) versions of shuffle for 16-, 32-, and 64-bit element sizes. In Bitshuffle these routines are bshuf_trans_byte_elem_*. The operation counts for the two implementations appear to be the same, but we should check which versions are fastest and consolidate them. Bitshuffle also has optimized code for the case where the element size is a multiple of 32 or 64 bits, which is useful for compound data types and could benefit Blosc.
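As a reference point for what these routines compute (a plain numpy sketch, not the SSE2 code itself; the function names are mine), the element-size shuffle and its inverse amount to a byte matrix transpose:

```
import numpy as np

def shuffle_ref(data, elem_size):
    """Gather byte k of every element into a contiguous plane,
    the operation that bshuf_trans_byte_elem_* vectorizes."""
    buf = np.frombuffer(data, dtype=np.uint8)
    return buf.reshape(-1, elem_size).T.copy().ravel()

def unshuffle_ref(data, elem_size):
    """Inverse: scatter the byte planes back into interleaved elements."""
    buf = np.frombuffer(data, dtype=np.uint8)
    return buf.reshape(elem_size, -1).T.copy().ravel()
```

Round-tripping any buffer whose length is a multiple of elem_size returns the original bytes, which makes it straightforward to cross-check either library's SSE2 routines against this reference.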