Consolidate SSE optimized shuffle with Blosc. #11
To emphasize this: bitunshuffle uses an extra temporary buffer between the byte-transpose and the bitwise transpose. I believe this drastically reduces performance as the buffer size increases. I did an alternative implementation which uses a limited number of vectors (8, 16, 32 or 64) for temporary storage. This makes best use of the L1 cache instead of going to the L3 cache, which is usually shared between the cores.
Keep in mind that the whole operation is blocked in small blocks of 8k bytes, which does fit in cache.
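For context, the byte-transpose stage being discussed can be sketched in numpy. This is an illustrative, non-SIMD version (the function name and signature are mine, not the library's) that processes the buffer in independent blocks so each block can stay in L1 cache:

```
import numpy as np

def byte_transpose_blocked(data, elem_size=4, block_size=8192):
    """Illustrative byte-transpose (first stage of bitshuffle).

    Each block of `block_size` bytes is transposed independently,
    gathering byte k of every element into a contiguous run.
    Not the library's SIMD code, just a numpy sketch of the idea.
    """
    buf = np.frombuffer(data, dtype=np.uint8).copy()
    out = np.empty_like(buf)
    for start in range(0, buf.size, block_size):
        block = buf[start:start + block_size]
        n = block.size // elem_size
        # View as (n_elements, elem_size) and transpose, so byte k of
        # every element in the block becomes contiguous in the output.
        out[start:start + n * elem_size] = (
            block[:n * elem_size].reshape(n, elem_size).T.ravel())
        # Any tail bytes that do not form a full element are copied as-is.
        out[start + n * elem_size:start + block.size] = block[n * elem_size:]
    return out
```

Because each block is transposed on its own, the working set per iteration is `block_size` bytes regardless of the total buffer size, which is the point being debated above.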
On Fri, 08 Nov 2019 05:55:13 -0800, Kiyoshi Masui wrote:
Keep in mind that the whole operation is blocked in small blocks of 8k bytes, which does fit in cache.
I understand that this 8 kB limit is indeed imposed by the L1 cache size (~32 kB/3).
My benchmarks may be biased, but apparently with a larger chunk size
(~L3/3) performance reaches a plateau without the drop observed in some cases:
https://user-images.githubusercontent.com/1018880/68398473-15658e80-0175-11ea-9225-257448de9104.png
https://www.hdfgroup.org/wp-content/uploads/2019/09/2019-09-20-Power9HDF5.pdf slide 18.
I have the feeling the SSE2 and AVX implementations could also benefit from this approach.
Bitshuffling may be "negligible" compared to the cost of lz4 on x86
hardware, but the Power9 has hardware compression for gzip, and
shuffling bits is then clearly the limiting factor.
Looking at your slides, are you testing the version of bitshuffle that ships with blosc? Can you post some test code?
Looking at your slides, are you testing the version of bitshuffle that ships with blosc?
yes, blosc2 beta to be precise.
Can you post some test code?
For the blosc code, I have to manually expose the different functions
I want to test and call them with ctypes. For your package, that is
not needed as they are directly exposed in Python via Cython. In both
cases, the overhead due to Python remains unknown; I only know it is
much higher on ppc64le than on x86 (probably 2x).
I had a look at the two SSE2 implementations of bitunshuffle and, to me,
they differ only in a malloc for the temporary buffer.
Here is the benchmarking code I used, taking advantage of the Jupyter
kernel's %timeit magic:
```
# Requires IPython/Jupyter for the %timeit magic.
import numpy
import bitshuffle

def benchmark(mini=10, maxi=25, dtype="uint8"):
    res = {}
    for i in range(mini, maxi):
        size = 1 << i
        data = numpy.random.randint(0, 255, size=size).astype(dtype)
        # The second argument sets the block size for the transform.
        bs = %timeit -o bitshuffle.bitshuffle(data, data.nbytes)
        bu = %timeit -o bitshuffle.bitunshuffle(data, data.nbytes)
        # Throughput in GB/s, based on the best run.
        res[size] = (data.nbytes/bs.best/1e9, data.nbytes/bu.best/1e9)
    return res
```
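For anyone wanting to reproduce this outside a Jupyter kernel, an equivalent harness can be written with the standard timeit module. This is a sketch (the function name and the idea of passing the shuffle/unshuffle callables as parameters are mine, not from the thread), so it does not hard-code either library's API:

```
import timeit
import numpy

def benchmark_timeit(shuffle, unshuffle, mini=10, maxi=25,
                     dtype="uint8", repeat=3):
    """Plain-Python variant of the %timeit benchmark.

    `shuffle` and `unshuffle` are callables taking the data array,
    e.g. lambda d: bitshuffle.bitshuffle(d, d.nbytes).
    """
    res = {}
    for i in range(mini, maxi):
        size = 1 << i
        data = numpy.random.randint(0, 255, size=size).astype(dtype)
        bs = min(timeit.repeat(lambda: shuffle(data),
                               number=1, repeat=repeat))
        bu = min(timeit.repeat(lambda: unshuffle(data),
                               number=1, repeat=repeat))
        # Throughput in GB/s, computed from the best of `repeat` runs.
        res[size] = (data.nbytes / bs / 1e9, data.nbytes / bu / 1e9)
    return res
```

Taking the minimum over several runs, as %timeit's `.best` does, filters out scheduler noise and one-off cache misses.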
Both Bitshuffle and Blosc/c-blosc@b37ca0b implement optimized (SSE2) versions of shuffle for 16-, 32-, and 64-bit element sizes. In Bitshuffle these routines are bshuf_trans_byte_elem_*. The operation counts for the two implementations appear to be the same, but we should check which versions are fastest and consolidate them. Bitshuffle also has optimized code for the case where the element size is a multiple of 32 or 64 bits, which is useful for compound data types and could benefit Blosc.
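As a reference point for what these routines compute (a plain numpy sketch, not the SSE2 code itself; the function names are mine), the element-size shuffle and its inverse amount to a byte matrix transpose:

```
import numpy as np

def shuffle_ref(data, elem_size):
    """Gather byte k of every element into a contiguous plane,
    the operation that bshuf_trans_byte_elem_* vectorizes."""
    buf = np.frombuffer(data, dtype=np.uint8)
    return buf.reshape(-1, elem_size).T.copy().ravel()

def unshuffle_ref(data, elem_size):
    """Inverse: scatter the byte planes back into interleaved elements."""
    buf = np.frombuffer(data, dtype=np.uint8)
    return buf.reshape(elem_size, -1).T.copy().ravel()
```

Round-tripping any buffer whose length is a multiple of elem_size returns the original bytes, which makes it straightforward to cross-check either library's SSE2 routines against this reference.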