Now with AVX-512 compatibility
lemire committed Aug 25, 2016
1 parent 88eddb0 commit e685bff
Showing 3 changed files with 759 additions and 541 deletions.
236 changes: 227 additions & 9 deletions README.md
@@ -1,9 +1,5 @@
# dictionary
Experiments with dictionary coding

Status: at this point, this is just a technology demo to see what might be possible.
This repository might evolve, later, into something that's useful. Ideas, contributions,
criticism, collaboration are invited. (Please don't use this code in production.)
Experiments with high-performance dictionary coding

Suppose you want to compress a large array of values with
(relatively) few distinct values. For example, maybe you have 16 distinct 64-bit
@@ -50,10 +46,6 @@ working directly over the compressed data would be ideal.
If you must decode gigabytes of data to RAM or to disk, then you should expect
to be wasting enormous quantities of CPU cycles.

## Credit

Builds on work done by Eric Daniel for ``parquet-cpp``.

## Usage

```bash
@@ -69,6 +61,10 @@ dictionaries, the AVX2 gather approach is still remarkably faster. See results b
Intel architectures to be less impressive because the ``vpgather`` instruction that we use was
quite slow in its early incarnations.

The large-dictionary case as implemented here is somewhat pessimistic, as it assumes
that all values are equally likely. In most instances, a dictionary will have some frequent
values that are repeated more often, which reduces the number of cache misses.
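
To illustrate why a skewed distribution helps, here is a minimal Python sketch. It is not part of the benchmark: the Zipf-like weights, the sample size and the 4096-entry "hot" region (32 kiB of 64-bit words) are all assumptions chosen for illustration, and counting lookups that fall in the hot region is only a crude proxy for cache behaviour, not a cache model.

```python
import random

def hot_fraction(indexes, hot_entries):
    """Fraction of lookups that land in the first `hot_entries` dictionary slots."""
    return sum(1 for i in indexes if i < hot_entries) / len(indexes)

rng = random.Random(1234)      # fixed seed so the run is repeatable
dict_size = 1 << 20            # matches the largest dictionary in the benchmark
n = 100_000                    # number of decoded values to simulate (illustrative)
hot_entries = 4096             # 4096 entries * 8 bytes = 32 kiB of 64-bit words

# Uniform case: every dictionary entry is equally likely (what the benchmark assumes).
uniform = [rng.randrange(dict_size) for _ in range(n)]

# Hypothetical skewed case: entry r is drawn with weight 1/(r+1) (Zipf-like).
weights = [1.0 / (r + 1) for r in range(dict_size)]
skewed = rng.choices(range(dict_size), weights=weights, k=n)

print("uniform:", hot_fraction(uniform, hot_entries))
print("skewed: ", hot_fraction(skewed, hot_entries))
```

With uniform indexes, well under 1% of lookups touch the hot region; with these assumed weights, more than half do, so most lookups would be served from a small part of the dictionary.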

```bash
$ ./decodebenchmark
For this benchmark, use a recent (Skylake) Intel processor for best results.
@@ -235,9 +231,231 @@ Actual dict size: 1048235
AVXdecodetocache(&t,newbuf,bufsize): 8.07 cycles per decoded value
```

## Experimental results (Knights Landing, August 24th 2016)

We find that an AVX-512 dictionary decoder can be more than twice as fast as an AVX dictionary
decoder, which is in turn about twice as fast as a scalar decoder,
on a recent Intel processor (Knights Landing) for modest dictionary sizes.
The large-dictionary case as implemented here is somewhat pessimistic, as it assumes
that all values are equally likely.
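
For readers unfamiliar with what these decoders compute, the following is a minimal scalar sketch of dictionary coding in Python. It is not the repository's format or code: the `encode` and `decode` helpers and the use of a single Python integer as the bit stream are assumptions made for illustration. Each value is replaced by a fixed-width index into a dictionary of distinct values, and decoding extracts each index and looks it up.

```python
def encode(values):
    """Return (dictionary, bit_width, packed) for a list of values."""
    dictionary = sorted(set(values))
    index_of = {v: i for i, v in enumerate(dictionary)}
    # Bits needed to represent the largest index; 0 when there is a single value.
    bit_width = (len(dictionary) - 1).bit_length()
    packed = 0
    for pos, v in enumerate(values):
        packed |= index_of[v] << (pos * bit_width)
    return dictionary, bit_width, packed

def decode(dictionary, bit_width, packed, count):
    """Extract `count` fixed-width indexes and look each one up in the dictionary."""
    mask = (1 << bit_width) - 1
    return [dictionary[(packed >> (pos * bit_width)) & mask] for pos in range(count)]

values = [10**12, 7, 7, 10**12, 42, 7]
dictionary, bit_width, packed = encode(values)
decoded = decode(dictionary, bit_width, packed, len(values))
print(bit_width, decoded == values)
```

Here three distinct values need 2 bits per index instead of 64 bits per value. The vectorized decoders do the same lookup for several indexes at once.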


```bash
$ ./decodebenchmark
For this benchmark, use a recent (Skylake) Intel processor for best results.
Intel processor: UNKNOWN compiler version: 5.3.0 AVX2 is available.
Using array sizes of 8388608 values or 65536 kiB.
testing with dictionary of size 2
Actual dict size: 2
scalarcodec.uncompress(t,newbuf): 7.75 cycles per decoded value
decodetocache(&sc, &t,newbuf,bufsize): 7.39 cycles per decoded value
avxcodec.uncompress(t,newbuf): 6.26 cycles per decoded value
AVXDictCODEC::fastuncompress(t,newbuf): 3.22 cycles per decoded value
AVXdecodetocache(&t,newbuf,bufsize): 3.06 cycles per decoded value
AVX512DictCODEC::fastuncompress(t,newbuf): 1.48 cycles per decoded value
AVX512decodetocache(&t,newbuf,bufsize): 1.14 cycles per decoded value

testing with dictionary of size 4
Actual dict size: 4
scalarcodec.uncompress(t,newbuf): 7.83 cycles per decoded value
decodetocache(&sc, &t,newbuf,bufsize): 7.49 cycles per decoded value
avxcodec.uncompress(t,newbuf): 6.35 cycles per decoded value
AVXDictCODEC::fastuncompress(t,newbuf): 3.23 cycles per decoded value
AVXdecodetocache(&t,newbuf,bufsize): 3.10 cycles per decoded value
AVX512DictCODEC::fastuncompress(t,newbuf): 1.49 cycles per decoded value
AVX512decodetocache(&t,newbuf,bufsize): 1.21 cycles per decoded value

testing with dictionary of size 8
Actual dict size: 8
scalarcodec.uncompress(t,newbuf): 7.27 cycles per decoded value
decodetocache(&sc, &t,newbuf,bufsize): 6.99 cycles per decoded value
avxcodec.uncompress(t,newbuf): 6.17 cycles per decoded value
AVXDictCODEC::fastuncompress(t,newbuf): 3.23 cycles per decoded value
AVXdecodetocache(&t,newbuf,bufsize): 3.10 cycles per decoded value
AVX512DictCODEC::fastuncompress(t,newbuf): 1.59 cycles per decoded value
AVX512decodetocache(&t,newbuf,bufsize): 1.25 cycles per decoded value

testing with dictionary of size 16
Actual dict size: 16
scalarcodec.uncompress(t,newbuf): 7.98 cycles per decoded value
decodetocache(&sc, &t,newbuf,bufsize): 7.65 cycles per decoded value
avxcodec.uncompress(t,newbuf): 6.32 cycles per decoded value
AVXDictCODEC::fastuncompress(t,newbuf): 3.23 cycles per decoded value
AVXdecodetocache(&t,newbuf,bufsize): 3.16 cycles per decoded value
AVX512DictCODEC::fastuncompress(t,newbuf): 1.68 cycles per decoded value
AVX512decodetocache(&t,newbuf,bufsize): 1.34 cycles per decoded value

testing with dictionary of size 32
Actual dict size: 32
scalarcodec.uncompress(t,newbuf): 7.92 cycles per decoded value
decodetocache(&sc, &t,newbuf,bufsize): 7.63 cycles per decoded value
avxcodec.uncompress(t,newbuf): 6.27 cycles per decoded value
AVXDictCODEC::fastuncompress(t,newbuf): 3.23 cycles per decoded value
AVXdecodetocache(&t,newbuf,bufsize): 3.19 cycles per decoded value
AVX512DictCODEC::fastuncompress(t,newbuf): 1.65 cycles per decoded value
AVX512decodetocache(&t,newbuf,bufsize): 1.43 cycles per decoded value

testing with dictionary of size 64
Actual dict size: 64
scalarcodec.uncompress(t,newbuf): 8.05 cycles per decoded value
decodetocache(&sc, &t,newbuf,bufsize): 7.76 cycles per decoded value
avxcodec.uncompress(t,newbuf): 6.32 cycles per decoded value
AVXDictCODEC::fastuncompress(t,newbuf): 3.31 cycles per decoded value
AVXdecodetocache(&t,newbuf,bufsize): 3.25 cycles per decoded value
AVX512DictCODEC::fastuncompress(t,newbuf): 1.85 cycles per decoded value
AVX512decodetocache(&t,newbuf,bufsize): 1.66 cycles per decoded value

testing with dictionary of size 128
Actual dict size: 128
scalarcodec.uncompress(t,newbuf): 6.64 cycles per decoded value
decodetocache(&sc, &t,newbuf,bufsize): 6.36 cycles per decoded value
avxcodec.uncompress(t,newbuf): 6.19 cycles per decoded value
AVXDictCODEC::fastuncompress(t,newbuf): 3.34 cycles per decoded value
AVXdecodetocache(&t,newbuf,bufsize): 3.28 cycles per decoded value
AVX512DictCODEC::fastuncompress(t,newbuf): 1.83 cycles per decoded value
AVX512decodetocache(&t,newbuf,bufsize): 1.57 cycles per decoded value

testing with dictionary of size 256
Actual dict size: 256
scalarcodec.uncompress(t,newbuf): 8.07 cycles per decoded value
decodetocache(&sc, &t,newbuf,bufsize): 7.87 cycles per decoded value
avxcodec.uncompress(t,newbuf): 6.39 cycles per decoded value
AVXDictCODEC::fastuncompress(t,newbuf): 3.39 cycles per decoded value
AVXdecodetocache(&t,newbuf,bufsize): 3.35 cycles per decoded value
AVX512DictCODEC::fastuncompress(t,newbuf): 1.95 cycles per decoded value
AVX512decodetocache(&t,newbuf,bufsize): 1.69 cycles per decoded value

testing with dictionary of size 512
Actual dict size: 512
scalarcodec.uncompress(t,newbuf): 8.07 cycles per decoded value
decodetocache(&sc, &t,newbuf,bufsize): 7.87 cycles per decoded value
avxcodec.uncompress(t,newbuf): 6.32 cycles per decoded value
AVXDictCODEC::fastuncompress(t,newbuf): 3.52 cycles per decoded value
AVXdecodetocache(&t,newbuf,bufsize): 3.48 cycles per decoded value
AVX512DictCODEC::fastuncompress(t,newbuf): 2.04 cycles per decoded value
AVX512decodetocache(&t,newbuf,bufsize): 1.76 cycles per decoded value

testing with dictionary of size 1024
Actual dict size: 1024
scalarcodec.uncompress(t,newbuf): 8.22 cycles per decoded value
decodetocache(&sc, &t,newbuf,bufsize): 7.97 cycles per decoded value
avxcodec.uncompress(t,newbuf): 6.43 cycles per decoded value
AVXDictCODEC::fastuncompress(t,newbuf): 3.63 cycles per decoded value
AVXdecodetocache(&t,newbuf,bufsize): 3.57 cycles per decoded value
AVX512DictCODEC::fastuncompress(t,newbuf): 2.05 cycles per decoded value
AVX512decodetocache(&t,newbuf,bufsize): 1.83 cycles per decoded value

testing with dictionary of size 2048
Actual dict size: 2048
scalarcodec.uncompress(t,newbuf): 7.97 cycles per decoded value
decodetocache(&sc, &t,newbuf,bufsize): 7.69 cycles per decoded value
avxcodec.uncompress(t,newbuf): 6.37 cycles per decoded value
AVXDictCODEC::fastuncompress(t,newbuf): 3.76 cycles per decoded value
AVXdecodetocache(&t,newbuf,bufsize): 3.64 cycles per decoded value
AVX512DictCODEC::fastuncompress(t,newbuf): 2.11 cycles per decoded value
AVX512decodetocache(&t,newbuf,bufsize): 1.91 cycles per decoded value

testing with dictionary of size 4096
Actual dict size: 4096
scalarcodec.uncompress(t,newbuf): 8.53 cycles per decoded value
decodetocache(&sc, &t,newbuf,bufsize): 8.20 cycles per decoded value
avxcodec.uncompress(t,newbuf): 6.67 cycles per decoded value
AVXDictCODEC::fastuncompress(t,newbuf): 3.58 cycles per decoded value
AVXdecodetocache(&t,newbuf,bufsize): 3.56 cycles per decoded value
AVX512DictCODEC::fastuncompress(t,newbuf): 2.55 cycles per decoded value
AVX512decodetocache(&t,newbuf,bufsize): 2.35 cycles per decoded value

testing with dictionary of size 8192
Actual dict size: 8192
scalarcodec.uncompress(t,newbuf): 8.66 cycles per decoded value
decodetocache(&sc, &t,newbuf,bufsize): 8.27 cycles per decoded value
avxcodec.uncompress(t,newbuf): 6.79 cycles per decoded value
AVXDictCODEC::fastuncompress(t,newbuf): 3.92 cycles per decoded value
AVXdecodetocache(&t,newbuf,bufsize): 3.86 cycles per decoded value
AVX512DictCODEC::fastuncompress(t,newbuf): 2.80 cycles per decoded value
AVX512decodetocache(&t,newbuf,bufsize): 2.54 cycles per decoded value

testing with dictionary of size 16384
Actual dict size: 16384
scalarcodec.uncompress(t,newbuf): 8.85 cycles per decoded value
decodetocache(&sc, &t,newbuf,bufsize): 8.55 cycles per decoded value
avxcodec.uncompress(t,newbuf): 6.95 cycles per decoded value
AVXDictCODEC::fastuncompress(t,newbuf): 4.05 cycles per decoded value
AVXdecodetocache(&t,newbuf,bufsize): 3.87 cycles per decoded value
AVX512DictCODEC::fastuncompress(t,newbuf): 3.14 cycles per decoded value
AVX512decodetocache(&t,newbuf,bufsize): 2.96 cycles per decoded value

testing with dictionary of size 32768
Actual dict size: 32768
scalarcodec.uncompress(t,newbuf): 6.75 cycles per decoded value
decodetocache(&sc, &t,newbuf,bufsize): 6.81 cycles per decoded value
avxcodec.uncompress(t,newbuf): 6.94 cycles per decoded value
AVXDictCODEC::fastuncompress(t,newbuf): 3.68 cycles per decoded value
AVXdecodetocache(&t,newbuf,bufsize): 3.58 cycles per decoded value
AVX512DictCODEC::fastuncompress(t,newbuf): 3.41 cycles per decoded value
AVX512decodetocache(&t,newbuf,bufsize): 3.24 cycles per decoded value

testing with dictionary of size 65536
Actual dict size: 65536
scalarcodec.uncompress(t,newbuf): 11.75 cycles per decoded value
decodetocache(&sc, &t,newbuf,bufsize): 13.76 cycles per decoded value
avxcodec.uncompress(t,newbuf): 9.64 cycles per decoded value
AVXDictCODEC::fastuncompress(t,newbuf): 5.29 cycles per decoded value
AVXdecodetocache(&t,newbuf,bufsize): 5.50 cycles per decoded value
AVX512DictCODEC::fastuncompress(t,newbuf): 4.54 cycles per decoded value
AVX512decodetocache(&t,newbuf,bufsize): 4.66 cycles per decoded value

testing with dictionary of size 131072
Actual dict size: 131072
scalarcodec.uncompress(t,newbuf): 19.07 cycles per decoded value
decodetocache(&sc, &t,newbuf,bufsize): 19.53 cycles per decoded value
avxcodec.uncompress(t,newbuf): 17.02 cycles per decoded value
AVXDictCODEC::fastuncompress(t,newbuf): 11.02 cycles per decoded value
AVXdecodetocache(&t,newbuf,bufsize): 11.01 cycles per decoded value
AVX512DictCODEC::fastuncompress(t,newbuf): 8.03 cycles per decoded value
AVX512decodetocache(&t,newbuf,bufsize): 8.01 cycles per decoded value

testing with dictionary of size 262144
Actual dict size: 262144
scalarcodec.uncompress(t,newbuf): 22.84 cycles per decoded value
decodetocache(&sc, &t,newbuf,bufsize): 23.12 cycles per decoded value
avxcodec.uncompress(t,newbuf): 20.63 cycles per decoded value
AVXDictCODEC::fastuncompress(t,newbuf): 16.57 cycles per decoded value
AVXdecodetocache(&t,newbuf,bufsize): 16.45 cycles per decoded value
AVX512DictCODEC::fastuncompress(t,newbuf): 13.68 cycles per decoded value
AVX512decodetocache(&t,newbuf,bufsize): 13.69 cycles per decoded value

testing with dictionary of size 524288
Actual dict size: 524288
scalarcodec.uncompress(t,newbuf): 22.34 cycles per decoded value
decodetocache(&sc, &t,newbuf,bufsize): 22.54 cycles per decoded value
avxcodec.uncompress(t,newbuf): 20.36 cycles per decoded value
AVXDictCODEC::fastuncompress(t,newbuf): 16.30 cycles per decoded value
AVXdecodetocache(&t,newbuf,bufsize): 16.34 cycles per decoded value
AVX512DictCODEC::fastuncompress(t,newbuf): 14.91 cycles per decoded value
AVX512decodetocache(&t,newbuf,bufsize): 14.94 cycles per decoded value

testing with dictionary of size 1048576
Actual dict size: 1048235
scalarcodec.uncompress(t,newbuf): 21.93 cycles per decoded value
decodetocache(&sc, &t,newbuf,bufsize): 22.11 cycles per decoded value
avxcodec.uncompress(t,newbuf): 19.91 cycles per decoded value
AVXDictCODEC::fastuncompress(t,newbuf): 16.33 cycles per decoded value
AVXdecodetocache(&t,newbuf,bufsize): 16.30 cycles per decoded value
AVX512DictCODEC::fastuncompress(t,newbuf): 15.32 cycles per decoded value
AVX512decodetocache(&t,newbuf,bufsize): 15.31 cycles per decoded value

```
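
As a rough check of the claim above, the speedups can be computed from three of the runs in this output. The snippet below is only arithmetic on numbers copied from the log (the `scalarcodec.uncompress`, `AVXDictCODEC::fastuncompress` and `AVX512DictCODEC::fastuncompress` rows); the dictionary `results` and the helper `speedups` are names introduced here for illustration.

```python
# Cycles per decoded value, copied from the benchmark output above.
results = {
    2:       {"scalar": 7.75,  "avx": 3.22,  "avx512": 1.48},
    1024:    {"scalar": 8.22,  "avx": 3.63,  "avx512": 2.05},
    1048576: {"scalar": 21.93, "avx": 16.33, "avx512": 15.32},
}

def speedups(row):
    """Return (AVX over scalar, AVX-512 over AVX) speedup factors."""
    return row["scalar"] / row["avx"], row["avx"] / row["avx512"]

for size, row in results.items():
    avx_gain, avx512_gain = speedups(row)
    print(f"dict size {size}: AVX {avx_gain:.2f}x over scalar, "
          f"AVX-512 {avx512_gain:.2f}x over AVX")
```

The ratios are above 2 for the smallest dictionary and shrink toward 1 for the largest one, where memory access dominates.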

## Limitations
- We do not have a realistic usage of the dictionary values (we use a uniform distribution).
- For simplicity, we assume that the dictionary is made of 64-bit words. This is hard-coded, but it is not a fundamental limitation: the code would be faster with smaller words.
- This code is not meant to be used in production. It is a demo.
- This code makes up its own convenient format. It is not meant to plug as-is into an existing framework.
- We assume that the arrays are large. If you have tiny arrays... well...
- We effectively measure steady-state throughput. So we ignore costs such as loading up the dictionary in CPU cache.

## Authors
Daniel Lemire and Eric Daniel (motivated by ``parquet-cpp``)


6 changes: 3 additions & 3 deletions scripts/avx512dict.py
@@ -36,7 +36,7 @@ def plurial(number):
print("static void avx512unpackdict0(const __m512i * compressed, const myint64 * dictionary, int64_t * pout) {");
print(" (void) compressed;");
print(" __m512i * out = (__m512i *) pout;");
print(" const __m512i uniquew = _mm512_set1_epi64x(dictionary[0]);");
print(" const __m512i uniquew = _mm512_set1_epi64(dictionary[0]);");
print(" for(int k = 0; k < {0}; k++) {{".format(howmany(0)/howmany64perwideword()));
print(" _mm512_storeu_si512(out + k, uniquew);")
print(" }");
@@ -58,12 +58,12 @@ def plurial(number):
maskstr = " _mm512_and_si512 ( mask, {0}) "
if (bit == 32) : maskstr = " {0} " # no need
oldword = 0
print(" w0 = _mm512_lddqu_si512 (compressed);")
print(" w0 = _mm512_loadu_si512 (compressed);")
for j in range(howmany(bit)/16):
firstword = j * bit / 32
secondword = (j * bit + bit - 1)/32
if(secondword > oldword):
print(" w{0} = _mm512_lddqu_si512 (compressed + {1});".format(secondword%2,secondword))
print(" w{0} = _mm512_loadu_si512 (compressed + {1});".format(secondword%2,secondword))
oldword = secondword
firstshift = (j*bit) % 32
firstshiftstr = "_mm512_srli_epi32( w{0} , "+str(firstshift)+") "