
PARQUET-1643 Use airlift codecs for LZ4, LZ0, GZIP #671

Open

samarthjain wants to merge 1 commit into master from airliftcodecs

Conversation

samarthjain

No description provided.

@RyanSkraba
Contributor

RyanSkraba commented Aug 29, 2019

Neat! I was just poking around the codecs code so this is really interesting and timely.

I'm currently looking at how to run the parquet-benchmarks project... I'll see if I can get a clean run on master and your branch for LZ4 and GZIP to compare. (It looks like LZO benchmarks are disabled on master.)

Edit: There are no LZ4 benchmarks currently in the parquet-benchmarks module, and it looks like the run scripts need a bit of clean-up and attention! In the meantime, I managed a single, not very clean run of the WriteBenchmarks.write1MRowsDefaultBlockAndPageSizeGZIP with and without the change. No improvement or regression noted!

@samarthjain
Author

samarthjain commented Aug 29, 2019

Benchmark results with the patch applied:

Benchmark                                                                Mode  Cnt   Score   Error  Units
ReadBenchmarks.read1MRowsBS256MPS4MUncompressed                         thrpt   25   0.947 ± 0.011  ops/s
ReadBenchmarks.read1MRowsBS256MPS8MUncompressed                         thrpt   25   0.952 ± 0.010  ops/s
ReadBenchmarks.read1MRowsBS512MPS4MUncompressed                         thrpt   25   0.938 ± 0.015  ops/s
ReadBenchmarks.read1MRowsBS512MPS8MUncompressed                         thrpt   25   0.960 ± 0.012  ops/s
ReadBenchmarks.read1MRowsDefaultBlockAndPageSizeGZIP                    thrpt   25   0.725 ± 0.007  ops/s
ReadBenchmarks.read1MRowsDefaultBlockAndPageSizeSNAPPY                  thrpt   25   0.902 ± 0.005  ops/s
ReadBenchmarks.read1MRowsDefaultBlockAndPageSizeUncompressed            thrpt   25   0.940 ± 0.010  ops/s
PageChecksumReadBenchmarks.read100KRowsGzipWithVerification                ss    5   0.502 ± 0.169   s/op
PageChecksumReadBenchmarks.read100KRowsGzipWithoutVerification             ss    5   0.562 ± 0.299   s/op
PageChecksumReadBenchmarks.read100KRowsSnappyWithVerification              ss    5   0.649 ± 0.975   s/op
PageChecksumReadBenchmarks.read100KRowsSnappyWithoutVerification           ss    5   0.519 ± 0.095   s/op
PageChecksumReadBenchmarks.read100KRowsUncompressedWithVerification        ss    5   0.531 ± 0.205   s/op
PageChecksumReadBenchmarks.read100KRowsUncompressedWithoutVerification     ss    5   0.495 ± 0.182   s/op
PageChecksumReadBenchmarks.read10MRowsGzipWithVerification                 ss    5  13.505 ± 2.291   s/op
PageChecksumReadBenchmarks.read10MRowsGzipWithoutVerification              ss    5  13.529 ± 2.485   s/op
PageChecksumReadBenchmarks.read10MRowsSnappyWithVerification               ss    5  10.781 ± 1.075   s/op
PageChecksumReadBenchmarks.read10MRowsSnappyWithoutVerification            ss    5  10.711 ± 1.377   s/op
PageChecksumReadBenchmarks.read10MRowsUncompressedWithVerification         ss    5  10.822 ± 0.898   s/op
PageChecksumReadBenchmarks.read10MRowsUncompressedWithoutVerification      ss    5  10.497 ± 0.961   s/op
PageChecksumReadBenchmarks.read1MRowsGzipWithVerification                  ss    5   1.946 ± 1.070   s/op
PageChecksumReadBenchmarks.read1MRowsGzipWithoutVerification               ss    5   1.778 ± 0.684   s/op
PageChecksumReadBenchmarks.read1MRowsSnappyWithVerification                ss    5   1.817 ± 1.941   s/op
PageChecksumReadBenchmarks.read1MRowsSnappyWithoutVerification             ss    5   1.851 ± 1.808   s/op
PageChecksumReadBenchmarks.read1MRowsUncompressedWithVerification          ss    5   1.570 ± 0.242   s/op
PageChecksumReadBenchmarks.read1MRowsUncompressedWithoutVerification       ss    5   1.766 ± 1.573   s/op

@samarthjain samarthjain reopened this Aug 29, 2019
@samarthjain
Author

samarthjain commented Aug 30, 2019

Benchmark results on master branch:

Benchmark                                                                Mode  Cnt   Score    Error  Units
ReadBenchmarks.read1MRowsBS256MPS4MUncompressed                         thrpt   25   0.952 ±  0.008  ops/s
ReadBenchmarks.read1MRowsBS256MPS8MUncompressed                         thrpt   25   0.947 ±  0.008  ops/s
ReadBenchmarks.read1MRowsBS512MPS4MUncompressed                         thrpt   25   0.957 ±  0.010  ops/s
ReadBenchmarks.read1MRowsBS512MPS8MUncompressed                         thrpt   25   0.956 ±  0.009  ops/s
ReadBenchmarks.read1MRowsDefaultBlockAndPageSizeGZIP                    thrpt   25   0.731 ±  0.007  ops/s
ReadBenchmarks.read1MRowsDefaultBlockAndPageSizeSNAPPY                  thrpt   25   0.897 ±  0.008  ops/s
ReadBenchmarks.read1MRowsDefaultBlockAndPageSizeUncompressed            thrpt   25   0.935 ±  0.013  ops/s
PageChecksumReadBenchmarks.read100KRowsGzipWithVerification                ss    5   0.525 ±  0.079   s/op
PageChecksumReadBenchmarks.read100KRowsGzipWithoutVerification             ss    5   0.483 ±  0.093   s/op
PageChecksumReadBenchmarks.read100KRowsSnappyWithVerification              ss    5   0.545 ±  0.408   s/op
PageChecksumReadBenchmarks.read100KRowsSnappyWithoutVerification           ss    5   0.517 ±  0.133   s/op
PageChecksumReadBenchmarks.read100KRowsUncompressedWithVerification        ss    5   0.501 ±  0.213   s/op
PageChecksumReadBenchmarks.read100KRowsUncompressedWithoutVerification     ss    5   0.506 ±  0.385   s/op
PageChecksumReadBenchmarks.read10MRowsGzipWithVerification                 ss    5  14.217 ± 10.173   s/op
PageChecksumReadBenchmarks.read10MRowsGzipWithoutVerification              ss    5  13.189 ±  1.396   s/op
PageChecksumReadBenchmarks.read10MRowsSnappyWithVerification               ss    5  11.369 ±  1.966   s/op
PageChecksumReadBenchmarks.read10MRowsSnappyWithoutVerification            ss    5  10.964 ±  3.167   s/op
PageChecksumReadBenchmarks.read10MRowsUncompressedWithVerification         ss    5  11.147 ±  2.056   s/op
PageChecksumReadBenchmarks.read10MRowsUncompressedWithoutVerification      ss    5  10.554 ±  1.415   s/op
PageChecksumReadBenchmarks.read1MRowsGzipWithVerification                  ss    5   1.745 ±  0.482   s/op
PageChecksumReadBenchmarks.read1MRowsGzipWithoutVerification               ss    5   1.788 ±  0.417   s/op
PageChecksumReadBenchmarks.read1MRowsSnappyWithVerification                ss    5   1.935 ±  1.977   s/op
PageChecksumReadBenchmarks.read1MRowsSnappyWithoutVerification             ss    5   1.505 ±  0.172   s/op
PageChecksumReadBenchmarks.read1MRowsUncompressedWithVerification          ss    5   1.790 ±  1.657   s/op
PageChecksumReadBenchmarks.read1MRowsUncompressedWithoutVerification       ss    5   1.751 ±  1.790   s/op

@samarthjain
Author

Benchmark Name                                                   Master  Airlift Codecs
ReadBenchmarks.read1MRowsBS256MPS4MUncompressed                   0.952   0.947
ReadBenchmarks.read1MRowsBS256MPS8MUncompressed                   0.947   0.952
ReadBenchmarks.read1MRowsBS512MPS4MUncompressed                   0.957   0.938
ReadBenchmarks.read1MRowsBS512MPS8MUncompressed                   0.956   0.960
ReadBenchmarks.read1MRowsDefaultBlockAndPageSizeGZIP              0.731   0.725
PageChecksumReadBenchmarks.read100KRowsGzipWithVerification       0.525   0.502
PageChecksumReadBenchmarks.read100KRowsGzipWithoutVerification    0.483   0.562
PageChecksumReadBenchmarks.read10MRowsGzipWithVerification       14.217  13.505
PageChecksumReadBenchmarks.read10MRowsGzipWithoutVerification    13.189  13.529
PageChecksumReadBenchmarks.read1MRowsGzipWithVerification         1.745   1.946
PageChecksumReadBenchmarks.read1MRowsGzipWithoutVerification      1.788   1.778

Pruned results for comparing GZIP perf. I don't see any significant speedup or regression.

Considering these compressors/decompressors don't use native resources, it would be cheap to create a compressor/decompressor for each page. This in turn allows pages to be read concurrently (including pre-fetching), removes the need to pool the de/compressor instances, and makes the overall code simpler.
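To illustrate the idea, a rough sketch (not the parquet-mr API — ConcurrentPageReader and the Supplier wiring are hypothetical; only BytesInput and BytesInputDecompressor are real Parquet types):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.function.Supplier;

import org.apache.parquet.bytes.BytesInput;
import org.apache.parquet.compression.CompressionCodecFactory.BytesInputDecompressor;

// Hypothetical sketch: pure-Java codecs hold no native state, so creating a
// fresh decompressor per page is cheap and no pooling or locking is needed.
class ConcurrentPageReader {
  static List<Future<BytesInput>> decompressAll(
      ExecutorService pool,
      List<BytesInput> compressedPages,
      List<Integer> uncompressedSizes,
      Supplier<BytesInputDecompressor> newDecompressor) {
    List<Future<BytesInput>> results = new ArrayList<>();
    for (int i = 0; i < compressedPages.size(); i++) {
      final BytesInput page = compressedPages.get(i);
      final int size = uncompressedSizes.get(i);
      // One throwaway decompressor per task: pages decompress in parallel.
      results.add(pool.submit(() -> newDecompressor.get().decompress(page, size)));
    }
    return results;
  }
}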

@samarthjain samarthjain changed the title PARQUET-1643 Use airlift codecs for LZ4, LZ0 and GZIP PARQUET-1643 Use airlift codecs for LZ4, LZ0, GZIP and SNAPPY Sep 2, 2019
@@ -68,7 +67,7 @@ public ParquetRecordWriter(
MessageType schema,
Map<String, String> extraMetaData,
int blockSize, int pageSize,
BytesCompressor compressor,
BytesInputCompressor compressor,
Contributor

Semantic versioning check failed on these removed constructors. Though BytesCompressor is marked as deprecated, I think you should still use it here instead of BytesInputCompressor, so that this PR can be released in a 1.x release.

Author

Ah! Thanks for pointing that out, @nandorKollar . Apparently, this and the other constructor taking BytesInputCompressor aren't used (at least I didn't find any references to them within the parquet project). I wonder if the semantic versioning check would be OK with getting rid of them. Going to try that.

Author

Turns out removing the constructors isn't allowed either, even though they aren't called anywhere in the code. Switching the type back to BytesCompressor did the trick.

@nandorKollar
Contributor

@samarthjain why did you remove Snappy support?

Contributor

@gszadovszky gszadovszky left a comment

Could you also check TestDirectCodecFactory to see whether we can run unit tests for LZO and LZ4?

@@ -27,7 +27,7 @@
import java.util.Map;
import java.util.Set;
import java.util.zip.CRC32;

import org.apache.parquet.bytes.ByteBufferAllocator;
Contributor

Please avoid rearranging the imports. It makes merges unnecessarily cumbersome. It is fine to remove unused imports, but those which are still used should not be rearranged.



import java.lang.reflect.Method;
import java.lang.reflect.InvocationTargetException;
import java.io.IOException;
Contributor

Same as before, revert rearranging.

import static java.lang.Math.max;
import static java.lang.Math.min;
import static org.apache.parquet.Preconditions.checkNotNull;

import java.io.IOException;
Contributor

Same as before.

@@ -68,7 +67,7 @@ public ParquetRecordWriter(
MessageType schema,
Map<String, String> extraMetaData,
int blockSize, int pageSize,
BytesCompressor compressor,
CodecFactory.BytesCompressor compressor,
Contributor

Please revert this change too, and add the relevant import as before.

@@ -107,7 +106,7 @@ public ParquetRecordWriter(
MessageType schema,
Map<String, String> extraMetaData,
long blockSize, int pageSize,
BytesCompressor compressor,
CodecFactory.BytesCompressor compressor,
Contributor

Same as the previous.

@Override
public BytesInput compress(BytesInput bytes) throws IOException {
compressedOutBuffer.reset();
CompressionOutputStream cos = hadoopCodec.createOutputStream(compressedOutBuffer, compressor);
Contributor

Please use try-with-resources here to close the stream even when an exception happens. Also, I don't think calling finish() is required here, since close() on the stream calls it as its first statement.
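A minimal sketch of the suggested shape, assuming the fields from the surrounding diff (hadoopCodec, compressor, and compressedOutBuffer as a ByteArrayOutputStream, as in CodecFactory):

@Override
public BytesInput compress(BytesInput bytes) throws IOException {
  compressedOutBuffer.reset();
  // try-with-resources closes the stream even if writeAllTo throws;
  // close() calls finish() first, so no explicit finish() is needed.
  try (CompressionOutputStream cos =
      hadoopCodec.createOutputStream(compressedOutBuffer, compressor)) {
    bytes.writeAllTo(cos);
  }
  return BytesInput.from(compressedOutBuffer);
}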

@samarthjain samarthjain changed the title PARQUET-1643 Use airlift codecs for LZ4, LZ0, GZIP and SNAPPY PARQUET-1643 Use airlift codecs for LZ4, LZ0, GZIP Oct 16, 2019
@samarthjain
Author

> @samarthjain why did you remove Snappy support?

@nandorKollar - it looks like Parquet has its own implementation for Snappy which, from what I can tell, doesn't depend on native code. Also, adding Snappy support via the airlift compressor was causing the Snappy tests to fail. So I dropped support for it and updated the PR title to reflect that.

@samarthjain
Author

@nandorKollar - I just pushed a commit to address changes you requested. Sorry for the delay. I had to punt working on this for various reasons.

@samarthjain samarthjain force-pushed the airliftcodecs branch 4 times, most recently from f5c76a6 to 4feb369 on February 26, 2020 19:50
@nandorKollar
Contributor

@samarthjain thanks for addressing my comments, and sorry for the late reply. I have two additional questions. I'm wondering if we might want to introduce a new configuration option to turn Airlift codecs on and off, so that in case something is wrong with Airlift, clients can still fall back to the original implementation. Not sure if it's worth the effort, @gszadovszky what do you think?

I also noticed that in other codecs we use org.apache.hadoop.io.compress.CodecPool; should we consider using it for Airlift compressors too? We could address this in a separate ticket though.
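For reference, the usual CodecPool pattern looks roughly like this (a sketch; whether pooling pays off for pure-Java Airlift compressors is exactly the open question):

import java.io.IOException;
import java.io.OutputStream;

import org.apache.hadoop.io.compress.CodecPool;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.Compressor;

// Borrow a compressor from the shared Hadoop pool and always return it.
void compressWithPool(CompressionCodec codec, OutputStream out) throws IOException {
  Compressor compressor = CodecPool.getCompressor(codec);
  try (OutputStream cos = codec.createOutputStream(out, compressor)) {
    // ... write page bytes to cos ...
  } finally {
    CodecPool.returnCompressor(compressor);
  }
}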

@gszadovszky
Contributor

Without reviewing this change, and without knowing too much about Airlift, I would say the configuration might make sense. Meanwhile, the main purpose of using a pure Java compression codec over the ones provided by Hadoop is to be independent from Hadoop. Even though our code currently relies heavily on Hadoop (the whole read/write is implemented in parquet-hadoop), the target is to make parquet-mr work without Hadoop and its dependencies. So, I would suggest introducing new features in a way that either does not depend on Hadoop or makes it easy to remove the Hadoop dependencies later.

@samarthjain
Author

samarthjain commented Apr 9, 2020

@nandorKollar - I am not exactly sure where to add this configuration, which I was thinking of naming parquet.airlift.compressors.enable.

We want both ParquetReadOptions (with the config defined in ParquetInputFormat ) and ParquetRecordWriter to be able to use the config for instantiating the correct (de)compressor. Does that mean we need separate compression related configs for read and write?

For compressor:
In ParquetRecordWriter here:
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetRecordWriter.java#L150

For decompressor:
In ParquetReadOptions here:
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/ParquetReadOptions.java#L302

so that the correct decompressor can be used by the ParquetFileReader over here:
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L1036
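As a sketch of how the flag could be consulted where the codec factory is chosen (the property name and AirliftCompressorCodecFactory are hypothetical; HadoopCodecs.newFactory is the existing default path), assuming conf is a Hadoop Configuration:

// Hypothetical sketch: gate Airlift codecs behind a single boolean property.
public static final String AIRLIFT_COMPRESSORS_ENABLED = "parquet.airlift.compressors.enable";

boolean useAirlift = conf.getBoolean(AIRLIFT_COMPRESSORS_ENABLED, false);
CompressionCodecFactory codecFactory = useAirlift
    ? new AirliftCompressorCodecFactory(conf)        // hypothetical Airlift-backed factory
    : HadoopCodecs.newFactory(conf, 0);              // existing Hadoop-backed factory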

@dbtsai
Member

dbtsai commented Apr 30, 2020

@samarthjain thanks for the work. I am looking to deploy zstd parquet into prod, but that requires a new Hadoop with native library support, which is not practical in many prod use-cases.

Since airlift is a pure Java implementation, what are the performance implications for zstd? I saw there is a benchmark for GZIP, but I don't see benchmarks for the other codecs.

Also, should we consider using zstd-jni, a Java library that packages native implementations of zstd for different platforms in a jar?

@samarthjain
Author

Force-pushed a new commit that makes it configurable whether to use Airlift-based compressors or not. Also added tests and GZIP benchmarks for the Airlift compressors. Benchmark results show no performance improvements or regressions when using Airlift GZIP vs plain GZIP.

Benchmark                                                                          Cnt     Score     Error
PageChecksumReadBenchmarks.read10MRowsAirliftGzipWithVerification                    3     6.431 ±    0.741
PageChecksumReadBenchmarks.read10MRowsAirliftGzipWithoutVerification                 3     6.605 ±    0.709
PageChecksumReadBenchmarks.read10MRowsGzipWithVerification                           3     6.468 ±    0.700
PageChecksumReadBenchmarks.read10MRowsGzipWithoutVerification                        3     6.583 ±    1.538

PageChecksumWriteBenchmarks.write10MRowsAirliftGzipWithChecksums                     3    36.333 ±    0.510
PageChecksumWriteBenchmarks.write10MRowsAirliftGzipWithoutChecksums                  3    36.069 ±    1.096
PageChecksumWriteBenchmarks.write10MRowsGzipWithChecksums                            3    36.141 ±    1.095
PageChecksumWriteBenchmarks.write10MRowsGzipWithoutChecksums                         3    36.174 ±    5.125


ReadBenchmarks.read1MRowsDefaultBlockAndPageSizeAirliftGZIP                          3     0.898 ±    1.254
ReadBenchmarks.read1MRowsDefaultBlockAndPageSizeGZIP                                 3     0.891 ±    1.201

@samarthjain
Author

@dbtsai

> Since airlift is a pure Java implementation, what are the performance implications for zstd? I saw there is a benchmark for GZIP, but I don't see benchmarks for the other codecs.

It looks like the zstd Airlift implementation doesn't implement the Hadoop APIs. It can be integrated within Parquet, but that will take some work, definitely worthy of another PR.
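For the curious, aircompressor's zstd can already be driven directly through its own interfaces; a self-contained sketch round-tripping a buffer with io.airlift.compress.zstd, outside the Hadoop codec API:

import java.nio.charset.StandardCharsets;

import io.airlift.compress.zstd.ZstdCompressor;
import io.airlift.compress.zstd.ZstdDecompressor;

byte[] input = "example page bytes".getBytes(StandardCharsets.UTF_8);

// Compress into a buffer sized by the codec's worst-case bound.
ZstdCompressor compressor = new ZstdCompressor();
byte[] compressed = new byte[compressor.maxCompressedLength(input.length)];
int compressedSize = compressor.compress(input, 0, input.length, compressed, 0, compressed.length);

// Decompress back into a buffer of the known uncompressed size.
ZstdDecompressor decompressor = new ZstdDecompressor();
byte[] restored = new byte[input.length];
decompressor.decompress(compressed, 0, compressedSize, restored, 0, restored.length);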

@samarthjain
Author

@nandorKollar, @rdblue, @danielcweeks - if you have cycles, could you please take a look at this PR?
