PARQUET-1643 Use airlift codecs for LZ4, LZO, GZIP #671
base: master
Conversation
Neat! I was just poking around the codecs code, so this is really interesting and timely. I'm currently looking at how to run the parquet-benchmarks project... I'll see if I can get a clean run on master and your branch for LZ4 and GZIP to compare. (It looks like LZO benchmarks are disabled on master.) Edit: There are no LZ4 benchmarks currently in the parquet-benchmarks module, and it looks like the run scripts need a bit of clean-up and attention! In the meantime, I managed a single, not very clean run of the […]
Benchmark results with the patch applied: […]

Benchmark results on master branch: […]
Pruned results for comparing GZIP perf: I don't see any significant speedup or regression. Since these compressors/decompressors don't use native resources, it would be cheap to create a compressor/decompressor for each page. That in turn would allow reading pages concurrently (including implementing pre-fetching), remove the need to pool the de/compressor instances, and make the overall code simpler.
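The per-page idea above can be sketched as follows. This is an illustrative sketch only, not code from this PR: the JDK's `Deflater` stands in for a codec (`Deflater` is itself zlib-backed, whereas the Airlift codecs are pure Java), and the point is simply that a cheap-to-create codec can be instantiated per page instead of pooled.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

public class PerPageCompressor {
    // Illustrative sketch: Deflater stands in for a pure-Java Airlift codec.
    // Creating a fresh instance per page avoids pooling and lets pages be
    // compressed or read concurrently without shared mutable state.
    static byte[] compressPage(byte[] page) {
        Deflater deflater = new Deflater();   // fresh instance per page, no pool
        deflater.setInput(page);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();                       // release resources immediately
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] page = new byte[8192];         // zero-filled page compresses well
        System.out.println(compressPage(page).length < page.length); // prints "true"
    }
}
```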
Force-pushed from 71f2215 to b9a7dbe
@@ -68,7 +67,7 @@ public ParquetRecordWriter(
       MessageType schema,
       Map<String, String> extraMetaData,
       int blockSize, int pageSize,
-      BytesCompressor compressor,
+      BytesInputCompressor compressor,
Semantic versioning check failed on these removed constructors. Though `BytesCompressor` is marked as deprecated, I think you should still use it here instead of `BytesInputCompressor`, so that this PR can be released in a 1.x release.
Ah! Thanks for pointing that out, @nandorKollar. Apparently this and the other constructor taking `BytesInputCompressor` aren't used (at least I didn't find any references to them within the parquet project). I wonder if the semantic versioning check would allow getting rid of them. Going to try that.
Turns out removing the constructors isn't allowed either, even though they are not called anywhere in the code. Switching the type back to `BytesCompressor` did the trick.
@samarthjain why did you remove Snappy support?
Could you also check TestDirectCodecFactory to see if we can run unit tests for LZO and LZ4?
@@ -27,7 +27,7 @@
 import java.util.Map;
 import java.util.Set;
 import java.util.zip.CRC32;

 import org.apache.parquet.bytes.ByteBufferAllocator;
Please avoid rearranging the imports; it makes merges unnecessarily cumbersome. It is fine to remove unused imports, but imports that are still used should not be rearranged.
import java.lang.reflect.Method;
import java.lang.reflect.InvocationTargetException;
import java.io.IOException;
Same as before, revert rearranging.
import static java.lang.Math.max;
import static java.lang.Math.min;
import static org.apache.parquet.Preconditions.checkNotNull;

import java.io.IOException;
Same as before.
@@ -68,7 +67,7 @@ public ParquetRecordWriter(
       MessageType schema,
       Map<String, String> extraMetaData,
       int blockSize, int pageSize,
-      BytesCompressor compressor,
+      CodecFactory.BytesCompressor compressor,
Please revert this change too, add the relevant import as before.
@@ -107,7 +106,7 @@ public ParquetRecordWriter(
       MessageType schema,
       Map<String, String> extraMetaData,
       long blockSize, int pageSize,
-      BytesCompressor compressor,
+      CodecFactory.BytesCompressor compressor,
Same as the previous.
  @Override
  public BytesInput compress(BytesInput bytes) throws IOException {
    compressedOutBuffer.reset();
    CompressionOutputStream cos = hadoopCodec.createOutputStream(compressedOutBuffer, compressor);
Please use try-with-resources here to close the stream even when an exception happens. Also, I don't think calling finish() is required here, since close() on the stream calls it as its first statement.
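For illustration, the suggested try-with-resources pattern looks like the following self-contained sketch. It uses the JDK's GZIP streams instead of the PR's hadoopCodec/BytesInput types, so it shows the pattern, not the actual patch:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class TryWithResourcesCompress {
    // try-with-resources closes the stream even if write() throws; close()
    // also finishes the compressed trailer, so no explicit finish() call
    // is needed.
    static byte[] compress(byte[] input) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(input);
        } // gzip is closed here on both the normal and the exceptional path
        return out.toByteArray();
    }

    static byte[] decompress(byte[] compressed) throws IOException {
        try (GZIPInputStream gzip =
                 new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return gzip.readAllBytes();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] roundTrip = decompress(compress("hello parquet".getBytes()));
        System.out.println(new String(roundTrip)); // prints "hello parquet"
    }
}
```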
@nandorKollar - it looks like Parquet has its own implementation for Snappy which, from what I can tell, doesn't depend on native code. Also, adding Snappy support via the Airlift compressor was causing the Snappy tests to fail, so I dropped support for it. I have updated the PR title to reflect the same.
@nandorKollar - I just pushed a commit to address the changes you requested. Sorry for the delay; I had to punt on working on this for various reasons.
Force-pushed from f5c76a6 to 4feb369
@samarthjain thanks for addressing my comments, and sorry for the late reply. I have two additional questions. I'm wondering if we might want to introduce a new configuration option to turn Airlift codecs on and off, so that if something is wrong with Airlift, clients can still fall back to the original implementation. Not sure if it's worth the effort, @gszadovszky what do you think? I also noticed that in other codecs we use […]
Without reviewing this change and without knowing too much about Airlift, I would say the configuration might make sense. Meanwhile, the main purpose of using a pure-Java compression codec over the ones provided by Hadoop is to be independent from Hadoop. While our code currently relies heavily on Hadoop (the whole read/write path is implemented in parquet-hadoop), the target is to make parquet-mr work without Hadoop and its dependencies. So, I would suggest introducing new features in a way that either does not depend on Hadoop or makes it easy to remove the Hadoop dependencies later.
@nandorKollar - I am not exactly sure where I can add this configuration, which I was thinking of naming […]. We want both […]. For compressor: […] For decompressor: […] so that the correct decompressor can be used by the […]
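As a hypothetical illustration of the fallback idea discussed above: the property name below is an assumption (the actual name proposed in the thread is elided), and a plain Map stands in for Hadoop's Configuration so the sketch stays self-contained.

```java
import java.util.HashMap;
import java.util.Map;

public class CodecSelector {
    // Hypothetical property name; the option actually discussed in the PR
    // is elided in this thread, so this name is an assumption.
    static final String USE_AIRLIFT_KEY = "parquet.compression.codec.use.airlift";

    // A Map stands in for Hadoop's Configuration. When the flag is off
    // (the default), clients fall back to the original Hadoop-based codecs.
    static String chooseCompressorImpl(Map<String, String> conf) {
        boolean useAirlift =
            Boolean.parseBoolean(conf.getOrDefault(USE_AIRLIFT_KEY, "false"));
        return useAirlift ? "airlift" : "hadoop";
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println(chooseCompressorImpl(conf)); // prints "hadoop"
        conf.put(USE_AIRLIFT_KEY, "true");
        System.out.println(chooseCompressorImpl(conf)); // prints "airlift"
    }
}
```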
@samarthjain thanks for the work. I am looking to deploy zstd Parquet into prod, but that requires a new Hadoop with native library support, which is not practical in many prod use cases. Since Airlift is a pure-Java implementation, what are the performance implications for zstd? I saw there is a benchmark for GZIP, but I don't see benchmarks for the other codecs. Also, did we consider using zstd-jni, a Java library that packages native implementations of zstd for different platforms in a jar?
Force-pushed from 4feb369 to 5d24671
Force-pushed a new commit that makes it configurable whether to use Airlift-based compressors. Also added tests and GZIP benchmarks for the Airlift compressors. Benchmark results show no performance improvements or regressions for Airlift GZIP vs. plain GZIP.
@nandorKollar, @rdblue, @danielcweeks - if you have cycles, could you please take a look at this PR?
Force-pushed from 5d24671 to 47871e2