-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-2336: Add caching key to CodecFactory #1134
Conversation
@@ -234,11 +234,11 @@ protected CompressionCodec getCodec(CompressionCodecName codecName) { | |||
if (codecClassName == null) { | |||
return null; | |||
} | |||
CompressionCodec codec = CODEC_BY_NAME.get(codecClassName); | |||
String codecCacheKey = this.cacheKey(codecName); | |||
CompressionCodec codec = CODEC_BY_NAME.get(codecCacheKey); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since CODEC_BY_NAME
is protected
, I think this could break something that is relying on the cache, although I'm not sure why someone would access it directly. Maybe that visibility is an accident?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If that is a concern, we can cache the old key (w/o level) as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally, I'm not too worried about this, I don't see anyone doing this. At least nobody in the Apache org: https://github.com/search?q=org%3Aapache%20CODEC_BY_NAME&type=code :)
if (codec != null) { | ||
return codec; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: unnecessary whitespace change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this looks good. It changes the cache, but I can't think of why anyone would use it directly.
@Fokko, do we have a long-term plan for getting off of Hadoop codecs? It seems like that is a good idea. I think the main blocker is that we are still using these codecs. Otherwise we would be able to remove Hadoop dependencies fairly easily.
@@ -234,11 +234,11 @@ protected CompressionCodec getCodec(CompressionCodecName codecName) { | |||
if (codecClassName == null) { | |||
return null; | |||
} | |||
CompressionCodec codec = CODEC_BY_NAME.get(codecClassName); | |||
String codecCacheKey = this.cacheKey(codecName); | |||
CompressionCodec codec = CODEC_BY_NAME.get(codecCacheKey); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If that is a concern, we can cache the old key (w/o level) as well.
I have added the aircompressor library when I supported the LZ4_RAW codec. Not sure if this makes it easier. @rdblue |
Thanks @wgtmac for jumping in here
I have a PR that I need to revisit apache/iceberg#7369. It requires some awkward changes to make it work. One issue with the Aircompressor codec it doesn't provide Brolti due to licensing issues. |
Just curious, is brotli widely adopted? It seems that it does not have an official java encoder implementation. |
@wgtmac I think it is quite arcane, maybe we can make the |
level = configuration.get("parquet.compression.codec.zstd.level"); | ||
if (level == null) { | ||
// keep "io.compression.codec.zstd.level" for backwards compatibility | ||
level = configuration.get("io.compression.codec.zstd.level"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to cache the old config levels ? It's already been deprecated and currently only the new config is used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, let's remove it then 👍🏻
Thanks everyone for the reviews! |
Make sure you have checked all steps below.
The
CODEC_BY_NAME
is static and may be used across different configurations. If a codec is initialized it will be re-used no matter what the configuration is. This is a problem when there are different compression levels used.https://github.com/apache/parquet-mr/blob/515734c373f69b5250e8b63eb3d1c973da893b63/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/CodecFactory.java#L45-L46
Therefore we need to cache per compression level as well.
Jira
Tests
Commits
Documentation