
PARQUET-2336: Add caching key to CodecFactory #1134

Merged: 4 commits, Sep 18, 2023

Conversation

@Fokko (Contributor) commented Aug 7, 2023

Make sure you have checked all steps below.

The CODEC_BY_NAME cache is static and may be shared across different configurations: once a codec has been initialized, it is re-used regardless of the configuration it was created with. This is a problem when different compression levels are in use.

https://github.com/apache/parquet-mr/blob/515734c373f69b5250e8b63eb3d1c973da893b63/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/CodecFactory.java#L45-L46

Therefore we need to cache per compression level as well.
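The idea above can be illustrated with a minimal sketch. This is not the actual parquet-mr code; the class, field, and helper names below are hypothetical, and only show how folding the compression level into the cache key keeps configurations with different levels from sharing one codec instance.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a codec cache keyed on class name *and* level, so two
// configurations with different zstd levels get separate cache entries.
public class CodecCacheSketch {
  // Hypothetical stand-in for the static CODEC_BY_NAME map.
  private final Map<String, Object> codecByCacheKey = new ConcurrentHashMap<>();

  // Hypothetical helper: append the level to the key when one is configured.
  public String cacheKey(String codecClassName, String level) {
    return level == null ? codecClassName : codecClassName + ":" + level;
  }

  // Look up or create a codec under the level-aware key.
  public Object getOrCreate(String codecClassName, String level,
                            java.util.function.Supplier<Object> factory) {
    return codecByCacheKey.computeIfAbsent(
        cacheKey(codecClassName, level), k -> factory.get());
  }
}
```

With this scheme, a codec created for level "3" and one created for level "5" live under distinct keys, while a codec with no level keeps the plain class-name key.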

Jira

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

@Fokko force-pushed the fd-add-cache-key branch from c818223 to 1246444 on August 7, 2023 11:12
@@ -234,11 +234,11 @@ protected CompressionCodec getCodec(CompressionCodecName codecName) {
   if (codecClassName == null) {
     return null;
   }
-  CompressionCodec codec = CODEC_BY_NAME.get(codecClassName);
+  String codecCacheKey = this.cacheKey(codecName);
+  CompressionCodec codec = CODEC_BY_NAME.get(codecCacheKey);
Contributor:
Since CODEC_BY_NAME is protected, I think this could break something that is relying on the cache, although I'm not sure why someone would access it directly. Maybe that visibility is an accident?

Member:
If that is a concern, we can cache the old key (w/o level) as well.

Contributor (author):
Personally, I'm not too worried about this; I don't see anyone doing it. At least nobody in the Apache org: https://github.com/search?q=org%3Aapache%20CODEC_BY_NAME&type=code :)

if (codec != null) {
return codec;
}

Contributor:
Nit: unnecessary whitespace change.

@rdblue (Contributor) left a comment:
Overall this looks good. It changes the cache, but I can't think of why anyone would use it directly.

@Fokko, do we have a long-term plan for getting off of Hadoop codecs? It seems like that is a good idea. I think the main blocker is that we are still using these codecs. Otherwise we would be able to remove Hadoop dependencies fairly easily.

@wgtmac (Member) commented Aug 9, 2023

> Overall this looks good. It changes the cache, but I can't think of why anyone would use it directly.
>
> @Fokko, do we have a long-term plan for getting off of Hadoop codecs? It seems like that is a good idea. I think the main blocker is that we are still using these codecs. Otherwise we would be able to remove Hadoop dependencies fairly easily.

I have added the aircompressor library when I supported the LZ4_RAW codec. Not sure if this makes it easier. @rdblue

@Fokko (author) commented Aug 9, 2023

Thanks @wgtmac for jumping in here

> @Fokko, do we have a long-term plan for getting off of Hadoop codecs? It seems like that is a good idea. I think the main blocker is that we are still using these codecs. Otherwise, we would be able to remove Hadoop dependencies fairly easily.

I have a PR that I need to revisit: apache/iceberg#7369. It requires some awkward changes to make it work. One issue with the aircompressor codec is that it doesn't provide Brotli due to licensing issues.

@wgtmac (Member) commented Aug 9, 2023

Just curious: is Brotli widely adopted? It seems that it does not have an official Java encoder implementation.

@Fokko (author) commented Aug 9, 2023

@wgtmac I think it is quite arcane. Maybe we can make aircompressor the default at some point, but I think we should keep support around; otherwise, folks won't be able to access their data.

level = configuration.get("parquet.compression.codec.zstd.level");
if (level == null) {
  // keep "io.compression.codec.zstd.level" for backwards compatibility
  level = configuration.get("io.compression.codec.zstd.level");
@zhongyujiang (Contributor), Aug 18, 2023:
Do we need to cache the old config level? It's already been deprecated, and currently only the new config is used.

Contributor (author):
Thanks, let's remove it then 👍🏻
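The fallback being discussed above can be sketched against a plain Map rather than a Hadoop Configuration. This is illustration only, not the parquet-mr code: it models the pre-review behavior, where the deprecated "io.compression.codec.zstd.level" key was still consulted when the new key was absent (the thread above concluded that fallback could be removed).

```java
import java.util.Map;

// Sketch: resolve the zstd compression level, preferring the new key and
// falling back to the deprecated one (pre-review behavior, shown for context).
public class ZstdLevelSketch {
  public static String resolveLevel(Map<String, String> conf) {
    String level = conf.get("parquet.compression.codec.zstd.level");
    if (level == null) {
      // Deprecated key, kept here only to illustrate the old fallback path.
      level = conf.get("io.compression.codec.zstd.level");
    }
    return level;
  }
}
```

When both keys are set, the new key wins; dropping the fallback, as agreed above, simply reduces this to a single lookup.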

@Fokko merged commit 4de3d93 into apache:master on Sep 18, 2023
@Fokko (author) commented Sep 18, 2023

Thanks everyone for the reviews!
