
PARQUET-2336: Add caching key to CodecFactory #1134

Merged: 4 commits, Sep 18, 2023

Conversation

@Fokko (Contributor) commented Aug 7, 2023

Make sure you have checked all steps below.

The CODEC_BY_NAME cache is static and may be shared across different configurations: once a codec has been initialized, it is re-used regardless of the configuration it was created with. This is a problem when different compression levels are in use.

https://github.com/apache/parquet-mr/blob/515734c373f69b5250e8b63eb3d1c973da893b63/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/CodecFactory.java#L45-L46

Therefore we need to cache per compression level as well.
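The idea above can be illustrated with a minimal sketch. This is not the actual parquet-mr code; the class, field, and helper names below are hypothetical, and only show how folding the compression level into the cache key keeps configurations with different levels from sharing one codec instance.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a codec cache keyed on class name *and* level, so two
// configurations with different zstd levels get separate cache entries.
public class CodecCacheSketch {
  // Hypothetical stand-in for the static CODEC_BY_NAME map.
  private final Map<String, Object> codecByCacheKey = new ConcurrentHashMap<>();

  // Hypothetical helper: append the level to the key when one is configured.
  public String cacheKey(String codecClassName, String level) {
    return level == null ? codecClassName : codecClassName + ":" + level;
  }

  // Look up or create a codec under the level-aware key.
  public Object getOrCreate(String codecClassName, String level,
                            java.util.function.Supplier<Object> factory) {
    return codecByCacheKey.computeIfAbsent(
        cacheKey(codecClassName, level), k -> factory.get());
  }
}
```

With this scheme, a codec created for level "3" and one created for level "5" live under distinct keys, while a codec with no level keeps the plain class-name key.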

Jira

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

@Fokko force-pushed the fd-add-cache-key branch from c818223 to 1246444 on August 7, 2023 11:12
@@ -234,11 +234,11 @@ protected CompressionCodec getCodec(CompressionCodecName codecName) {
   if (codecClassName == null) {
     return null;
   }
-  CompressionCodec codec = CODEC_BY_NAME.get(codecClassName);
+  String codecCacheKey = this.cacheKey(codecName);
+  CompressionCodec codec = CODEC_BY_NAME.get(codecCacheKey);
Contributor:
Since CODEC_BY_NAME is protected, I think this could break something that is relying on the cache, although I'm not sure why someone would access it directly. Maybe that visibility is an accident?

Member:
If that is a concern, we can cache the old key (w/o level) as well.

Contributor (author):
Personally, I'm not too worried about this; I don't see anyone doing it. At least nobody in the Apache org: https://github.com/search?q=org%3Aapache%20CODEC_BY_NAME&type=code :)

if (codec != null) {
return codec;
}

Contributor:
Nit: unnecessary whitespace change.

@rdblue (Contributor) left a comment:
Overall this looks good. It changes the cache, but I can't think of why anyone would use it directly.

@Fokko, do we have a long-term plan for getting off of Hadoop codecs? It seems like that is a good idea. I think the main blocker is that we are still using these codecs. Otherwise we would be able to remove Hadoop dependencies fairly easily.

@wgtmac (Member) commented Aug 9, 2023

> Overall this looks good. It changes the cache, but I can't think of why anyone would use it directly.
>
> @Fokko, do we have a long-term plan for getting off of Hadoop codecs? It seems like that is a good idea. I think the main blocker is that we are still using these codecs. Otherwise we would be able to remove Hadoop dependencies fairly easily.

I have added the aircompressor library when I supported the LZ4_RAW codec. Not sure if this makes it easier. @rdblue

@Fokko (author) commented Aug 9, 2023

Thanks @wgtmac for jumping in here

> @Fokko, do we have a long-term plan for getting off of Hadoop codecs? It seems like that is a good idea. I think the main blocker is that we are still using these codecs. Otherwise, we would be able to remove Hadoop dependencies fairly easily.

I have a PR that I need to revisit: apache/iceberg#7369. It requires some awkward changes to make it work. One issue with the aircompressor codec is that it doesn't provide Brotli due to licensing issues.

@wgtmac (Member) commented Aug 9, 2023

Just curious: is Brotli widely adopted? It seems that it does not have an official Java encoder implementation.

@Fokko (author) commented Aug 9, 2023

@wgtmac I think it is quite arcane. Maybe we can make aircompressor the default at some point, but I think we should keep support around; otherwise, folks won't be able to access their data.

level = configuration.get("parquet.compression.codec.zstd.level");
if (level == null) {
  // keep "io.compression.codec.zstd.level" for backwards compatibility
  level = configuration.get("io.compression.codec.zstd.level");
@zhongyujiang (Contributor), Aug 18, 2023:
Do we need to cache the old config level? It's already been deprecated, and currently only the new config is used.

Contributor (author):
Thanks, let's remove it then 👍🏻
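The fallback being discussed above can be sketched against a plain Map rather than a Hadoop Configuration. This is illustration only, not the parquet-mr code: it models the pre-review behavior, where the deprecated "io.compression.codec.zstd.level" key was still consulted when the new key was absent (the thread above concluded that fallback could be removed).

```java
import java.util.Map;

// Sketch: resolve the zstd compression level, preferring the new key and
// falling back to the deprecated one (pre-review behavior, shown for context).
public class ZstdLevelSketch {
  public static String resolveLevel(Map<String, String> conf) {
    String level = conf.get("parquet.compression.codec.zstd.level");
    if (level == null) {
      // Deprecated key, kept here only to illustrate the old fallback path.
      level = conf.get("io.compression.codec.zstd.level");
    }
    return level;
  }
}
```

When both keys are set, the new key wins; dropping the fallback, as agreed above, simply reduces this to a single lookup.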

@Fokko merged commit 4de3d93 into apache:master on Sep 18, 2023
@Fokko (author) commented Sep 18, 2023

Thanks everyone for the reviews!
