PARQUET-2351: Set options with Configuration #1157

Closed

amousavigourabi
Contributor

Make sure you have checked all steps below.

Jira

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:
    Parameterizes existing fixture to instantiate writer options using Configuration instead.

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and classes in the PR contain Javadoc that explains what they do

Options set explicitly through the builder's withXXX(...) methods always override the options passed in through the Configuration. This keeps existing code from breaking in unexpected ways: it would be odd for a Configuration that previously had no effect on these options to start overriding an explicitly set option.
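
The precedence rule described above can be sketched as follows. This is a minimal, self-contained illustration and not the PR's actual code: `PrecedenceSketchBuilder`, the `Map`-based stand-in for Hadoop's `Configuration`, the key name, and the default value are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical builder illustrating "an explicit withXXX call beats the configuration". */
final class PrecedenceSketchBuilder {
  // Assumed key name and default value, chosen for illustration only.
  static final String PAGE_SIZE_KEY = "sketch.page.size";
  static final int DEFAULT_PAGE_SIZE = 1024;

  // A plain map stands in for org.apache.hadoop.conf.Configuration.
  private Map<String, String> conf = new HashMap<>();
  private int pageSize = DEFAULT_PAGE_SIZE;
  // Records whether withPageSize(...) was called explicitly.
  private boolean isPageSizeSet = false;

  PrecedenceSketchBuilder withConf(Map<String, String> conf) {
    this.conf = new HashMap<>(conf);
    return this;
  }

  PrecedenceSketchBuilder withPageSize(int pageSize) {
    this.pageSize = pageSize;
    this.isPageSizeSet = true;
    return this;
  }

  /** Resolves the effective page size: explicit value first, then configuration, then default. */
  int resolvePageSize() {
    if (isPageSizeSet) {
      return pageSize;
    }
    String fromConf = conf.get(PAGE_SIZE_KEY);
    return fromConf != null ? Integer.parseInt(fromConf) : DEFAULT_PAGE_SIZE;
  }
}
```

Because the flag is consulted only when the value is resolved, the order in which `withConf` and `withPageSize` are called does not matter in this sketch.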

ParquetProperties.builder();

private boolean isPageSizeSet = false;
Member

TBH, these would make the code less maintainable. Not sure if Optional would make them more organized.

Contributor Author

I agree that these booleans aren't amazing. Optional fields would not solve the issue of making withXXX methods always override the configuration. Besides, Optionals shouldn't be used as fields IMO, as that is not their intended use. Maybe an approach with a Set would work better?
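
One way the Set idea could look is sketched below. This is an assumption-laden illustration rather than what the PR implements: `SetTrackingBuilder`, its `Option` enum, the key names, the defaults, and the `Map` standing in for Hadoop's `Configuration` are all hypothetical.

```java
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical builder that tracks explicitly set options in a Set instead of one boolean each. */
final class SetTrackingBuilder {
  enum Option { PAGE_SIZE, DICTIONARY_PAGE_SIZE }

  // Assumed key names, chosen for illustration only.
  static final String PAGE_SIZE_KEY = "sketch.page.size";
  static final String DICT_PAGE_SIZE_KEY = "sketch.dictionary.page.size";

  private final Set<Option> explicitlySet = EnumSet.noneOf(Option.class);
  private int pageSize = 1024;          // assumed default
  private int dictionaryPageSize = 512; // assumed default

  SetTrackingBuilder withPageSize(int value) {
    pageSize = value;
    explicitlySet.add(Option.PAGE_SIZE);
    return this;
  }

  SetTrackingBuilder withDictionaryPageSize(int value) {
    dictionaryPageSize = value;
    explicitlySet.add(Option.DICTIONARY_PAGE_SIZE);
    return this;
  }

  /** Applies configuration values, skipping every option that was set explicitly. */
  SetTrackingBuilder applyConf(Map<String, String> conf) {
    if (!explicitlySet.contains(Option.PAGE_SIZE) && conf.containsKey(PAGE_SIZE_KEY)) {
      pageSize = Integer.parseInt(conf.get(PAGE_SIZE_KEY));
    }
    if (!explicitlySet.contains(Option.DICTIONARY_PAGE_SIZE) && conf.containsKey(DICT_PAGE_SIZE_KEY)) {
      dictionaryPageSize = Integer.parseInt(conf.get(DICT_PAGE_SIZE_KEY));
    }
    return this;
  }

  int pageSize() {
    return pageSize;
  }

  int dictionaryPageSize() {
    return dictionaryPageSize;
  }
}
```

Adding an option then means adding one enum constant rather than another boolean field, though each option still needs its own branch in `applyConf`.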

@@ -195,10 +200,28 @@ private static void prepareFile(WriterVersion version, Path file) throws IOExcep
writeData(f, writer);
}

private static void prepareFileWithConf(WriterVersion version, Path file) throws IOException {
Configuration configuration = new Configuration();
Member

It seems that the configurations defined in ParquetOutputFormat are used solely by ParquetOutputFormat to create a RecordWriter, which in turn puts all of them into a ParquetProperties.

https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java#L457-L484

IIUC, relying on settings from the Hadoop configuration is discouraged; we should use ParquetProperties to set all of those things.

Contributor Author

These configurations are used to build the encodingProps ParquetProperties within ParquetWriter's Builder, which are used elsewhere as well. While the usage of Configuration in this way may be discouraged, the current situation of settings sometimes not being picked up by the ParquetWriter is inconsistent with both ParquetReader behaviour and user expectations.

amousavigourabi requested a review from wgtmac on October 3, 2023, 12:51
@@ -341,7 +341,7 @@ public static void setMaxPaddingSize(Configuration conf, int maxPaddingSize) {
conf.setInt(MAX_PADDING_BYTES, maxPaddingSize);
}

-private static int getMaxPaddingSize(Configuration conf) {
+public static int getMaxPaddingSize(Configuration conf) {
Contributor

Changing the method from private to public means it needs documentation and unit tests.

Contributor Author

@amousavigourabi amousavigourabi Oct 16, 2023

👍 I'll add javadoc to the rest of the publics in the class as well while I'm at it.
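
As a rough illustration of what a documented, testable public getter might look like, here is a self-contained sketch. It is not the code from the PR: a `Map` stands in for Hadoop's `Configuration`, and the key string and default value are assumptions that may differ from the real constants in parquet-mr.

```java
import java.util.Map;

/** Hypothetical stand-in for the padding-related helpers discussed in this thread. */
final class PaddingConfSketch {
  // Assumed key string and default; the real parquet-mr constants may differ.
  static final String MAX_PADDING_BYTES = "sketch.writer.max-padding";
  static final int MAX_PADDING_SIZE_DEFAULT = 8 * 1024 * 1024;

  /**
   * Stores the maximum padding size in the given configuration.
   *
   * @param conf the configuration to update
   * @param maxPaddingSize the maximum padding size, in bytes
   */
  public static void setMaxPaddingSize(Map<String, String> conf, int maxPaddingSize) {
    conf.put(MAX_PADDING_BYTES, Integer.toString(maxPaddingSize));
  }

  /**
   * Reads the maximum padding size from the given configuration.
   *
   * @param conf the configuration to read from
   * @return the configured maximum padding size in bytes, or the default if none is set
   */
  public static int getMaxPaddingSize(Map<String, String> conf) {
    String value = conf.get(MAX_PADDING_BYTES);
    return value != null ? Integer.parseInt(value) : MAX_PADDING_SIZE_DEFAULT;
  }
}
```

A unit test for such a getter would typically cover both the round trip through the setter and the fallback to the default when the key is absent.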
