-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-2351: Set options with Configuration #1157
PARQUET-2351: Set options with Configuration #1157
Conversation
ParquetProperties.builder(); | ||
|
||
private boolean isPageSizeSet = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBH, these would make the code less maintainable. Not sure if Optional would make them more organized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that these booleans aren't amazing. Optional fields would not solve the issue of making withXXX methods always override the configuration, next to this, Optionals shouldn't be used as fields IMO, as that is not their intended use. Maybe an approach with a Set would work better?
@@ -195,10 +200,28 @@ private static void prepareFile(WriterVersion version, Path file) throws IOExcep | |||
writeData(f, writer); | |||
} | |||
|
|||
private static void prepareFileWithConf(WriterVersion version, Path file) throws IOException { | |||
Configuration configuration = new Configuration(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that these specific configurations defined in the ParquetOutputFormat are solely used for ParquetOutputFormat to create a RecordWriter, which actually puts all of them into a ParquetProperties.
IIUC, relying on settings from Hadoop configuration is discouraged, we should use ParquetProperties to set all those things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These configurations are used to build the encodingProps
ParquetProperties within ParquetWriter's Builder, which are used elsewhere as well. While the usage of Configuration in this way may be discouraged, the current situation of settings sometimes not being picked up by the ParquetWriter is inconsistent with both ParquetReader behaviour and user expectations.
@@ -341,7 +341,7 @@ public static void setMaxPaddingSize(Configuration conf, int maxPaddingSize) { | |||
conf.setInt(MAX_PADDING_BYTES, maxPaddingSize); | |||
} | |||
|
|||
private static int getMaxPaddingSize(Configuration conf) { | |||
public static int getMaxPaddingSize(Configuration conf) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changing method from private to public would need to add documentation and unit tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 I'll add javadoc to the rest of the publics in the class as well while I'm at it.
f1a074a
to
a4b253e
Compare
Make sure you have checked all steps below.
Jira
Tests
Parameterizes existing fixture to instantiate writer options using Configuration instead.
Commits
Documentation
The functionality is implemented in a way in which options set using the
builder.withXXX(...)
method will always override the options passed by the configuration. This is done in order to not make things break in unexpected ways. Having a configuration which used to not have any effect on these options override an explicit option would be a bit odd after all.