PARQUET-409: Add a configuration key that controls min/max row count for block size check #495
@@ -45,8 +45,10 @@
public static final boolean DEFAULT_IS_DICTIONARY_ENABLED = true;
public static final WriterVersion DEFAULT_WRITER_VERSION = WriterVersion.PARQUET_1_0;
public static final boolean DEFAULT_ESTIMATE_ROW_COUNT_FOR_PAGE_SIZE_CHECK = true;
public static final int DEFAULT_MINIMUM_RECORD_COUNT_FOR_CHECK = 100;
Renaming a public constant is a backward-incompatible change. How about keeping the original constants as aliases for the new ones, marking and documenting them as deprecated with a reference to the new names, and using the new constants in the file?
Totally forgot about that, fixed.
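The aliasing approach suggested in this review thread can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the class name `WriterDefaultsSketch` and the new constant name `DEFAULT_MIN_ROW_COUNT_FOR_BLOCK_SIZE_CHECK` are hypothetical, and treating `DEFAULT_MINIMUM_RECORD_COUNT_FOR_CHECK` as the old name is an assumption. Only the value 100 comes from the diff above.

```java
// Sketch only: class name and new constant name are hypothetical.
final class WriterDefaultsSketch {

    /** New, preferred constant (hypothetical name). Value 100 is taken from the diff. */
    public static final int DEFAULT_MIN_ROW_COUNT_FOR_BLOCK_SIZE_CHECK = 100;

    /**
     * Old name kept as an alias so that existing callers keep compiling.
     * Which name is "old" is an assumption made for this illustration.
     *
     * @deprecated use {@link #DEFAULT_MIN_ROW_COUNT_FOR_BLOCK_SIZE_CHECK} instead
     */
    @Deprecated
    public static final int DEFAULT_MINIMUM_RECORD_COUNT_FOR_CHECK =
        DEFAULT_MIN_ROW_COUNT_FOR_BLOCK_SIZE_CHECK;

    private WriterDefaultsSketch() {
        // constants holder, not meant to be instantiated
    }

    public static void main(String[] args) {
        // New code reads the new constant; old callers see the same value via the alias.
        System.out.println(DEFAULT_MIN_ROW_COUNT_FOR_BLOCK_SIZE_CHECK);
        System.out.println(DEFAULT_MINIMUM_RECORD_COUNT_FOR_CHECK);
    }
}
```

Because the alias is defined in terms of the new constant, the two cannot drift apart, and the `@Deprecated` annotation makes the compiler warn callers that still use the old name.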
@@ -147,12 +147,12 @@ private void checkBlockSizeReached() throws IOException {
 LOG.info("mem size {} > {}: flushing {} records to disk.", memSize, nextRowGroupSize, recordCount);
 flushRowGroupToStore();
 initStore();
-recordCountForNextMemCheck = min(max(MINIMUM_RECORD_COUNT_FOR_CHECK, recordCount / 2), MAXIMUM_RECORD_COUNT_FOR_CHECK);
+recordCountForNextMemCheck = min(max(props.getMinRowCountForBlockSizeCheck(), recordCount / 2), props.getMaxRowCountForBlockSizeCheck());
Now, it seems that the local constants MINIMUM_RECORD_COUNT_FOR_CHECK and MAXIMUM_RECORD_COUNT_FOR_CHECK are not needed anymore. Could you please remove them? (recordCountForNextMemCheck should be initialized by using the corresponding value from ParquetProperties.)
👍 I also removed the unused rowGroupSize instance variable
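For readers following the hunk above, the expression assigned to `recordCountForNextMemCheck` clamps half of the just-flushed record count into the configured range. A standalone sketch of that arithmetic, assuming a hypothetical helper `nextRecordCountForMemCheck` and an assumed maximum of 10000 (the diff shows only the minimum default of 100):

```java
// Sketch only: helper and class names are hypothetical; the max of 10000 is an assumed value.
final class BlockSizeCheckSketch {

    /**
     * Mirrors min(max(minRowCount, recordCount / 2), maxRowCount) from the diff:
     * half of the flushed record count, clamped into [minRowCount, maxRowCount].
     */
    static long nextRecordCountForMemCheck(long recordCount, int minRowCount, int maxRowCount) {
        return Math.min(Math.max(minRowCount, recordCount / 2), maxRowCount);
    }

    public static void main(String[] args) {
        int min = 100;    // default shown in the diff
        int max = 10000;  // assumed for illustration
        System.out.println(nextRecordCountForMemCheck(50, min, max));        // below the floor
        System.out.println(nextRecordCountForMemCheck(5_000, min, max));     // inside the range
        System.out.println(nextRecordCountForMemCheck(1_000_000, min, max)); // above the ceiling
    }
}
```

With these inputs, a small row group is checked again after the minimum number of records, a mid-sized one after half its record count, and a very large one after at most the configured maximum.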
@rgruener, @gszadovszky, there's already an open PR for this and a duplicate issue: PARQUET-869. The PR is #470, which I've updated for @pradeepg26. Sorry for the confusion and duplication, but I'd like to continue with that one since the contribution from Pradeep was made earlier. I just shouldn't commit it myself because I ended up making the changes I requested when I backported this to our Parquet version.
Ah, I wish I had seen that before. The duplicate JIRA issue should probably be closed. I am fine with that one continuing; we would just like this change to get into the next release, since we have been patching it in ourselves.
Adds a way to control the min/max number of rows to pass between block size checks, instead of using hard-coded values.