Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-409: Add a configuration key that controls min/max row count for block size check #495

Closed
wants to merge 0 commits into from

Conversation

rgruener
Copy link

Adds way to control the min/max amount of rows to pass when checking on the block size instead of hard coded values.

@@ -45,8 +45,10 @@
public static final boolean DEFAULT_IS_DICTIONARY_ENABLED = true;
public static final WriterVersion DEFAULT_WRITER_VERSION = WriterVersion.PARQUET_1_0;
public static final boolean DEFAULT_ESTIMATE_ROW_COUNT_FOR_PAGE_SIZE_CHECK = true;
public static final int DEFAULT_MINIMUM_RECORD_COUNT_FOR_CHECK = 100;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renaming a public constants is a backward incompatible change. How about keeping the original constants as an alias for the new constants, mark and document that they are deprecated with a reference to the new names, and use the new constants in the file.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally forgot about that, fixed.

@@ -147,12 +147,12 @@ private void checkBlockSizeReached() throws IOException {
LOG.info("mem size {} > {}: flushing {} records to disk.", memSize, nextRowGroupSize, recordCount);
flushRowGroupToStore();
initStore();
recordCountForNextMemCheck = min(max(MINIMUM_RECORD_COUNT_FOR_CHECK, recordCount / 2), MAXIMUM_RECORD_COUNT_FOR_CHECK);
recordCountForNextMemCheck = min(max(props.getMinRowCountForBlockSizeCheck(), recordCount / 2), props.getMaxRowCountForBlockSizeCheck());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now, it seems that the local constants MINIMUM_RECORD_COUNT_FOR_CHECK and MAXIMUM_RECORD_COUNT_FOR_CHECK are not needed anymore. Could you please remove them? (recordCountForNextMemCheck should be initialized by using the corresponding value from ParquetProperties.)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I also removed the unused rowGroupSize instance variable

@rdblue
Copy link
Contributor

rdblue commented Jun 18, 2018

@rgruener, @gszadovszky, there's already an open PR for this and a duplicate issue: PARQUET-869. The PR is #470, which I've updated for @pradeepg26.

Sorry for the confusion and duplication, but I'd like to continue with that one since the contribution from Pradeep was made earlier. I just shouldn't commit it myself because I ended up making the changes I requested when I backported this to our Parquet version.

@rgruener
Copy link
Author

Ah, I wish I saw that before. The duplicate JIRA issue should probably be closed. I am fine having that one continue, we just would like this change to get into the next release since we have been patching this in ourselves.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants