
PARQUET-869 Configurable min/max record counts for block size check #447

Open
wants to merge 2 commits into master

Conversation

pradeepg26

Min/max record counts for the block size check are no longer hard-coded inside InternalParquetRecordWriter.

@@ -102,6 +103,8 @@ private void initStore() {
MessageColumnIO columnIO = new ColumnIOFactory(validating).getColumnIO(schema);
this.recordConsumer = columnIO.getRecordWriter(columnStore);
writeSupport.prepareForWrite(recordConsumer);
System.out.println(String.format("Created ParquetWriter with [%d, %d] for block size checks. Estimation(%s). BlockSize(%d)",
Contributor

Please use SLF4J for logging. You can see examples in the rest of the library.

Author

Oops... missed removing this statement.

@@ -142,13 +145,23 @@ private void checkBlockSizeReached() throws IOException {
LOG.info("mem size {} > {}: flushing {} records to disk.", memSize, nextRowGroupSize, recordCount);
flushRowGroupToStore();
initStore();
recordCountForNextMemCheck = min(max(MINIMUM_RECORD_COUNT_FOR_CHECK, recordCount / 2), MAXIMUM_RECORD_COUNT_FOR_CHECK);
if (estimateNextBlockSizeCheck) {
Contributor
@rdblue Jan 31, 2018

Do you need this setting, or is it included to match the page size checking? I don't think we would ever want to set it to true because it is important to be under the max row group size.

Author

Will do.

Author

Working on the comments @rdblue... do you think instead of a min/max record count for row group size checks, we should check every n rows?

Author

I've taken a closer look at this code. The only change when estimateNextBlockSizeCheck is disabled is that the writer uses a static minimum number of records before performing the next size check.

If estimateNextBlockSizeCheck is enabled, then existing behavior is preserved where we try to estimate when to perform the next size check by performing the check at the halfway point of the predicted number of records (unless it's larger than the max).

So, I added the estimateNextBlockSizeCheck to keep it consistent with page size checking.
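The estimation behavior described above can be sketched as follows. This is a minimal illustration of the clamped halfway estimate shown in the diff, not the actual Parquet source; the constant values are the PR's defaults.

```java
public class NextCheckEstimator {
    // Defaults taken from the PR's DEFAULT_*_RECORD_COUNT constants
    static final long MIN_RECORD_COUNT_FOR_CHECK = 100;
    static final long MAX_RECORD_COUNT_FOR_CHECK = 10_000;

    // With estimation enabled, the next size check happens at half the record
    // count of the row group just flushed, clamped to [min, max]; with it
    // disabled, the writer falls back to the static minimum every time.
    static long recordCountForNextCheck(long lastRecordCount, boolean estimate) {
        if (estimate) {
            return Math.min(
                Math.max(MIN_RECORD_COUNT_FOR_CHECK, lastRecordCount / 2),
                MAX_RECORD_COUNT_FOR_CHECK);
        }
        return MIN_RECORD_COUNT_FOR_CHECK;
    }

    public static void main(String[] args) {
        System.out.println(recordCountForNextCheck(600, true));    // halfway point: 300
        System.out.println(recordCountForNextCheck(50_000, true)); // clamped to max: 10000
        System.out.println(recordCountForNextCheck(600, false));   // static minimum: 100
    }
}
```

The clamp is what keeps the adaptive check from overshooting the row group size on wide records (lower bound) or checking too rarely on narrow ones (upper bound).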

Contributor

Consistency here is fine. I'm not sure we will use it, but I can certainly think of valid use cases for it.

public static final String MIN_ROW_COUNT_FOR_PAGE_SIZE_CHECK = "parquet.page.size.row.check.min";
public static final String MAX_ROW_COUNT_FOR_PAGE_SIZE_CHECK = "parquet.page.size.row.check.max";
public static final String ESTIMATE_PAGE_SIZE_CHECK = "parquet.page.size.check.estimate";
public static final String MIN_ROW_COUNT_FOR_PAGE_SIZE_CHECK = "parquet.page.size_row_check_min";
Contributor

The page properties should not change, and the row group properties should match the naming convention for pages.

Author

Oops... it was changed for internal reasons... forgot to change it back.

public static final boolean DEFAULT_ESTIMATE_ROW_COUNT_FOR_BLOCK_SIZE_CHECK = true;
public static final int DEFAULT_MINIMUM_RECORD_COUNT_FOR_PAGE_SIZE_CHECK = 100;
public static final int DEFAULT_MAXIMUM_RECORD_COUNT_FOR_PAGE_SIZE_CHECK = 10000;
public static final int DEFAULT_MINIMUM_RECORD_COUNT_FOR_BLOCK_SIZE_CHECK = 100;
Contributor

Instead of block, please use "row group". Block is an overused term that we are avoiding in the public API, except for those places where it already appears.

Author

Will do.
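For illustration, a hypothetical rendering of what the reviewer is asking for. The row group constant names and property keys below are invented here to show the pattern ("row group" terminology, mirroring the page naming convention); they are not the final API.

```java
public class RowGroupCheckNames {
    // Existing page-size-check keys (as shown in the diff above):
    public static final String MIN_ROW_COUNT_FOR_PAGE_SIZE_CHECK = "parquet.page.size.row.check.min";
    public static final String MAX_ROW_COUNT_FOR_PAGE_SIZE_CHECK = "parquet.page.size.row.check.max";

    // Hypothetical row group equivalents: same key structure, but named for
    // "row group" rather than "block" (illustrative only):
    public static final String MIN_ROW_COUNT_FOR_ROW_GROUP_SIZE_CHECK = "parquet.rowgroup.size.row.check.min";
    public static final String MAX_ROW_COUNT_FOR_ROW_GROUP_SIZE_CHECK = "parquet.rowgroup.size.row.check.max";
}
```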

private final ParquetFileWriter parquetFileWriter;
private final WriteSupport<T> writeSupport;
private final MessageType schema;
private final Map<String, String> extraMetaData;
private final long rowGroupSize;
private final boolean estimateNextBlockSizeCheck;
Contributor

Was rowGroupSize not used? Why was it removed?

Author

I wrote this code over a year ago... I don't remember, but I'll check. I think it was an unused variable.

@rdblue
Contributor

rdblue commented Jan 31, 2018

Thanks for doing this, @pradeepg26! We've occasionally had use cases where it would have been nice. My two main concerns are the unnecessary boolean controlling whether or not to estimate when to do the next check and the naming. Naming should use "row group" instead of "block".

@pwais

pwais commented Jul 4, 2019

bump? note disclosure on https://eng.uber.com/petastorm/

@rdblue
Contributor

rdblue commented Jul 4, 2019

@pwais, this was replaced by #470 that includes updates for problems in this PR. Unfortunately, other committers decided they did not want to change internal APIs so it was not committed.

@pwais

pwais commented Jul 4, 2019

Thank you!

3 participants