-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-869 Configurable min/max record counts for block size check #447
base: master
Are you sure you want to change the base?
Conversation
@@ -102,6 +103,8 @@ private void initStore() { | |||
MessageColumnIO columnIO = new ColumnIOFactory(validating).getColumnIO(schema); | |||
this.recordConsumer = columnIO.getRecordWriter(columnStore); | |||
writeSupport.prepareForWrite(recordConsumer); | |||
System.out.println(String.format("Created ParquetWriter with [%d, %d] for block size checks. Estimation(%s). BlockSize(%d)", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use SLF4J for logging. You can see examples in the rest of the library.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops... missed removing this statement.
@@ -142,13 +145,23 @@ private void checkBlockSizeReached() throws IOException { | |||
LOG.info("mem size {} > {}: flushing {} records to disk.", memSize, nextRowGroupSize, recordCount); | |||
flushRowGroupToStore(); | |||
initStore(); | |||
recordCountForNextMemCheck = min(max(MINIMUM_RECORD_COUNT_FOR_CHECK, recordCount / 2), MAXIMUM_RECORD_COUNT_FOR_CHECK); | |||
if (estimateNextBlockSizeCheck) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need this setting, or is it included to match the page size checking? I don't think we would ever want to set it to true because it is important to be under the max row group size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Working on the comments @rdblue... do you think instead of a min/max record count for row group size checks, we should check every n rows?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've taken a closer look at this code. The only change with the estimateNextBlockSizeCheck
is that it uses a static min
number of records before performing the next size check if disabled.
If estimateNextBlockSizeCheck
is enabled, then existing behavior is preserved where we try to estimate when to perform the next size check by performing the check at the halfway point of the predicted number of records (unless it's larger than the max).
So, I added the estimateNextBlockSizeCheck
to keep it consistent with page size checking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consistency here is fine. I'm not sure we will use it, but I can certainly think of valid use cases for it.
public static final String MIN_ROW_COUNT_FOR_PAGE_SIZE_CHECK = "parquet.page.size.row.check.min"; | ||
public static final String MAX_ROW_COUNT_FOR_PAGE_SIZE_CHECK = "parquet.page.size.row.check.max"; | ||
public static final String ESTIMATE_PAGE_SIZE_CHECK = "parquet.page.size.check.estimate"; | ||
public static final String MIN_ROW_COUNT_FOR_PAGE_SIZE_CHECK = "parquet.page.size_row_check_min"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The page properties should not change, and the row group properties should match the naming convention for pages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops... it was changed for internal reasons... forgot to change it back.
public static final boolean DEFAULT_ESTIMATE_ROW_COUNT_FOR_BLOCK_SIZE_CHECK = true; | ||
public static final int DEFAULT_MINIMUM_RECORD_COUNT_FOR_PAGE_SIZE_CHECK = 100; | ||
public static final int DEFAULT_MAXIMUM_RECORD_COUNT_FOR_PAGE_SIZE_CHECK = 10000; | ||
public static final int DEFAULT_MINIMUM_RECORD_COUNT_FOR_BLOCK_SIZE_CHECK = 100; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of block, please use "row group". Block is an overused term that we are avoiding in the public API, except for those places where it already appears.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
private final ParquetFileWriter parquetFileWriter; | ||
private final WriteSupport<T> writeSupport; | ||
private final MessageType schema; | ||
private final Map<String, String> extraMetaData; | ||
private final long rowGroupSize; | ||
private final boolean estimateNextBlockSizeCheck; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was rowGroupSize not used? Why was it removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wrote this code over a year ago... I don't remember, but I'll check. I think it was an unused variable.
Thanks for doing this, @pradeepg26! We've occasionally had use cases where it would have been nice. My two main concerns are the unnecessary boolean controlling whether or not to estimate when to do the next check and the naming. Naming should use "row group" instead of "block". |
bump? note disclosure on https://eng.uber.com/petastorm/ |
Thank you! |
Make min/max record counts for block size check are no longer hard coded inside InternalParquetRecordWriter.