-
Notifications
You must be signed in to change notification settings - Fork 434
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DRAFT: Incremental improvements to parquet metadata #248
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I like the idea of moving statistics away from byte arrays where possible. Hopefully separating this from the larger V3 discussion will help gain some traction.
9: optional byte max1; | ||
10: optional byte min1; | ||
11: optional i16 max2; | ||
12: optional i16 min2; | ||
13: optional i32 max4; | ||
14: optional i32 min4; | ||
15: optional i64 max8; | ||
16: optional i64 min8; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At first I was thinking a union could be used for these, which would reduce the (logical) complexity some (it would also include the binary pair), but would add encoding overhead. But how about a single fixed-width i64 pair. These are zig-zag encoded anyway, so a single byte value won't take any extra space in the file. And then there would be fewer members in the thrift struct as well. We could save a byte in the file for booleans, but then that adds to the struct as well, so probably not worth adding a bool pair.
/* The index into FileMetadata.schema (list<SchemaElement>) for this column */ | ||
17: optional i32 schema_index; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would have found this helpful 😄 But I'm sure others would argue that it's easy enough to create an in memory map during schema parsing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this added we can skip metadata for columns that do not participate in a row group. Without this index we won't be able to reconstruct the map.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to be clear this would be a future optimization to drop columns that are completely null?
Do you have actual data to support this? For |
The slowness is actually |
* column. Floating point values are bitcasted to integers. Variable length | ||
* values are set in min_value/max_value. | ||
* | ||
* Min and Max are the lower and upper bound values for the column, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this imply is_min_value_exact is only used for variable length values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would think not. I can definitely see a use for fixed length byte array as well. Also a lazy implementation could set min to 0 and max to INT_MAX to indicate only positive values are present.
caba3c6
to
da18b38
Compare
I am retracting this PR in favor of: https://docs.google.com/document/d/1PQpY418LkIDHMFYCY8ne_G-CFpThK15LLpzWYbc7rFU/edit and: |
Incremental parquet metadata improvements
This is an alternative proposal to #242 which can be executed with minimal changes to parquet readers/writers.
Wide schemata (large number of columns) make
FileMetadata
very slow to parse. The majority of the time is spent in parsing thriftlist<>
and in particular heavily nestedlist<StructType>
fields. These are notoriously slow to decode because they are variable sized and they involve extra allocations. In this proposal we avoid such fields as much as possible to improve decoding. In addition we allow columns that do not participate in a row group to have their column chunk metadata skipped.Jira
Commits
Documentation