
DRAFT: Incremental improvements to parquet metadata #248

Closed
wants to merge 1 commit

Conversation

@alkis (Contributor) commented May 21, 2024:

Incremental parquet metadata improvements

This is an alternative proposal to #242 that can be executed with minimal changes to Parquet readers and writers.

Wide schemata (a large number of columns) make FileMetadata very slow to parse. The majority of the time is spent parsing thrift list<> fields, in particular heavily nested list<StructType> fields. These are notoriously slow to decode because they are variable sized and involve an extra allocation per element. This proposal avoids such fields as much as possible to speed up decoding. In addition, it allows columns that do not participate in a row group to have their column chunk metadata skipped entirely.
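
To make the cost concrete, below is a minimal, self-contained Java sketch of the variable-length integer (ULEB128-style varint) decoding that thrift's compact protocol uses for integer values and large collection sizes. The per-byte branching, multiplied by a field-header read per field and an allocation per struct element, is what dominates with wide schemata. This is an illustrative simplification, not the actual thrift runtime code.

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

class VarintDemo {
    // ULEB128 varint decode: each byte carries 7 payload bits plus a
    // continuation bit, so the decoder branches on every byte. A
    // fixed-width field avoids this per-byte loop entirely.
    static long readVarint(InputStream in) throws IOException {
        long value = 0;
        int shift = 0;
        while (true) {
            int b = in.read();
            value |= (long) (b & 0x7f) << shift;
            if ((b & 0x80) == 0) return value;
            shift += 7;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] encoded = {(byte) 0xAC, 0x02}; // varint encoding of 300
        System.out.println(readVarint(new ByteArrayInputStream(encoded))); // prints 300
    }
}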

Jira

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All public functions and classes in the PR contain Javadoc that explains what they do

@etseidl (Contributor) left a comment:

Thanks. I like the idea of moving statistics away from byte arrays where possible. Hopefully separating this from the larger V3 discussion will help it gain some traction.

Comment on lines +273 to +280
9: optional byte max1;
10: optional byte min1;
11: optional i16 max2;
12: optional i16 min2;
13: optional i32 max4;
14: optional i32 min4;
15: optional i64 max8;
16: optional i64 min8;
Contributor:

At first I was thinking a union could be used for these, which would reduce the (logical) complexity somewhat (it would also subsume the binary pair), but it would add encoding overhead. How about a single fixed-width i64 pair instead? These are zig-zag encoded anyway, so a single-byte value won't take any extra space in the file, and there would be fewer members in the thrift struct as well. We could save a byte in the file for booleans, but that adds to the struct too, so it's probably not worth adding a bool pair.
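
To see why a fixed-width i64 pair costs nothing extra for small values, here is a minimal standalone Java demo of zig-zag mapping plus varint sizing. This is not parquet code; varintSize is a hypothetical helper mirroring the ULEB128 encoding used by thrift's compact protocol.

class ZigZagDemo {
    // Zig-zag maps signed to unsigned so small magnitudes stay small:
    // 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    static long zigzag(long n) {
        return (n << 1) ^ (n >> 63);
    }

    // Byte count of the ULEB128 varint encoding of an unsigned 64-bit value.
    static int varintSize(long u) {
        int size = 1;
        while ((u >>> 7) != 0) {
            u >>>= 7;
            size++;
        }
        return size;
    }

    public static void main(String[] args) {
        System.out.println(varintSize(zigzag(5L)));             // 1 byte
        System.out.println(varintSize(zigzag(-3L)));            // 1 byte
        System.out.println(varintSize(zigzag(Long.MAX_VALUE))); // 10 bytes
    }
}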

Comment on lines +836 to +837
/* The index into FileMetadata.schema (list<SchemaElement>) for this column */
17: optional i32 schema_index;
Contributor:

I would have found this helpful 😄 But I'm sure others would argue that it's easy enough to build an in-memory map during schema parsing.

@alkis (Contributor, author) replied:

With this added, we can skip metadata for columns that do not participate in a row group. Without this index we would not be able to reconstruct the mapping from column chunks to schema columns.
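
A rough sketch of what a reader could do with the proposed schema_index when a row group carries chunks only for participating columns. ColumnChunk and schemaIndex here are hypothetical stand-ins for the thrift-generated classes, kept minimal for illustration.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SchemaIndexDemo {
    // Hypothetical stand-in for the thrift-generated ColumnChunk class,
    // carrying the proposed schema_index field.
    static class ColumnChunk {
        int schemaIndex;
        // ... remaining column chunk metadata ...
    }

    // With schema_index, a sparse chunk list still maps back to schema leaves.
    static Map<Integer, ColumnChunk> indexChunks(List<ColumnChunk> chunks) {
        Map<Integer, ColumnChunk> bySchemaIndex = new HashMap<>();
        for (ColumnChunk chunk : chunks) {
            bySchemaIndex.put(chunk.schemaIndex, chunk);
        }
        // A schema leaf absent from the map has no data in this row group.
        // Without schema_index, a chunk's only link to its column is its
        // position in the list, so no chunk could be omitted.
        return bySchemaIndex;
    }
}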

Contributor:

Just to be clear, would this be a future optimization to drop columns that are completely null?

@pitrou (Member) commented May 28, 2024:

Quoting the proposal: "The majority of the time is spent in parsing thrift list<> and binary fields. These are notoriously slow to decode because they are variable sized."

Do you have actual data to support this? For list I could understand (but the problem is really the number of nested elements), but for binary this sounds rather unexpected.

@alkis (Contributor, author) commented May 28, 2024:

Quoting @pitrou: "Do you have actual data to support this? For list I could understand (but the problem is really the number of nested elements), but for binary this sounds rather unexpected."

The slowness is actually in list<>, and particularly list<StructType> with nested lists; binary is not as large a problem.

* column. Floating point values are bitcasted to integers. Variable length
* values are set in min_value/max_value.
*
* Min and Max are the lower and upper bound values for the column,
Contributor:

Does this imply is_min_value_exact is only used for variable-length values?

Contributor:

I would think not. I can definitely see a use for fixed-length byte arrays as well. Also, a lazy implementation could set min to 0 and max to INT_MAX to indicate that only positive values are present.
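
For illustration, a hedged Java sketch of that "lazy writer" idea: publish conservative bounds for a non-negative INT32 column and mark them inexact. Stats here is a hypothetical stand-in for the thrift-generated Statistics class, not the real API.

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class LazyBoundsDemo {
    // Hypothetical stand-in for the thrift-generated Statistics class.
    static class Stats {
        byte[] minValue, maxValue;
        Boolean isMinValueExact, isMaxValueExact;
    }

    // A writer that only knows the column is non-negative can still publish
    // usable, conservative bounds and mark them inexact.
    static Stats positiveOnlyInt32Bounds() {
        Stats stats = new Stats();
        // min_value/max_value use plain encoding: 4-byte little-endian for INT32.
        stats.minValue = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN)
                .putInt(0).array();
        stats.maxValue = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN)
                .putInt(Integer.MAX_VALUE).array();
        stats.isMinValueExact = false; // 0 itself may not occur in the data
        stats.isMaxValueExact = false; // INT_MAX almost certainly does not occur
        return stats;
    }
}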

@alkis force-pushed the incremental-metadata-improvements branch from caba3c6 to da18b38 on May 29, 2024 12:08
@alkis (Contributor, author) commented May 29, 2024:

@alkis closed this May 29, 2024