
DRAFT: Incremental improvements to parquet metadata #248

Closed
wants to merge 1 commit

Conversation

@alkis (Contributor) commented May 21, 2024:

Incremental parquet metadata improvements

This is an alternative proposal to #242 that can be executed with minimal changes to Parquet readers and writers.

Wide schemata (a large number of columns) make FileMetadata very slow to parse. The majority of the time is spent parsing thrift list<> fields, in particular heavily nested list<StructType> fields. These are notoriously slow to decode because they are variable sized and involve an extra allocation per element. This proposal avoids such fields as much as possible to speed up decoding. In addition, it allows columns that do not participate in a row group to have their column chunk metadata skipped entirely.
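
To make the cost concrete, below is a minimal, self-contained Java sketch of the variable-length integer (ULEB128-style varint) decoding that thrift's compact protocol uses for integer values and large collection sizes. The per-byte branching, multiplied by a field-header read per field and an allocation per struct element, is what dominates with wide schemata. This is an illustrative simplification, not the actual thrift runtime code.

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

class VarintDemo {
    // ULEB128 varint decode: each byte carries 7 payload bits plus a
    // continuation bit, so the decoder branches on every byte. A
    // fixed-width field avoids this per-byte loop entirely.
    static long readVarint(InputStream in) throws IOException {
        long value = 0;
        int shift = 0;
        while (true) {
            int b = in.read();
            value |= (long) (b & 0x7f) << shift;
            if ((b & 0x80) == 0) return value;
            shift += 7;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] encoded = {(byte) 0xAC, 0x02}; // varint encoding of 300
        System.out.println(readVarint(new ByteArrayInputStream(encoded))); // prints 300
    }
}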

Jira

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All public functions and classes in the PR contain Javadoc that explains what they do

@etseidl (Contributor) left a comment:

Thanks. I like the idea of moving statistics away from byte arrays where possible. Hopefully separating this from the larger V3 discussion will help it gain some traction.

Comment on lines +273 to +280
9: optional byte max1;
10: optional byte min1;
11: optional i16 max2;
12: optional i16 min2;
13: optional i32 max4;
14: optional i32 min4;
15: optional i64 max8;
16: optional i64 min8;
Contributor:

At first I was thinking a union could be used for these, which would reduce the (logical) complexity somewhat (it would also subsume the binary pair), but it would add encoding overhead. How about a single fixed-width i64 pair instead? These are zig-zag encoded anyway, so a single-byte value won't take any extra space in the file, and there would be fewer members in the thrift struct as well. We could save a byte in the file for booleans, but that adds to the struct too, so it's probably not worth adding a bool pair.
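
To see why a fixed-width i64 pair costs nothing extra for small values, here is a minimal standalone Java demo of zig-zag mapping plus varint sizing. This is not parquet code; varintSize is a hypothetical helper mirroring the ULEB128 encoding used by thrift's compact protocol.

class ZigZagDemo {
    // Zig-zag maps signed to unsigned so small magnitudes stay small:
    // 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    static long zigzag(long n) {
        return (n << 1) ^ (n >> 63);
    }

    // Byte count of the ULEB128 varint encoding of an unsigned 64-bit value.
    static int varintSize(long u) {
        int size = 1;
        while ((u >>> 7) != 0) {
            u >>>= 7;
            size++;
        }
        return size;
    }

    public static void main(String[] args) {
        System.out.println(varintSize(zigzag(5L)));             // 1 byte
        System.out.println(varintSize(zigzag(-3L)));            // 1 byte
        System.out.println(varintSize(zigzag(Long.MAX_VALUE))); // 10 bytes
    }
}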

Comment on lines +836 to +837
/* The index into FileMetadata.schema (list<SchemaElement>) for this column */
17: optional i32 schema_index;
Contributor:

I would have found this helpful 😄 But I'm sure others would argue that it's easy enough to build an in-memory map during schema parsing.

@alkis (Contributor, author) replied:

With this added, we can skip metadata for columns that do not participate in a row group. Without this index we would not be able to reconstruct the mapping from column chunks to schema columns.
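
A rough sketch of what a reader could do with the proposed schema_index when a row group carries chunks only for participating columns. ColumnChunk and schemaIndex here are hypothetical stand-ins for the thrift-generated classes, kept minimal for illustration.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SchemaIndexDemo {
    // Hypothetical stand-in for the thrift-generated ColumnChunk class,
    // carrying the proposed schema_index field.
    static class ColumnChunk {
        int schemaIndex;
        // ... remaining column chunk metadata ...
    }

    // With schema_index, a sparse chunk list still maps back to schema leaves.
    static Map<Integer, ColumnChunk> indexChunks(List<ColumnChunk> chunks) {
        Map<Integer, ColumnChunk> bySchemaIndex = new HashMap<>();
        for (ColumnChunk chunk : chunks) {
            bySchemaIndex.put(chunk.schemaIndex, chunk);
        }
        // A schema leaf absent from the map has no data in this row group.
        // Without schema_index, a chunk's only link to its column is its
        // position in the list, so no chunk could be omitted.
        return bySchemaIndex;
    }
}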

Contributor:

Just to be clear, would this be a future optimization to drop columns that are completely null?

@pitrou (Member) commented May 28, 2024:

Quoting the proposal: "The majority of the time is spent in parsing thrift list<> and binary fields. These are notoriously slow to decode because they are variable sized."

Do you have actual data to support this? For list I could understand (but the problem is really the number of nested elements), but for binary this sounds rather unexpected.

@alkis (Contributor, author) commented May 28, 2024:

Quoting @pitrou: "Do you have actual data to support this? For list I could understand (but the problem is really the number of nested elements), but for binary this sounds rather unexpected."

The slowness is actually in list<>, and particularly list<StructType> with nested lists; binary is not as large a problem.

* column. Floating point values are bitcasted to integers. Variable length
* values are set in min_value/max_value.
*
* Min and Max are the lower and upper bound values for the column,
Contributor:

Does this imply is_min_value_exact is only used for variable-length values?

Contributor:

I would think not. I can definitely see a use for fixed-length byte arrays as well. Also, a lazy implementation could set min to 0 and max to INT_MAX to indicate that only positive values are present.
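
For illustration, a hedged Java sketch of that "lazy writer" idea: publish conservative bounds for a non-negative INT32 column and mark them inexact. Stats here is a hypothetical stand-in for the thrift-generated Statistics class, not the real API.

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class LazyBoundsDemo {
    // Hypothetical stand-in for the thrift-generated Statistics class.
    static class Stats {
        byte[] minValue, maxValue;
        Boolean isMinValueExact, isMaxValueExact;
    }

    // A writer that only knows the column is non-negative can still publish
    // usable, conservative bounds and mark them inexact.
    static Stats positiveOnlyInt32Bounds() {
        Stats stats = new Stats();
        // min_value/max_value use plain encoding: 4-byte little-endian for INT32.
        stats.minValue = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN)
                .putInt(0).array();
        stats.maxValue = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN)
                .putInt(Integer.MAX_VALUE).array();
        stats.isMinValueExact = false; // 0 itself may not occur in the data
        stats.isMaxValueExact = false; // INT_MAX almost certainly does not occur
        return stats;
    }
}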

@alkis force-pushed the incremental-metadata-improvements branch from caba3c6 to da18b38 on May 29, 2024 12:08
@alkis (Contributor, author) commented May 29, 2024:

@alkis closed this May 29, 2024