DRAFT: Incremental improvements to parquet metadata #248
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -242,43 +242,42 @@ struct SizeStatistics { | |
* All fields are optional. | ||
*/ | ||
struct Statistics { | ||
/** | ||
* DEPRECATED: min and max value of the column. Use min_value and max_value. | ||
* | ||
* Values are encoded using PLAIN encoding, except that variable-length byte | ||
* arrays do not include a length prefix. | ||
* | ||
* These fields encode min and max values determined by signed comparison | ||
* only. New files should use the correct order for a column's logical type | ||
* and store the values in the min_value and max_value fields. | ||
* | ||
* To support older readers, these may be set when the column order is | ||
* signed. | ||
*/ | ||
/* DEPRECATED: do not use */ | ||
1: optional binary max; | ||
2: optional binary min; | ||
/** count of null value in the column */ | ||
3: optional i64 null_count; | ||
/** count of distinct values occurring */ | ||
4: optional i64 distinct_count; | ||
/** | ||
* Lower and upper bound values for the column, determined by its ColumnOrder. | ||
* Only one pair of max_value/min_value, max1/min1, max2/min2, max4/min4, | ||
* max8/min8 can be set. The pair is determined by the physical type of the | ||
* column. Floating point values are bitcasted to integers. Variable length | ||
* values are set in min_value/max_value. | ||
* | ||
* Min and Max are the lower and upper bound values for the column, | ||
* respectively, as determined by its ColumnOrder. | ||
* | ||
* These may be the actual minimum and maximum values found on a page or column | ||
* chunk, but can also be (more compact) values that do not exist on a page or | ||
* column chunk. For example, instead of storing "Blart Versenwald III", a writer | ||
* may set min_value="B", max_value="C". Such more compact values must still be | ||
* valid values within the column's logical type. | ||
* | ||
* Values are encoded using PLAIN encoding, except that variable-length byte | ||
* arrays do not include a length prefix. | ||
*/ | ||
5: optional binary max_value; | ||
6: optional binary min_value; | ||
/** If true, max_value is the actual maximum value for a column */ | ||
7: optional bool is_max_value_exact; | ||
/** If true, min_value is the actual minimum value for a column */ | ||
8: optional bool is_min_value_exact; | ||
9: optional byte max1; | ||
10: optional byte min1; | ||
11: optional i16 max2; | ||
12: optional i16 min2; | ||
13: optional i32 max4; | ||
14: optional i32 min4; | ||
15: optional i64 max8; | ||
16: optional i64 min8; | ||
Comment on lines +273 to +280
At first I thought a union could be used for these, which would reduce the (logical) complexity somewhat (it would also include the binary pair) but would add encoding overhead. How about a single fixed-width i64 pair instead? These are zigzag encoded anyway, so a single-byte value won't take any extra space in the file, and there would be fewer members in the Thrift struct as well. We could save a byte in the file for booleans, but that adds to the struct too, so a bool pair is probably not worth adding.
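The size argument in this comment can be checked with a small sketch. It assumes the metadata is serialized with the Thrift compact protocol, where i16/i32/i64 fields are written as zigzag varints, and it counts only the value payload (field headers are ignored). The helper names `zigzag`, `varint_len`, and `encoded_len` are made up for illustration and are not part of any Thrift library.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def zigzag(n: int) -> int:
    """Map a signed 64-bit integer to unsigned so small magnitudes stay small."""
    return ((n << 1) ^ (n >> 63)) & MASK64


def varint_len(u: int) -> int:
    """Number of bytes a base-128 varint needs for the unsigned value u."""
    length = 1
    while u >= 0x80:
        u >>= 7
        length += 1
    return length


def encoded_len(n: int) -> int:
    """Payload bytes for a signed value stored as a zigzag varint."""
    return varint_len(zigzag(n))


if __name__ == "__main__":
    for value in (0, 63, -64, 64, 127, -128, 2**31 - 1, 2**63 - 1):
        print(value, encoded_len(value))
```

Under these assumptions, values in the range [-64, 63] fit in one byte, while the remaining i8 values (for example 127 or -128) need two. A single i64 pair would therefore cost at most one extra byte per bound compared with a dedicated byte pair, and nothing extra compared with the i16, i32, or i64 pairs.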
||
} | ||
|
||
/** Empty structs to use as logical type annotations */ | ||
|
@@ -490,7 +489,7 @@ enum Encoding { | |
// GROUP_VAR_INT = 1; | ||
|
||
/** | ||
* Deprecated: Dictionary encoding. The values in the dictionary are encoded in the | ||
* DEPRECATED: Dictionary encoding. The values in the dictionary are encoded in the | ||
* plain type. | ||
* in a data page use RLE_DICTIONARY instead. | ||
* in a Dictionary page use PLAIN instead | ||
|
@@ -772,15 +771,15 @@ struct PageEncodingStats { | |
* Description for column metadata | ||
*/ | ||
struct ColumnMetaData { | ||
/** Type of this column **/ | ||
1: required Type type | ||
/* DEPRECATED: can be found in SchemaElement */ | ||
1: optional Type type | ||
|
||
/** Set of all encodings used for this column. The purpose is to validate | ||
* whether we can decode those pages. **/ | ||
2: required list<Encoding> encodings | ||
|
||
/** Path in schema **/ | ||
3: required list<string> path_in_schema | ||
/* DEPRECATED: can be found in SchemaElement */ | ||
3: optional list<string> path_in_schema | ||
|
||
/** Compression codec **/ | ||
4: required CompressionCodec codec | ||
|
@@ -833,6 +832,9 @@ struct ColumnMetaData { | |
* filter pushdown. | ||
*/ | ||
16: optional SizeStatistics size_statistics; | ||
|
||
/* The index into FileMetadata.schema (list<SchemaElement>) for this column */ | ||
17: optional i32 schema_index; | ||
Comment on lines +836 to +837
I would have found this helpful 😄 But I'm sure others would argue that it's easy enough to create an in-memory map during schema parsing.
With this added, we can skip metadata for columns that do not participate in a row group. Without this index, we won't be able to reconstruct the map.
Just to be clear, this would be a future optimization to drop columns that are completely null?
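The point of this thread can be sketched as follows. The classes below are simplified stand-ins for the Thrift structs (only `name`, `num_children`, and the proposed `schema_index` are modeled), and `leaf_indices` and `resolve_names` are hypothetical helpers, not functions from any Parquet implementation. The sketch assumes the schema list is flattened depth-first and treats any element without children as a leaf.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SchemaElement:
    name: str
    num_children: Optional[int] = None  # unset for leaf (primitive) columns


@dataclass
class ColumnMetaData:
    schema_index: Optional[int] = None  # proposed field 17


def leaf_indices(schema: List[SchemaElement]) -> List[int]:
    """The in-memory map a reader builds today: leaf position -> schema index."""
    return [i for i, element in enumerate(schema) if not element.num_children]


def resolve_names(schema: List[SchemaElement],
                  columns: List[ColumnMetaData]) -> List[str]:
    """Resolve each column chunk to its schema element name."""
    leaves = leaf_indices(schema)
    names = []
    for position, column in enumerate(columns):
        if column.schema_index is not None:
            index = column.schema_index   # explicit index survives omissions
        else:
            index = leaves[position]      # positional fallback
        names.append(schema[index].name)
    return names


schema = [SchemaElement("root", 3), SchemaElement("a"),
          SchemaElement("b"), SchemaElement("c")]

# A row group that omits the chunk for "b" (hypothetical future optimization).
with_index = [ColumnMetaData(1), ColumnMetaData(3)]
without_index = [ColumnMetaData(), ColumnMetaData()]

print(resolve_names(schema, with_index))
print(resolve_names(schema, without_index))
```

When every leaf has a chunk, the positional map is enough. Once a chunk is omitted, the positional fallback silently attributes the second chunk to "b", which is why the explicit index is needed for that optimization.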
||
} | ||
|
||
struct EncryptionWithFooterKey { | ||
|
Does this imply is_min_value_exact is only used for variable-length values?
I would think not. I can definitely see a use for fixed-length byte arrays as well. Also, a lazy implementation could set min to 0 and max to INT_MAX to indicate that only non-negative values are present.
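To make the role of the exactness flags concrete, here is a minimal sketch of how a writer might shorten byte-array bounds and how a reader could use them. It assumes unsigned lexicographic byte ordering and plain ASCII data; a real writer would also have to keep the shortened value valid for the column's logical type (for example, valid UTF-8). `Bounds`, `shorten_upper`, `make_bounds`, `may_contain`, and `exact_min` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Bounds:
    min_value: bytes
    max_value: bytes
    is_min_value_exact: bool = False
    is_max_value_exact: bool = False


def shorten_upper(value: bytes, length: int) -> Optional[bytes]:
    """Return an upper bound of at most `length` bytes, or None if impossible."""
    if len(value) <= length:
        return value
    prefix = bytearray(value[:length])
    while prefix and prefix[-1] == 0xFF:
        prefix.pop()          # 0xFF cannot be incremented, drop it
    if not prefix:
        return None           # all 0xFF: no shorter upper bound exists
    prefix[-1] += 1           # smallest prefix that sorts after `value`
    return bytes(prefix)


def make_bounds(actual_min: bytes, actual_max: bytes, length: int) -> Bounds:
    lo = actual_min[:length]  # a prefix always sorts <= the full value
    hi = shorten_upper(actual_max, length)
    if hi is None:
        hi = actual_max
    return Bounds(lo, hi, lo == actual_min, hi == actual_max)


def may_contain(bounds: Bounds, value: bytes) -> bool:
    """Pruning only needs valid bounds, not exact ones."""
    return bounds.min_value <= value <= bounds.max_value


def exact_min(bounds: Bounds) -> Optional[bytes]:
    """Answering MIN() from statistics requires the exact flag."""
    return bounds.min_value if bounds.is_min_value_exact else None


name = b"Blart Versenwald III"
bounds = make_bounds(name, name, 1)
print(bounds)
```

With a one-byte limit this reproduces the example from the struct comment: the bounds become "B" and "C", both flagged as inexact. Filter pushdown still works on them, but a reader that wants the true minimum has to fall back to the data. The same distinction would apply to the fixed-width case mentioned above, where a writer stores loose integer bounds and leaves both flags false.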