-
Notifications
You must be signed in to change notification settings - Fork 434
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-2473: Clarify records can not be split across v2 pages or PageIndex #244
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -578,7 +578,13 @@ enum BoundaryOrder { | |
|
||
/** Data page header */ | ||
struct DataPageHeader { | ||
/** Number of values, including NULLs, in this data page. **/ | ||
/** | ||
* Number of values, including NULLs, in this data page. | ||
* | ||
* If a OffsetIndex is present, a page must begin at a record | ||
* boundary (repetition_level = 0). Otherwise, pages may begin | ||
* within a record (repetition_level > 0). | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I wonder if we could instead state that records should not be split across pages, but that readers should tolerate this in the abscence of a page index or v2 data pages, as certain writers historically wrote such data? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree with you that splitting pages will make the file less compatible, even though some writers historically did that. However, I would like to ensure there is consensus on updating the spec to say that writers should do something before changing the spec |
||
**/ | ||
1: required i32 num_values | ||
|
||
/** Encoding used for this data page **/ | ||
|
@@ -625,7 +631,11 @@ struct DataPageHeaderV2 { | |
/** Number of NULL values, in this data page. | ||
Number of non-null = num_values - num_nulls which is also the number of values in the data section **/ | ||
2: required i32 num_nulls | ||
/** Number of rows in this data page. which means pages change on record boundaries (r = 0) **/ | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I also propose expanding the |
||
/** | ||
* Number of rows in this data page. Every page must begin at a | ||
* record boundary (repetition_level = 0): records must **not** be | ||
* split across page boundaries when using V2 data pages. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. So a row is a record? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yes |
||
**/ | ||
3: required i32 num_rows | ||
/** Encoding used for data in this page **/ | ||
4: required Encoding encoding | ||
|
@@ -995,8 +1005,9 @@ struct PageLocation { | |
2: required i32 compressed_page_size | ||
|
||
/** | ||
* Index within the RowGroup of the first row of the page; this means pages | ||
* change on record boundaries (r = 0). | ||
* Index within the RowGroup of the first row of the page. When an | ||
* OffsetIndex is present, pages must begin on record boundaries | ||
* (repetition_level = 0). | ||
*/ | ||
3: required i64 first_row_index | ||
} | ||
|
@@ -1178,4 +1189,3 @@ struct FileCryptoMetaData { | |
* and (possibly) columns **/ | ||
2: optional binary key_metadata | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we unify the row/record terminology? They seem to mean the same thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree row/record seem to mean the same thing
I double checked and it appears the rest of the file is inconsistent in the terminology as well
For example the repetition level documentation refers to records
parquet-format/src/main/thrift/parquet.thrift
Lines 180 to 192 in 8d59c7d
but there are several fields that are named
num_rows
that clearly refer to rows.parquet-format/src/main/thrift/parquet.thrift
Lines 623 to 631 in 8d59c7d
In this PR I followed the term used in the repetition level documentation as I think it is the most relevant.
I will start a conversation on the mailing list about using consistent terminology
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mailing list conversation: https://lists.apache.org/thread/0gxjk4tphxls0gcrc7lt775pf8s7gtz3
Currently the consensus seems to be to use "row"
I will make a follow on PR to change that terminology after this PR is merged.
I would like to retain the "record" terminology here to remain consistent with the current wording in the repetition levels section
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#256