Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-2473: Clarify records can not be split across v2 pages or PageIndex #244

Merged
merged 2 commits into from
May 31, 2024
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
20 changes: 15 additions & 5 deletions src/main/thrift/parquet.thrift
Original file line number Diff line number Diff line change
Expand Up @@ -578,7 +578,13 @@ enum BoundaryOrder {

/** Data page header */
struct DataPageHeader {
/** Number of values, including NULLs, in this data page. **/
/**
* Number of values, including NULLs, in this data page.
*
* If a ColumnIndex is present, a page must begin at a record
* boundary (repetition_level = 0). Otherwise, pages may begin
* within a record (repetition_level > 0).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we could instead state that records should not be split across pages, but that readers should tolerate this in the abscence of a page index or v2 data pages, as certain writers historically wrote such data?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you that splitting pages will make the file less compatible, even though some writers historically did that.

However, I would like to ensure there is consensus on updating the spec to say that writers should do something before changing the spec

**/
1: required i32 num_values

/** Encoding used for this data page **/
Expand Down Expand Up @@ -625,7 +631,11 @@ struct DataPageHeaderV2 {
/** Number of NULL values, in this data page.
Number of non-null = num_values - num_nulls which is also the number of values in the data section **/
2: required i32 num_nulls
/** Number of rows in this data page. which means pages change on record boundaries (r = 0) **/
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also propose expanding the r = 0 terminology to be explicitly repetition_level = 0 to match the conventions in the rest of the documentation

/**
* Number of rows in this data page. Every page must begin at a
* record boundary (repetition_level = 0): records must **not** be
* split across page boundaries when using V2 data pages.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So a row is a record?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

**/
3: required i32 num_rows
/** Encoding used for data in this page **/
4: required Encoding encoding
Expand Down Expand Up @@ -995,8 +1005,9 @@ struct PageLocation {
2: required i32 compressed_page_size

/**
* Index within the RowGroup of the first row of the page; this means pages
* change on record boundaries (r = 0).
* Index within the RowGroup of the first row of the page. When a
* ColumnIndex is present, pages begin on record boundaries
* (repetition_level = 0).
*/
3: required i64 first_row_index
}
Expand Down Expand Up @@ -1178,4 +1189,3 @@ struct FileCryptoMetaData {
* and (possibly) columns **/
2: optional binary key_metadata
}