Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-2139: Deprecate ColumnChunk::file_offset field #440

Merged
merged 3 commits into from
Jul 3, 2024
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
26 changes: 13 additions & 13 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -89,38 +89,38 @@ more pages.
This file and the [Thrift definition](src/main/thrift/parquet.thrift) should be read together to understand the format.

4-byte magic number "PAR1"
<Column 1 Chunk 1 + Column Metadata>
<Column 2 Chunk 1 + Column Metadata>
<Column 1 Chunk 1 + Page Metadata>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm why this is Page Metadata? It's a bit confusing for me, since from ColumnChunk, we may not get the details of the page internal (like how many pages we have?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I switched it for two reasons. First was a comment @julienledem made. The second is that since the following section on metadata now states there are two types of metadata (file metadata and page header metadata), both should appear in the diagram. Perhaps the diagram should say "Page Header Metadata", or just omit all mention of metadata and simply have <Column 1 Chunk 1>. I don't have a strong preference beyond it should no longer show "Column Metadata".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no bijection between column chunks and page headers, so I find this highly misleading.

You could make it "Optional Column Metadata" if you want stress that it's not always there, or you could indeed omit it if it's not an important feature anymore.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the following diagram makes clear that a column chunk is a collection of pages, and pages include the page header, I'm fine with omitting all mention of metadata here.

<Column 2 Chunk 1 + Page Metadata>
...
<Column N Chunk 1 + Column Metadata>
<Column 1 Chunk 2 + Column Metadata>
<Column 2 Chunk 2 + Column Metadata>
<Column N Chunk 1 + Page Metadata>
<Column 1 Chunk 2 + Page Metadata>
<Column 2 Chunk 2 + Page Metadata>
...
<Column N Chunk 2 + Column Metadata>
<Column N Chunk 2 + Page Metadata>
...
<Column 1 Chunk M + Column Metadata>
<Column 2 Chunk M + Column Metadata>
<Column 1 Chunk M + Page Metadata>
<Column 2 Chunk M + Page Metadata>
...
<Column N Chunk M + Column Metadata>
<Column N Chunk M + Page Metadata>
File Metadata
4-byte length in bytes of file metadata (little endian)
4-byte magic number "PAR1"

In the above example, there are N columns in this table, split into M row
groups. The file metadata contains the locations of all the column metadata
groups. The file metadata contains the locations of all the column chunk
start locations. More details on what is contained in the metadata can be found
in the Thrift definition.

Metadata is written after the data to allow for single pass writing.
File Metadata is written after the data to allow for single pass writing.

Readers are expected to first read the file metadata to find all the column
chunks they are interested in. The columns chunks should then be read sequentially.

![File Layout](https://raw.github.com/apache/parquet-format/master/doc/images/FileLayout.gif)

## Metadata
There are three types of metadata: file metadata, column (chunk) metadata and page
header metadata. All thrift structures are serialized using the TCompactProtocol.
There are two types of metadata: file metadata and page header metadata. All thrift structures
are serialized using the TCompactProtocol.

![Metadata diagram](https://github.com/apache/parquet-format/raw/master/doc/images/FileFormat.gif)

Expand Down
19 changes: 14 additions & 5 deletions src/main/thrift/parquet.thrift
Original file line number Diff line number Diff line change
Expand Up @@ -867,12 +867,21 @@ struct ColumnChunk {
**/
1: optional string file_path

/** Byte offset in file_path to the ColumnMetaData **/
2: required i64 file_offset
/** Deprecated: Byte offset in file_path to the ColumnMetaData
*
* Past use of this field has been inconsistent, with some implementations
* using it to point to the ColumnMetaData and some using it to point to
* the first page in the column chunk. In many cases, the ColumnMetaData at this
* location is wrong. This field is now deprecated and should not be used.
* Writers should set this field to 0 if no ColumnMetaData has been written outside
* the footer.
Comment on lines +876 to +877
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm I don't understand would set a required field to 0 is ok and would it cracks the legacy reader

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, required fields are allowed to have default values (see this for some messy details). I looked at code generated by thrift 0.21.0. Java and c++ set the field to the default value in the default constructor, and write the field regardless of its value. The generated rust code doesn't appear to set the default value, however.

It is conceivable that there is a thrift implementation that would not write the '0', which would then break an old reader that expects the field to be set. Maybe I should revert setting a default. @alkis what do you think?

Copy link
Contributor

@alkis alkis Jun 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is required by the thrift spec.

From https://thrift.apache.org/docs/idl#required:

Write: Required fields are always written and are expected to be set.
Read: Required fields are always read and are expected to be contained in the input stream.
Defaults values: are always written

In any case leaving this with a default of 0 doesn't hurt. Java and C++ can stop populating this field as soon as they pick up the new thrift IDL. Rust cannot stop populating because its thrift implementation is broken. When its fixed, it can also simplify.

*/
2: required i64 file_offset = 0

/** Column metadata for this chunk. This is the same content as what is at
* file_path/file_offset. Having it here has it replicated in the file
* metadata.
/** Column metadata for this chunk. Some writers may also replicate this at the
* location pointed to by file_path/file_offset.
* Note: while marked as optional, this field is in fact required by most major
* Parquet implementations. As such, writers MUST populate this field.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

**/
3: optional ColumnMetaData meta_data

Expand Down