Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-2139: fix file_offset field in ColumnChunk metadata #1369

Merged
merged 1 commit into from
Jul 20, 2024
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -534,8 +534,9 @@ private void addRowGroup(
int columnOrdinal = -1;
ByteArrayOutputStream tempOutStream = null;
for (ColumnChunkMetaData columnMetaData : columns) {
ColumnChunk columnChunk =
new ColumnChunk(columnMetaData.getFirstDataPageOffset()); // verify this is the right offset
// There is no ColumnMetaData written after the chunk data, so set the ColumnChunk
// file_offset to 0
ColumnChunk columnChunk = new ColumnChunk(0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something that @etseidl and me have discussed in https://issues.apache.org/jira/browse/PARQUET-2139. The best fix is to write ColumnMetaData at the end of each column chunk (currently it does not) and store the correct offset here. However, it has been wrong since day 1 and takes some effort to make it right. Since we have not seen any issue around this these years, I'm inclined to deprecate this field together with the v3 discussion. Therefore I'm fine with setting an invalid value here (0 or -1). WDYT? @gszadovszky @julienledem

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also brought this up on the mailing list.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wgtmac, I agree to write invalid value here (0 is as invalid as -1 because of the magic bytes at the beginning of the file) and remove the field for v3.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we intended to write the ColumnMetaData at the end of the column chunk though. Is it something that is ambiguous in the spec?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I followed up on the mailing list on the thread above)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parquet-cpp actually writes ColumnMetaData right after the last page of the column and stores it into file_offset field: https://github.com/apache/arrow/blob/6800be9331d88024bf550c77865a06c592a22699/cpp/src/parquet/metadata.cc#L1473-L1478

columnChunk.file_path = block.getPath(); // they are in the same file for now
InternalColumnEncryptionSetup columnSetup = null;
boolean writeCryptoMetadata = false;
Expand Down