Bogus file offset for ColumnMetaData written to ColumnChunk metadata of single parquet files #2678

asfimport opened this issue Apr 20, 2022 · 12 comments · Fixed by #1369

In an effort to understand the parquet format better, I've so far written my own Thrift parser, and upon examining the output, I noticed something peculiar.

To begin with, check out the definition for ColumnChunk here: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift

You'll notice that if there's an element 2 in the struct, it is supposed to be a file offset pointing to where a redundant copy of the ColumnMetaData is stored.

Next, have a look at the file called "modified.parquet" attached to https://issues.apache.org/jira/browse/PARQUET-2069. When I dump the metadata at the end of the file, I get this:

Struct(FileMetaData):
     1: i32(version) = I32(1)
     2: List(SchemaElement schema):
          ...
     3: i64(num_rows) = I64(1)
     4: List(RowGroup row_groups):
        1: Struct(RowGroup row_groups):
           1: List(ColumnChunk columns):
              1: Struct(ColumnChunk columns):
                 2: i64(file_offset) = I64(4)
                 3: Struct(ColumnMetaData meta_data):
                    1: Type(type) = I32(6) = BYTE_ARRAY
                    2: List(Encoding encodings):
                       1: Encoding(encodings) = I32(0) = PLAIN
                       2: Encoding(encodings) = I32(3) = RLE
                    3: List(string path_in_schema):
                       1: string(path_in_schema) = Binary("destination_addresses")
                       2: string(path_in_schema) = Binary("array")
                       3: string(path_in_schema) = Binary("element")
                    4: CompressionCodec(codec) = I32(0) = UNCOMPRESSED
                    5: i64(num_values) = I64(6)
                    6: i64(total_uncompressed_size) = I64(197)
                    7: i64(total_compressed_size) = I64(197)
                    9: i64(data_page_offset) = I64(4)

As you can see, element 2 of the ColumnChunk indicates that there is another copy of the ColumnMetaData at offset 4 of the file. But then we see that element 9 of the ColumnMetaData shown above indicates that the data page offset is ALSO 4, where we should find a Thrift encoding of a PageHeader structure. Obviously, both structures can't be in the same place, and in fact a PageHeader is what is located there.
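
(Editor's aside: here is a minimal sketch, not part of the original report, of how one might verify this observation. It assumes a local copy of the modified.parquet attachment and uses Util.readPageHeader from the org.apache.parquet.format module discussed below; if file_offset really pointed at a ColumnMetaData copy, this read should fail, but it deserializes a valid PageHeader instead.)

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.parquet.format.PageHeader;
import org.apache.parquet.format.Util;

public class CheckFileOffset {
  public static void main(String[] args) throws IOException {
    // Offset 4 is the first byte after the leading "PAR1" magic, i.e. the
    // location that ColumnChunk.file_offset claims holds a ColumnMetaData copy.
    try (InputStream in = Files.newInputStream(Paths.get("modified.parquet"))) {
      if (in.skip(4) != 4) {
        throw new IOException("could not skip the 4-byte magic");
      }
      // In practice this deserializes cleanly as a PageHeader, confirming that
      // a data page header, not a ColumnMetaData copy, lives at offset 4.
      PageHeader header = Util.readPageHeader(in);
      System.out.println(header);
    }
  }
}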

Based on what I'm seeing here, I believe that element 2 of ColumnChunk should be omitted entirely in this scenario, so as not to falsely indicate that another copy of the ColumnMetaData exists at a location in the file where something else is actually present.

It may take me a while to locate the offending code, but I thought I'd go ahead and point this out before I set off to investigate.

Reporter: Timothy Miller / @theosib-amazon

PRs and other links:

Note: This issue was originally created as PARQUET-2139. Please see the migration documentation for further details.

Timothy Miller / @theosib-amazon:
Of course, I'll be embarrassed if this turns out to just be a bug in my own Thrift parser, but everything seems to line up so far.

Timothy Miller / @theosib-amazon:
I've noticed a few places that could be at fault here. I'm looking at 1.13.0-SNAPSHOT, for reference.

The first is at line 513 of org.apache.parquet.format.converter.ParquetMetadataConverter.addRowGroup(), where we find:

      ColumnChunk columnChunk = new ColumnChunk(columnMetaData.getFirstDataPageOffset()); // verify this is the right offset

I'm pretty sure that this is the wrong thing to put here. The same thing (columnMetaData.getFirstDataPageOffset()) is used further down when constructing the ColumnMetaData, and since that works properly, this value is evidently the location of the PageHeader, not an extra copy of the ColumnMetaData. If no extra copy of the ColumnMetaData has been specified, then the default constructor should be called. In fact, it may be that we should ALWAYS call the default constructor here, since I cannot find any place in the code where a pointer to a redundant copy of the ColumnMetaData can even be specified.
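
As a rough sketch of the change suggested above (hypothetical, not the patch that was eventually merged), the ColumnChunk could stop borrowing the first data page offset and instead carry a file_offset of 0, which is where the discussion below ends up:

      // Hypothetical sketch only: parquet-java never writes a redundant
      // ColumnMetaData copy after the chunk data, so rather than reusing
      // columnMetaData.getFirstDataPageOffset(), set the required
      // file_offset field to 0 to signal that no such copy exists.
      ColumnChunk columnChunk = new ColumnChunk(0);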

Secondly, I notice that at line 1179 of org.apache.parquet.format.ColumnChunk.write(), the FILE_OFFSET_FIELD_DESC field is written unconditionally to the Thrift encoder:

      oprot.writeFieldBegin(FILE_OFFSET_FIELD_DESC);
      oprot.writeI64(struct.file_offset);
      oprot.writeFieldEnd();

This should check and only write the field if it's nonzero.

Timothy Miller / @theosib-amazon:
I just noticed that the file_offset field in ColumnChunk is "required." So there are a few possible mistakes:

  1. The description of the metadata structures is wrong and this really is supposed to be a pointer to the data page (PageHeader), or

  2. There's a completely separate bug where the parquet writer fails to store an extra copy of the ColumnMetaData in the file right before the PageHeader, or

  3. The offset should point to where the unique copy of ColumnMetaData is already going to be found in the file footer, although that seems like it would be really hard to calculate.

In any case, there's an inconsistency: the metadata definition specifies an offset to a ColumnMetaData, but a PageHeader is what is actually placed there.

I'm going to go check out the reader and see what it does with this field. My guess is that it doesn't use the field at all, which is why this discrepancy is never a problem.

Edward Seidl / @etseidl:
In light of the recent metadata discussions on the mailing list [1], I did a similar exercise to the reporter's and came to the same conclusion. It appears the file_offset is always set to the start of the column chunk data, rather than to the start of the encoded ColumnMetaData after the column chunk. Given that this has never led to any issues, it seems this field is not actually used by any major Parquet readers.

[1] apache/parquet-format#242 (comment)

Gang Wu / @wgtmac:
I just checked parquet-cpp and parquet-java:

Edward Seidl / @etseidl:
@wgtmac I haven't been at my computer, so I haven't been able to trace this in working code, but the suspicious line of code is:

new ColumnChunk(columnMetaData.getFirstDataPageOffset()); // verify this is the right offset

There is a ParquetMetadataConverter in ParquetFileWriter, but I don't know if it's used (I couldn't find another reference to file_offset in the parquet-java code).

I can say that of the files in parquet-testing that I've examined with a metadata parser, all those written with parquet-mr have had the file_offset field populated with the offset of the first page in the chunk. An example (meta_offset is the value of the file_offset field in the ColumnChunk):

% pqmeta -f int32_with_null_pages.parquet 
FileMetaData:
  version: 1
  num_rows: 1000
  created_by: parquet-mr version 1.13.0-SNAPSHOT (build 433de8df33fcf31927f7b51456be9f53e64d48b9)
  column_orders:
     0 : TYPE_ORDER
  keyval:
     writer.model.name = example
  schema:
    schema idx:1 
      int32_field idx:2 leaf_idx:0 OPTIONAL INT32 D:1 R:0
RowGroups: count = 1
Rowgroup 0: num_rows:1000 total_byte_size:3328 total_compressed_size:3328 file_offset:4
-----------
int32_field INT32 UNCOMPRESSED dict_page:None data_page:4 size:3328/3328 num_values:1000 meta_offset:4
    encodings: PLAIN, BIT_PACKED, RLE
    statistics: [min:-2136906554, max:2145722375, num_nulls:275]
    offset_index:3456/100 column_index:3332/124
    encoding_stats: [DATA_PAGE:PLAIN:10] 

Similarly, for alltypes_tiny_pages_plain.parquet:

FileMetaData:
  version: 1
  num_rows: 7300
  created_by: parquet-mr version 1.12.0-SNAPSHOT (build 6901a2040848c6b37fa61f4b0a76246445f396db)
...
RowGroups: count = 1
Rowgroup : num_rows:7300 total_byte_size:676687 total_compressed_size:None file_offset:None
-----------
id INT32 UNCOMPRESSED dict_page:None data_page:4 size:37325/37325 num_values:7300 meta_offset:4
    encodings: PLAIN, RLE, BIT_PACKED
    statistics: [min:0, max:7299, num_nulls:0]
    offset_index:747419/3503 column_index:676691/3919
    encoding_stats: [DATA_PAGE:PLAIN:325]
bool_col BOOLEAN UNCOMPRESSED dict_page:None data_page:37329 size:3022/3022 num_values:7300 meta_offset:37329
    encodings: PLAIN, RLE, BIT_PACKED
    statistics: [min:False, max:True, num_nulls:0]
    offset_index:750922/823 column_index:680610/507
    encoding_stats: [DATA_PAGE:PLAIN:82] 

For both, the first file_offset is 4.

For an Arrow-encoded file, the file_offset seems correct: 1162 is 4 + 1158, for instance, i.e. the data page offset plus the compressed chunk size, pointing just past the end of the chunk data.

% pqmeta -f byte_stream_split.zstd.parquet
FileMetaData:
  version: 2
  num_rows: 300
  created_by: parquet-cpp-arrow version 14.0.2
...
Rowgroup 0: num_rows:300 total_byte_size:3726 total_compressed_size:3441 file_offset:4
-----------
f32 FLOAT ZSTD dict_page:None data_page:4 size:1158/1255 num_values:300 meta_offset:1162
    encodings: RLE, BYTE_STREAM_SPLIT
    statistics: [min:(-2.772592782974243,), max:(2.3831448554992676,), num_nulls:0]
    encoding_stats: [DATA_PAGE:BYTE_STREAM_SPLIT:1]
f64 DOUBLE ZSTD dict_page:None data_page:1230 size:2283/2471 num_values:300 meta_offset:3513
    encodings: RLE, BYTE_STREAM_SPLIT
    statistics: [min:(-3.0461430547999266,), max:(2.6962240525635797,), num_nulls:0]
    encoding_stats: [DATA_PAGE:BYTE_STREAM_SPLIT:1] 

Once I'm back in the office I can try to reproduce with the latest code.

Gang Wu / @wgtmac:
Good catch! I was looking for an explicit call to ColumnChunk.setFile_offset(). This is definitely a bug IMHO.

Edward Seidl / @etseidl:
@wgtmac I think the fix might be an easy one (if the diagnosis is correct). I can submit a PR early next week unless you (or someone else) plan to tackle this sooner.

Gang Wu / @wgtmac:
It would be great if you have time to fix this. I'm a little bit overwhelmed these days. :(

Edward Seidl / @etseidl:
@wgtmac Do you know offhand if parquet-java ever writes ColumnMetaData at the end of a column chunk? I couldn't find code that does this. If parquet-java never writes the ColumnMetaData anywhere but in the footer, is it reasonable to set this file_offset to 0? (That's what we do in libcudf, for instance.) If 0 is OK, then I have a PR pretty much ready to go.

Gang Wu / @wgtmac:
@etseidl I couldn't find it either.

IMO, the best fix is to write the ColumnMetaData at the end of the column chunk, as the spec requires. However, this has been implemented incorrectly since day 1, and it would take non-trivial effort to implement in parquet-java. Could we instead clarify that this field is deprecated and should not be used, given the known issue?

Perhaps it is also a good time to add a note to the spec: write 0 or -1 to file_offset when the ColumnMetaData is not available at the end of the column chunk.

Edward Seidl / @etseidl:
Sounds good. I'm going to submit a PR for this issue with the file_offset set to 0 for now. Separately, I'll start a discussion on the mailing list about modifying the spec to match what appears to be the reality on the ground: ColumnMetaData can be written after the chunk data, in the footer, or both.
