Bogus file offset for ColumnMetaData written to ColumnChunk metadata of single parquet files #2678
Comments
Timothy Miller / @theosib-amazon: The first problem is at line 513 of org.apache.parquet.format.converter.ParquetMetadataConverter.addRowGroup(), where the ColumnChunk's file_offset is populated from columnMetaData.getFirstDataPageOffset(). I'm pretty sure this is the wrong value to put here. The same expression (columnMetaData.getFirstDataPageOffset()) is used further down when constructing the ColumnMetaData, and since that works properly, this value is evidently the location of the PageHeader, not of an extra copy of the ColumnMetaData. If no extra copy of the ColumnMetaData has been specified, the default constructor should be called instead. In fact, it may be that we should ALWAYS call the default constructor here, since I cannot find any place in the code where an offset to a redundant copy of the ColumnMetaData can even be specified.
Secondly, I notice that at line 1179 of org.apache.parquet.format.ColumnChunk.write(), the FILE_OFFSET_FIELD_DESC field is always written unconditionally to the Thrift encoder. This should be guarded so that the field is only written when it is nonzero.
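The guarded write described above can be sketched in miniature. This is not the real generated Thrift code; ColumnChunkWriter and its methods are hypothetical stand-ins that model the difference between emitting the optional field unconditionally and emitting it only when it was actually set:

```python
# Hypothetical sketch (not parquet-java or generated Thrift code): models why
# an optional field like file_offset must be guarded before serialization.

class ColumnChunkWriter:
    """Toy stand-in for a generated ColumnChunk.write() method."""

    def __init__(self, file_offset=None, data_page_offset=0):
        self.file_offset = file_offset          # optional redundant-metadata offset
        self.data_page_offset = data_page_offset

    def write_buggy(self):
        # Bug: the field is emitted unconditionally, so a meaningless value
        # (here a default 0) always lands in the serialized struct.
        fields = [("file_offset", self.file_offset or 0)]
        fields.append(("data_page_offset", self.data_page_offset))
        return fields

    def write_fixed(self):
        # Fix: emit the optional field only when it was actually set.
        fields = []
        if self.file_offset is not None:
            fields.append(("file_offset", self.file_offset))
        fields.append(("data_page_offset", self.data_page_offset))
        return fields

chunk = ColumnChunkWriter(file_offset=None, data_page_offset=4)
assert ("file_offset", 0) in chunk.write_buggy()
assert all(name != "file_offset" for name, _ in chunk.write_fixed())
```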
Edward Seidl / @etseidl: (referring to line 538 at commit dd9014f) There is a ParquetMetadataConverter in ParquetFileWriter, but I don't know whether it's used (I couldn't find another reference to file_offset in the parquet-java code). I can say that of the files in parquet-testing that I've examined with a metadata parser, all those written with parquet-mr have had the file_offset field populated with the offset of the first page in the chunk. An example (meta_offset is the value of the file_offset field in the ColumnChunk):
% pqmeta -f int32_with_null_pages.parquet
FileMetaData:
version: 1
num_rows: 1000
created_by: parquet-mr version 1.13.0-SNAPSHOT (build 433de8df33fcf31927f7b51456be9f53e64d48b9)
column_orders:
0 : TYPE_ORDER
keyval:
writer.model.name = example
schema:
schema idx:1
int32_field idx:2 leaf_idx:0 OPTIONAL INT32 D:1 R:0
RowGroups: count = 1
Rowgroup 0: num_rows:1000 total_byte_size:3328 total_compressed_size:3328 file_offset:4
-----------
int32_field INT32 UNCOMPRESSED dict_page:None data_page:4 size:3328/3328 num_values:1000 meta_offset:4
encodings: PLAIN, BIT_PACKED, RLE
statistics: [min:-2136906554, max:2145722375, num_nulls:275]
offset_index:3456/100 column_index:3332/124
encoding_stats: [DATA_PAGE:PLAIN:10]
Similarly, for alltypes_tiny_pages_plain.parquet:
FileMetaData:
version: 1
num_rows: 7300
created_by: parquet-mr version 1.12.0-SNAPSHOT (build 6901a2040848c6b37fa61f4b0a76246445f396db)
...
RowGroups: count = 1
Rowgroup : num_rows:7300 total_byte_size:676687 total_compressed_size:None file_offset:None
-----------
id INT32 UNCOMPRESSED dict_page:None data_page:4 size:37325/37325 num_values:7300 meta_offset:4
encodings: PLAIN, RLE, BIT_PACKED
statistics: [min:0, max:7299, num_nulls:0]
offset_index:747419/3503 column_index:676691/3919
encoding_stats: [DATA_PAGE:PLAIN:325]
bool_col BOOLEAN UNCOMPRESSED dict_page:None data_page:37329 size:3022/3022 num_values:7300 meta_offset:37329
encodings: PLAIN, RLE, BIT_PACKED
statistics: [min:False, max:True, num_nulls:0]
offset_index:750922/823 column_index:680610/507
encoding_stats: [DATA_PAGE:PLAIN:82]
For both files, the first file_offset is 4. For an Arrow-encoded file, the file_offset seems correct: 1162 is 4 + 1158, for instance.
% pqmeta -f byte_stream_split.zstd.parquet
FileMetaData:
version: 2
num_rows: 300
created_by: parquet-cpp-arrow version 14.0.2
...
Rowgroup 0: num_rows:300 total_byte_size:3726 total_compressed_size:3441 file_offset:4
-----------
f32 FLOAT ZSTD dict_page:None data_page:4 size:1158/1255 num_values:300 meta_offset:1162
encodings: RLE, BYTE_STREAM_SPLIT
statistics: [min:(-2.772592782974243,), max:(2.3831448554992676,), num_nulls:0]
encoding_stats: [DATA_PAGE:BYTE_STREAM_SPLIT:1]
f64 DOUBLE ZSTD dict_page:None data_page:1230 size:2283/2471 num_values:300 meta_offset:3513
encodings: RLE, BYTE_STREAM_SPLIT
statistics: [min:(-3.0461430547999266,), max:(2.6962240525635797,), num_nulls:0]
encoding_stats: [DATA_PAGE:BYTE_STREAM_SPLIT:1]
Once I'm back in the office I can try to reproduce with the latest code.
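The pattern in these dumps can be checked mechanically: a file_offset that equals the chunk's first page offset almost certainly points at a PageHeader rather than at a redundant ColumnMetaData. A small sketch (file_offset_is_suspect is a hypothetical helper; the sample values are taken from the dumps above):

```python
def file_offset_is_suspect(file_offset, data_page_offset,
                           dictionary_page_offset=None):
    """Heuristic from the dumps above: parquet-mr writes file_offset equal to
    the chunk's first page offset, where a PageHeader (not a redundant
    ColumnMetaData) actually lives."""
    first_page = (dictionary_page_offset
                  if dictionary_page_offset is not None
                  else data_page_offset)
    return file_offset == first_page

# int32_with_null_pages.parquet: file_offset and data_page offset are both 4.
assert file_offset_is_suspect(4, 4)
# byte_stream_split.zstd.parquet (parquet-cpp-arrow): meta_offset 1162 does
# not coincide with the first page at offset 4, so it is not flagged.
assert not file_offset_is_suspect(1162, 4)
```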
Gang Wu / @wgtmac: IMO, the best fix is to write the ColumnMetaData at the end of the column chunk, as the spec describes. However, this has been implemented incorrectly since day one, and it would take non-trivial effort to fix in parquet-java. I think we can instead clarify that the field is deprecated and should not be relied upon, given this known issue. Perhaps it is also a good time to add a note to the spec: write 0 or -1 to file_offset when no ColumnMetaData is actually present at the end of the column chunk.
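The suggestion above can be sketched from the writer's side. This is a hypothetical helper, not parquet-java code, assuming 0 as the "unset" sentinel proposed for the spec note:

```python
def column_chunk_file_offset(trailing_metadata_offset=None):
    """Hypothetical writer-side rule per the suggestion above: only record a
    real file_offset when a redundant ColumnMetaData was actually written at
    the end of the column chunk; otherwise write the sentinel 0 ("unset")
    rather than copying the first data page offset."""
    if trailing_metadata_offset is None:
        return 0  # assumed sentinel from the proposed spec clarification
    return trailing_metadata_offset

# No trailing ColumnMetaData was written, so no page offset is smuggled in.
assert column_chunk_file_offset() == 0
assert column_chunk_file_offset(trailing_metadata_offset=3513) == 3513
```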
Original issue description:
In an effort to understand the Parquet format better, I've written my own Thrift parser, and upon examining its output, I noticed something peculiar.
To begin with, check out the definition for ColumnChunk here: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift
You'll notice that if element 2 is present in the struct, it is supposed to be a file offset pointing to a redundant copy of the ColumnMetaData.
Next, have a look at the file called "modified.parquet" attached to https://issues.apache.org/jira/browse/PARQUET-2069. When I dump the metadata at the end of the file, I get this:
Struct(FileMetaData):
1: i32(version) = I32(1)
2: List(SchemaElement schema):
{{ ...
3: i64(num_rows) = I64(1)
4: List(RowGroup row_groups):
1: Struct(RowGroup row_groups):
1: List(ColumnChunk columns):
1: Struct(ColumnChunk columns):
2: i64(file_offset) = I64(4)
3: Struct(ColumnMetaData meta_data):
1: Type(type) = I32(6) = BYTE_ARRAY
2: List(Encoding encodings):
1: Encoding(encodings) = I32(0) = PLAIN
2: Encoding(encodings) = I32(3) = RLE
3: List(string path_in_schema):
1: string(path_in_schema) = Binary("destination_addresses")
2: string(path_in_schema) = Binary("array")
3: string(path_in_schema) = Binary("element")
4: CompressionCodec(codec) = I32(0) = UNCOMPRESSED
5: i64(num_values) = I64(6)
6: i64(total_uncompressed_size) = I64(197)
7: i64(total_compressed_size) = I64(197)
9: i64(data_page_offset) = I64(4)
}}
As you can see, element 2 of the ColumnChunk indicates that there is another copy of the ColumnMetaData at offset 4 of the file. But then we see that element 9 of the ColumnMetaData shown above indicates that the data page offset is ALSO 4, where we should find a Thrift encoding of a PageHeader structure. Obviously, both structures can't be in the same place, and in fact a PageHeader is what is located there.
Based on what I'm seeing here, I believe that element 2 of ColumnChunk should be omitted entirely in this scenario, so as not to falsely indicate that another copy of the ColumnMetaData exists at a location in the file where something else is actually present.
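A reader that wants to tolerate such files could guard the field defensively. This is a hypothetical sketch, not parquet-java code; it treats file_offset as unusable when it collides with the first data page, as in the dump above:

```python
def redundant_metadata_offset(file_offset, data_page_offset):
    """Hypothetical reader-side guard: trust file_offset as a pointer to a
    redundant ColumnMetaData only when it cannot be the chunk's first
    PageHeader (the bogus value described in this issue)."""
    if file_offset in (0, None) or file_offset == data_page_offset:
        return None  # no trustworthy redundant ColumnMetaData
    return file_offset

# modified.parquet: file_offset == data_page_offset == 4, so the
# redundant-metadata pointer should be ignored.
assert redundant_metadata_offset(4, 4) is None
```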
It may take me a while to locate the offending code, but I thought I'd go ahead and point this out before I set off to investigate.
Reporter: Timothy Miller / @theosib-amazon
PRs and other links:
Note: This issue was originally created as PARQUET-2139. Please see the migration documentation for further details.