File containing a Map schema without explicitly required key #47
Conversation
I do remember this issue. My question is whether those engines have fixed this or are still producing optional map keys?
I found fixes for this in:
Of course it will take some time for all new files to be produced with these fixes, and the amount of existing data out there remains.
I don't have merge rights to this repo, but it makes sense at least to me to have examples of such malformed files, much like we have files with other forms of corruption.
+1
cc @pitrou
FWIW pyarrow/arrow-cpp currently refuses to read this file.
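(For reference, a rough sketch of how one would hit that rejection with pyarrow; the file path below is assumed, and the exact error type and message depend on the pyarrow/arrow-cpp version.)

```python
import pyarrow.parquet as pq

# Attempt to read the problematic file; a spec-enforcing reader is expected to
# reject the optional map key rather than silently accept it. Path is assumed.
try:
    table = pq.read_table("data/hive-map-schema.parquet")
    print(table.schema)
except Exception as exc:
    print(f"reader rejected the file: {exc}")
```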
Thanks for the search! What about linking the fixed commits to the document as well? Then readers will know that only files produced before these versions are corrupted, and that other engines with the same behavior should fix the issue in the same way.
data/README.md
@@ -387,3 +388,40 @@ To check conformance of a `BYTE_STREAM_SPLIT` decoder, read each
`BYTE_STREAM_SPLIT`-encoded column and compare the decoded values against
the values from the corresponding `PLAIN`-encoded column. The values should
be equal.

## Hive Map Schema
Why "Hive"?
This was due to the schema being labeled with `message hive_schema {...`, but if anyone's searching for this keyword, they should still find it below. I'll rename this section to match the filename.
data/README.md
## Hive Map Schema

A number of producers, such as Presto/Trino/Athena, create files with schemas where the Map fields are not explicitly marked as required. An optional key is not possible according to the Parquet spec, but the schema is getting created this way.
Suggested change:
- A number of producers, such as Presto/Trino/Athena, create files with schemas where the Map fields are not explicitly marked as required. An optional key is not possible according to the Parquet spec, but the schema is getting created this way.
+ A number of producers, such as Presto/Trino/Athena, used to create files with schemas
+ where the Map key fields are marked as optional rather than required.
+ This is not spec-compliant, yet appears in a number of existing data files in the wild.
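For comparison, here is a minimal pyarrow sketch of what a conforming writer produces (file and column names are illustrative); the point is that the `key` field inside the map's `key_value` group comes out as `required`, whereas the files discussed here mark it as `optional`:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Write a small map column; pyarrow emits the map key field as `required`.
table = pa.table({"m": pa.array([[("a", 1)], [("b", 2)]],
                                type=pa.map_(pa.string(), pa.int32()))})
pq.write_table(table, "/tmp/compliant_map.parquet")

# Dump the Parquet-level schema. Look for `required ... key` inside the
# key_value group; the problematic producers write `optional` there instead,
# which violates the Parquet spec.
print(pq.ParquetFile("/tmp/compliant_map.parquet").schema)
```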
data/README.md
Of course it will take some time for all new files to be produced with these fixes, and the amount of existing data out there remains.

We can recreate these problematic files for testing [arrow-rs #5630](https://github.com/apache/arrow-rs/pull/5630) with relevant Presto/Trino CLI, or with AWS Athena Console:
Suggested change:
- We can recreate these problematic files for testing [arrow-rs #5630](https://github.com/apache/arrow-rs/pull/5630) with relevant Presto/Trino CLI, or with AWS Athena Console:
+ We can recreate these problematic files for testing [arrow-rs #5630](https://github.com/apache/arrow-rs/pull/5630)
+ with relevant Presto/Trino CLI, or with AWS Athena Console:
data/README.md
@@ -50,6 +50,7 @@
| float16_zeros_and_nans.parquet | Float16 (logical type) column with NaNs and zeros as min/max values. See [note](#float16-files) below |
| concatenated_gzip_members.parquet | 513 UINT64 numbers compressed using 2 concatenated gzip members in a single data page |
| byte_stream_split.zstd.parquet | Standard normals with `BYTE_STREAM_SPLIT` encoding. See [note](#byte-stream-split) below |
| hive-map-schema.parquet | Contains a Map schema without explicitly required keys, produced by Presto. See [note](#hive-map-schema) |
Instead of "hive", can we name this e.g. "incorrect_map_schema.parquet"?
+1, thank you @jupiter
To support testing of apache/arrow-rs#5630.