File containing a Map schema without explicitly required key #47

Merged

Conversation


@jupiter jupiter commented Apr 12, 2024

To support testing of apache/arrow-rs#5630.


wgtmac commented Apr 13, 2024

I do remember this issue. My question is whether those engines have fixed this or are still producing optional map keys.


jupiter commented Apr 15, 2024

I found fixes for this in:

Of course, it will take some time for all new files to be produced with these fixes, and a large amount of existing data out there remains.


tustvold commented Apr 15, 2024

I don't have merge rights to this repo, but it makes sense, at least to me, to have examples of such malformed files, much like we have files with other forms of corruption.


@wgtmac wgtmac left a comment


+1

cc @pitrou


tustvold commented Apr 15, 2024

FWIW pyarrow/arrow-cpp currently refuses to read this file


wgtmac commented Apr 16, 2024

> I found fixes for this in:
>
> Of course it will take some time for all new files to be produced with these fixes, and the amount of existing data out there remains.

Thanks for the search! What about linking the fix commits in the document as well? Then readers will know that only files produced before these versions are corrupted, and other engines with the same behavior can fix the issue in the same way.

data/README.md Outdated
@@ -387,3 +388,40 @@ To check conformance of a `BYTE_STREAM_SPLIT` decoder, read each
`BYTE_STREAM_SPLIT`-encoded column and compare the decoded values against
the values from the corresponding `PLAIN`-encoded column. The values should
be equal.

## Hive Map Schema
Member


Why "Hive"?

Contributor Author


This was due to the schema being labeled with `message hive_schema {...`, but anyone searching for this keyword should still find it below. I'll rename this section to match the filename.

data/README.md Outdated

## Hive Map Schema

A number of producers, such as Presto/Trino/Athena, create files with schemas where the Map fields are not explicitly marked as required. An optional key is not possible according to the Parquet spec, but the schema is getting created this way.
Member


Suggested change
A number of producers, such as Presto/Trino/Athena, create files with schemas where the Map fields are not explicitly marked as required. An optional key is not possible according to the Parquet spec, but the schema is getting created this way.
A number of producers, such as Presto/Trino/Athena, used to create files with schemas
where the Map key fields are marked as optional rather than required.
This is not spec-compliant, yet appears in a number of existing data files in the wild.
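For illustration, here is a minimal sketch of the malformed schema shape (column name and types are hypothetical, modeled on the `hive_schema` label the file uses): the Parquet spec requires the map key to be `required`, but affected writers emit it as `optional`.

```
message hive_schema {
  optional group my_map (MAP) {
    repeated group key_value {
      optional binary key (STRING);
      optional binary value (STRING);
    }
  }
}
```

A spec-compliant writer would instead declare the key as `required binary key (STRING);`.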

data/README.md Outdated

Of course it will take some time for all new files to be produced with these fixes, and the amount of existing data out there remains.

We can recreate these problematic files for testing [arrow-rs #5630](https://github.com/apache/arrow-rs/pull/5630) with relevant Presto/Trino CLI, or with AWS Athena Console:
Member


Suggested change
We can recreate these problematic files for testing [arrow-rs #5630](https://github.com/apache/arrow-rs/pull/5630) with relevant Presto/Trino CLI, or with AWS Athena Console:
We can recreate these problematic files for testing [arrow-rs #5630](https://github.com/apache/arrow-rs/pull/5630)
with relevant Presto/Trino CLI, or with AWS Athena Console:

data/README.md Outdated
@@ -50,6 +50,7 @@
| float16_zeros_and_nans.parquet | Float16 (logical type) column with NaNs and zeros as min/max values. See [note](#float16-files) below |
| concatenated_gzip_members.parquet | 513 UINT64 numbers compressed using 2 concatenated gzip members in a single data page |
| byte_stream_split.zstd.parquet | Standard normals with `BYTE_STREAM_SPLIT` encoding. See [note](#byte-stream-split) below |
| hive-map-schema.parquet | Contains a Map schema without explicitly required keys, produced by Presto. See [note](#hive-map-schema) |
Member


Instead of "hive", can we name this e.g. "incorrect_map_schema.parquet"?

@jupiter jupiter requested a review from pitrou April 16, 2024 14:08

@pitrou pitrou left a comment


+1, thank you @jupiter

@pitrou pitrou merged commit 1ba3447 into apache:master Apr 16, 2024