File containing a Map schema without explicitly required key #47

Merged

Conversation


@jupiter jupiter commented Apr 12, 2024

To support testing of apache/arrow-rs#5630.


wgtmac commented Apr 13, 2024

I do remember this issue. My question is whether those engines have fixed this or are still producing optional map keys.


jupiter commented Apr 15, 2024

I found fixes for this in:

Of course, it will take some time for all new files to be produced with these fixes, and a large amount of existing data out there remains.


tustvold commented Apr 15, 2024

I don't have merge rights to this repo, but it makes sense, at least to me, to have examples of such malformed files, much like we have files with other forms of corruption.


@wgtmac wgtmac left a comment


+1

cc @pitrou


tustvold commented Apr 15, 2024

FWIW pyarrow/arrow-cpp currently refuses to read this file


wgtmac commented Apr 16, 2024

> I found fixes for this in:
>
> Of course it will take some time for all new files to be produced with these fixes, and the amount of existing data out there remains.

Thanks for the search! What about linking the fix commits in the document as well? Then readers will know that only files produced before these versions are corrupted, and other engines with the same behavior can fix the issue in the same way.

data/README.md Outdated
@@ -387,3 +388,40 @@ To check conformance of a `BYTE_STREAM_SPLIT` decoder, read each
`BYTE_STREAM_SPLIT`-encoded column and compare the decoded values against
the values from the corresponding `PLAIN`-encoded column. The values should
be equal.

## Hive Map Schema
Member


Why "Hive"?

Contributor Author


This was due to the schema being labeled with `message hive_schema {...`, but anyone searching for this keyword should still find it below. I'll rename this section to match the filename.

data/README.md Outdated

## Hive Map Schema

A number of producers, such as Presto/Trino/Athena, create files with schemas where the Map fields are not explicitly marked as required. An optional key is not possible according to the Parquet spec, but the schema is getting created this way.
Member


Suggested change
A number of producers, such as Presto/Trino/Athena, create files with schemas where the Map fields are not explicitly marked as required. An optional key is not possible according to the Parquet spec, but the schema is getting created this way.
A number of producers, such as Presto/Trino/Athena, used to create files with schemas
where the Map key fields are marked as optional rather than required.
This is not spec-compliant, yet appears in a number of existing data files in the wild.
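For illustration, here is a minimal sketch of the malformed schema shape (column name and types are hypothetical, modeled on the `hive_schema` label the file uses): the Parquet spec requires the map key to be `required`, but affected writers emit it as `optional`.

```
message hive_schema {
  optional group my_map (MAP) {
    repeated group key_value {
      optional binary key (STRING);
      optional binary value (STRING);
    }
  }
}
```

A spec-compliant writer would instead declare the key as `required binary key (STRING);`.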

data/README.md Outdated

Of course it will take some time for all new files to be produced with these fixes, and the amount of existing data out there remains.

We can recreate these problematic files for testing [arrow-rs #5630](https://github.com/apache/arrow-rs/pull/5630) with relevant Presto/Trino CLI, or with AWS Athena Console:
Member


Suggested change
We can recreate these problematic files for testing [arrow-rs #5630](https://github.com/apache/arrow-rs/pull/5630) with relevant Presto/Trino CLI, or with AWS Athena Console:
We can recreate these problematic files for testing [arrow-rs #5630](https://github.com/apache/arrow-rs/pull/5630)
with relevant Presto/Trino CLI, or with AWS Athena Console:

data/README.md Outdated
@@ -50,6 +50,7 @@
| float16_zeros_and_nans.parquet | Float16 (logical type) column with NaNs and zeros as min/max values. See [note](#float16-files) below |
| concatenated_gzip_members.parquet | 513 UINT64 numbers compressed using 2 concatenated gzip members in a single data page |
| byte_stream_split.zstd.parquet | Standard normals with `BYTE_STREAM_SPLIT` encoding. See [note](#byte-stream-split) below |
| hive-map-schema.parquet | Contains a Map schema without explicitly required keys, produced by Presto. See [note](#hive-map-schema) |
Member


Instead of "hive", can we name this e.g. "incorrect_map_schema.parquet"?

@jupiter jupiter requested a review from pitrou April 16, 2024 14:08

@pitrou pitrou left a comment


+1, thank you @jupiter

@pitrou pitrou merged commit 1ba3447 into apache:master Apr 16, 2024