Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-468: Clarify MAP logical type #469

Merged
merged 4 commits into from
Nov 12, 2024
Merged

Conversation

gszadovszky
Copy link
Contributor

Rationale for this change

The specification of the MAP logical type is unclear in some points and makes it hard for the implementations to properly identify and handle such structures.

What changes are included in this PR?

Extend the MAP specification to be more clear.

Do these changes have PoC implementations?

Not required

Closes #468

LogicalTypes.md Outdated Show resolved Hide resolved
be `required`, `optional`, or omitted.
be `required`, `optional`, or omitted. It must be placed at the 1st position
of the `key_value` group if present. In case of not present, the map can be
either represented with all null values or as a set of keys.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized that the value can be emitted 🤔 When would a case be like that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I've never seen an actual use-case implemented for this one either. I am open to require having a value but since it is out there for a while now, I am not sure if we can do it. Maybe a suggestion to always specify a value and let it be an optional with all nulls if one would not use it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we move omitted value field to the backward compatibility section, which is all about cases that are not suggested?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure backward compatibility section would be the best place for this. That section is mainly for supporting Parquet files that were written before the proper spec.
Anyway, it seems similar to the nested key question. @JFinis, WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized that the value can be emitted 🤔 When would a case be like that?

Perhaps when using a MAP to encode a set? 🤷

Copy link
Member

@wgtmac wgtmac Nov 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wgtmac, @etseidl, WDYT, shall I simply remove this last sentence about handling the case of "missing" values? I don't think we can change the original spec because it is clearly stated that value can be omitted. Then, it is up to the implementation how they support/not support this case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case of not present, the map can be either represented with all null values or as a set of keys.

What about the below one?

In case of not present, it can be represented as a map with all null values or as a set of keys.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original discussion backs up this interpretation. I think it's good to leave the explanation in.

LogicalTypes.md Outdated Show resolved Hide resolved
LogicalTypes.md Outdated Show resolved Hide resolved
LogicalTypes.md Outdated Show resolved Hide resolved
LogicalTypes.md Outdated

Some existing data incorrectly used `MAP_KEY_VALUE` in place of `MAP`. For
backward-compatibility, a group annotated with `MAP_KEY_VALUE` that is not
contained by a `MAP`-annotated group should be handled as a `MAP`-annotated
group.
group. `MAP_KEY_VALUE` may be used for the `key_value` group.

Examples that can be interpreted using these rules:

```
// Map<String, Integer> (nullable map, non-null values)
optional group my_map (MAP) {
Copy link
Member

@wgtmac wgtmac Nov 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to have the one below?

optional group my_map {
  repeated group map (MAP_KEY_VALUE) {
    required binary str (STRING);
    required int32 num;
  }
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I don't think so. At the topmost element we need either a MAP or a MAP_KEY_VALUE (for backward compatibility).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about this? I'm just deducing possible types from the new sentence.

optional group my_map (MAP_KEY_VALUE) {
  repeated group map (MAP_KEY_VALUE) {
    required binary str (STRING);
    required int32 num;
  }
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new sentence should not change anything. Just wanted to highlight why it was originally invented. If it causes confusion, it does not worth it.
(At the repeated level we do not say anything about its potential logical type. So, a reader should ignore anything there anyway.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wgtmac, do you think this sentence is misleading? I'm happy to remove it. The implementation do not need to check the logical type of the repeated field anyway.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is. Especially when we have said that the name of key_value is not required in the previous paragraph.

@gszadovszky
Copy link
Contributor Author

@wgtmac, @etseidl, I think the current version is according to the results of the discussions. Could you double check and approve if so?

Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me too. Thanks!

@gszadovszky gszadovszky merged commit c7cb964 into apache:master Nov 12, 2024
3 checks passed
@alamb
Copy link
Contributor

alamb commented Nov 19, 2024

While trying to make parquet-rs conform to the spec, we found we had no example data to test with.

@etseidl has proposed adding a file with a map that has no value in

However, it was hand crafted using parquet-rs. Could someone take a look, and see if this file matches your expectations? Maybe we could also help find / provide an example of such a file created by a system

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Logical type MAP specification is unclear
7 participants