-
Notifications
You must be signed in to change notification settings - Fork 435
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-468: Clarify MAP logical type #469
Conversation
be `required`, `optional`, or omitted. | ||
be `required`, `optional`, or omitted. It must be placed at the 1st position | ||
of the `key_value` group if present. In case of not present, the map can be | ||
either represented with all null values or as a set of keys. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just realized that the value can be emitted 🤔 When would a case be like that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I've never seen an actual use-case implemented for this one either. I am open to require having a value but since it is out there for a while now, I am not sure if we can do it. Maybe a suggestion to always specify a value and let it be an optional with all nulls if one would not use it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we move omitted value field to the backward compatibility section, which is all about cases that are not suggested?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure backward compatibility section would be the best place for this. That section is mainly for supporting Parquet files that were written before the proper spec.
Anyway, it seems similar to the nested key question. @JFinis, WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just realized that the value can be emitted 🤔 When would a case be like that?
Perhaps when using a MAP
to encode a set? 🤷
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
parquet-cpp does not support set, it uses a list instead: https://github.com/apache/arrow/blob/c3601a97a0718ae47726e6c134cbed4b98bd1a36/cpp/src/parquet/arrow/schema.cc#L578-L583
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case of not present, the map can be either represented with all null values or as a set of keys.
What about the below one?
In case of not present, it can be represented as a map with all null values or as a set of keys.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original discussion backs up this interpretation. I think it's good to leave the explanation in.
LogicalTypes.md
Outdated
|
||
Some existing data incorrectly used `MAP_KEY_VALUE` in place of `MAP`. For | ||
backward-compatibility, a group annotated with `MAP_KEY_VALUE` that is not | ||
contained by a `MAP`-annotated group should be handled as a `MAP`-annotated | ||
group. | ||
group. `MAP_KEY_VALUE` may be used for the `key_value` group. | ||
|
||
Examples that can be interpreted using these rules: | ||
|
||
``` | ||
// Map<String, Integer> (nullable map, non-null values) | ||
optional group my_map (MAP) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to have the one below?
optional group my_map {
repeated group map (MAP_KEY_VALUE) {
required binary str (STRING);
required int32 num;
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I don't think so. At the topmost element we need either a MAP
or a MAP_KEY_VALUE
(for backward compatibility).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about this? I'm just deducing possible types from the new sentence.
optional group my_map (MAP_KEY_VALUE) {
repeated group map (MAP_KEY_VALUE) {
required binary str (STRING);
required int32 num;
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This new sentence should not change anything. Just wanted to highlight why it was originally invented. If it causes confusion, it does not worth it.
(At the repeated level we do not say anything about its potential logical type. So, a reader should ignore anything there anyway.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wgtmac, do you think this sentence is misleading? I'm happy to remove it. The implementation do not need to check the logical type of the repeated field anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is. Especially when we have said that the name of key_value
is not required in the previous paragraph.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me too. Thanks!
While trying to make parquet-rs conform to the spec, we found we had no example data to test with. @etseidl has proposed adding a file with a However, it was hand crafted using parquet-rs. Could someone take a look, and see if this file matches your expectations? Maybe we could also help find / provide an example of such a file created by a system |
Rationale for this change
The specification of the MAP logical type is unclear in some points and makes it hard for the implementations to properly identify and handle such structures.
What changes are included in this PR?
Extend the MAP specification to be more clear.
Do these changes have PoC implementations?
Not required
Closes #468