Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MapArray Additional Nullability #1697

Open
tustvold opened this issue May 13, 2022 · 2 comments
Open

MapArray Additional Nullability #1697

tustvold opened this issue May 13, 2022 · 2 comments
Labels
question Further information is requested

Comments

@tustvold
Copy link
Contributor

Which part is this question about

A MapArray is represented in parquet as follows, see here.

<map-repetition> group <name> (MAP) {
  repeated group key_value {
    required <key-type> key;
    <value-repetition> <value-type> value;
  }
}

However, it is represented in arrow as

Field::new(
    <name>,
    DataType::Map(
        Box::new(Field::new("entries", DataType::Struct(..), <nullable>))
        <is_sorted>
    )
   <nullable>
)

The confusion lies in the fact we now have nullability at three levels:

  • Within the MapArray
  • Within the StructArray
  • Within the key and value arrays

When the original parquet data only has nullability at two levels.

The writer logic appears to ignore the nullability of the inner StructArray, which doesn't feel right.

Describe your question

  • Should the inner StructArray always be not nullable?
  • Could we store DataType instead of Field in DataType::Map
  • Could we store the child DataType directly in DataType::Map

Additional context

#1642 also relates to the semantics of MapArrays

Map Arrays were added by @nevi-me as part of #395

@tustvold tustvold added the question Further information is requested label May 13, 2022
tustvold added a commit to tustvold/arrow-rs that referenced this issue May 13, 2022
alamb pushed a commit that referenced this issue May 13, 2022
* Separate parquet -> arrow conversion logic (#1655)

Don't treat embedded arrow schema as authoritative (#1663)

Fix projection of nested parquet files (#1652) (#1654)

Fix schema inference for repeated fields (#1681)

Support reading alternative list representations from parquet (#1680)

* Add more tests

* Pass pointers by reference

* More docs

* Fix lint

* Review feedback

* Review feedback

* Fix test failures related to #1697
cmackenzie1 added a commit to cmackenzie1-contrib/delta-rs that referenced this issue Sep 8, 2023
This value was true but where arrow defines it as always false
https://github.com/apache/arrow-rs/blob/master/arrow-schema/src/field.rs#L230.

This is also described in apache/arrow-rs#1697.
cmackenzie1 added a commit to cmackenzie1-contrib/delta-rs that referenced this issue Sep 8, 2023
This value was true but where arrow defines it as always false
https://github.com/apache/arrow-rs/blob/master/arrow-schema/src/field.rs#L230.

This is also described in apache/arrow-rs#1697.

This also replaces `key_value` as the struct name with `entries` to remain
consistent with https://github.com/apache/arrow-rs/blob/878217b9e330b4f1ed13e798a214ea11fbeb2bbb/arrow-schema/src/datatype.rs#L247-L250
@wjones127
Copy link
Member

FYI, the arrow spec says:

Neither the "entries" field nor the "key" field may be nullable.

So it's safe to ignore the nullability at the struct level and for the key field.

cmackenzie1 added a commit to cmackenzie1-contrib/delta-rs that referenced this issue Sep 10, 2023
This value was true but where arrow defines it as always false
https://github.com/apache/arrow-rs/blob/master/arrow-schema/src/field.rs#L230.

This is also described in apache/arrow-rs#1697.

This also replaces `key_value` as the struct name with `entries` to remain
consistent with https://github.com/apache/arrow-rs/blob/878217b9e330b4f1ed13e798a214ea11fbeb2bbb/arrow-schema/src/datatype.rs#L247-L250
wjones127 added a commit to delta-io/delta-rs that referenced this issue Sep 11, 2023
# Description

This value was true but where arrow defines it as always false
https://github.com/apache/arrow-rs/blob/master/arrow-schema/src/field.rs#L230.

This is also described in apache/arrow-rs#1697.

This also replaces `key_value` as the struct name with `entries` to
remain consistent with
https://github.com/apache/arrow-rs/blob/878217b9e330b4f1ed13e798a214ea11fbeb2bbb/arrow-schema/src/datatype.rs#L247-L250

The description of the main changes of your pull request

# Related Issue(s)

- closes #1619 

# Documentation

<!---
Share links to useful documentation
--->

---------

Co-authored-by: Will Jones <[email protected]>
@ByteBaker
Copy link
Contributor

@alamb can you please review this once to confirm if this still has anything remaining?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

3 participants