Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix MapArrayReader (#2484) (#1699) (#1561) #2500

Merged
merged 3 commits into from
Aug 18, 2022
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
12 changes: 11 additions & 1 deletion parquet/src/arrow/array_reader/map_array.rs
Original file line number Diff line number Diff line change
Expand Up @@ -29,6 +29,8 @@ pub struct MapArrayReader {
}

impl MapArrayReader {
/// Creates a new [`MapArrayReader`] with a `def_level`, `rep_level` and `nullable`
/// as defined on [`ParquetField`][crate::arrow::schema::ParquetField]
pub fn new(
key_reader: Box<dyn ArrayReader>,
value_reader: Box<dyn ArrayReader>,
Expand All @@ -46,7 +48,9 @@ impl MapArrayReader {
let element = match &data_type {
ArrowType::Map(element, _) => match element.data_type() {
ArrowType::Struct(fields) if fields.len() == 2 => {
// The inner map field must always non-nullable (#1697)
// Parquet cannot represent nullability at this level (#1697)
// and so encountering nullability here indicates some manner
// of schema inconsistency / inference bug
assert!(!element.is_nullable(), "map struct cannot be nullable");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As described in #1697 the encoding of DataType::Map permits greater nullability than the type can actually contain

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#1697 is a question -- so is the solution "we will assume that the inner struct can not be nullable until we have an existence proof to the contrary?"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say it is more, if this level is nullable the schema is inconsistent as there is no way to represent that in parquet 😅 The schema inference logic will never generate this - https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/schema/complex.rs#L350

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe worth a comment there

element
}
Expand Down Expand Up @@ -94,6 +98,10 @@ impl ArrayReader for MapArrayReader {
let array = self.reader.consume_batch().unwrap();
let data = array.data().clone();
let builder = data.into_builder().data_type(self.data_type.clone());

// SAFETY - we can assume that ListArrayReader produces valid ListArray
// of the expected type, and as such its output can be reinterpreted as
// a MapArray without validation
Ok(Arc::new(MapArray::from(unsafe {
builder.build_unchecked()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This unsafe is OK because the inner StructArrayReader already validated the contents of the structs, and MapArrayReader::new() already validated the DataType?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty much, will add a doc comment

})))
Expand Down Expand Up @@ -193,6 +201,8 @@ mod tests {
for maybe_record_batch in record_batch_reader {
let record_batch = maybe_record_batch.expect("Getting current batch");
let col = record_batch.column(0);
assert!(col.is_null(0));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

assert!(col.is_null(1));
let map_entry = array::as_map_array(col).value(2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this test also validate that the first two entries are null? Something like

assert!(!array::as_map_array(col).is_valid(0));
assert!(!array::as_map_array(col).is_valid(1));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test was originally written to test my very specific failure condition.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly a worthwhile addition though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, sorry I probably should have phrased that as "I suggest this test also validate the first two entries are null"

let struct_col = array::as_struct_array(&map_entry);
let key_col = array::as_string_array(struct_col.column(0)); // Key column
Expand Down