-
Notifications
You must be signed in to change notification settings - Fork 823
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix MapArrayReader
(#2484) (#1699) (#1561)
#2500
Changes from 1 commit
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -29,6 +29,8 @@ pub struct MapArrayReader { | |
} | ||
|
||
impl MapArrayReader { | ||
/// Creates a new [`MapArrayReader`] with a `def_level`, `rep_level` and `nullable` | ||
/// as defined on [`ParquetField`][crate::arrow::schema::ParquetField] | ||
pub fn new( | ||
key_reader: Box<dyn ArrayReader>, | ||
value_reader: Box<dyn ArrayReader>, | ||
|
@@ -46,7 +48,9 @@ impl MapArrayReader { | |
let element = match &data_type { | ||
ArrowType::Map(element, _) => match element.data_type() { | ||
ArrowType::Struct(fields) if fields.len() == 2 => { | ||
// The inner map field must always non-nullable (#1697) | ||
// Parquet cannot represent nullability at this level (#1697) | ||
// and so encountering nullability here indicates some manner | ||
// of schema inconsistency / inference bug | ||
assert!(!element.is_nullable(), "map struct cannot be nullable"); | ||
element | ||
} | ||
|
@@ -94,6 +98,10 @@ impl ArrayReader for MapArrayReader { | |
let array = self.reader.consume_batch().unwrap(); | ||
let data = array.data().clone(); | ||
let builder = data.into_builder().data_type(self.data_type.clone()); | ||
|
||
// SAFETY - we can assume that ListArrayReader produces valid ListArray | ||
// of the expected type, and as such its output can be reinterpreted as | ||
// a MapArray without validation | ||
Ok(Arc::new(MapArray::from(unsafe { | ||
builder.build_unchecked() | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Pretty much, will add a doc comment |
||
}))) | ||
|
@@ -193,6 +201,8 @@ mod tests { | |
for maybe_record_batch in record_batch_reader { | ||
let record_batch = maybe_record_batch.expect("Getting current batch"); | ||
let col = record_batch.column(0); | ||
assert!(col.is_null(0)); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ❤️ |
||
assert!(col.is_null(1)); | ||
let map_entry = array::as_map_array(col).value(2); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Shouldn't this test also validate that the first two entries are null? Something like assert!(!array::as_map_array(col).is_valid(0));
assert!(!array::as_map_array(col).is_valid(1)); There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The test was originally written to test my very specific failure condition. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Possibly a worthwhile addition though. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, sorry I probably should have phrased that as "I suggest this test also validate the first two entries are null" |
||
let struct_col = array::as_struct_array(&map_entry); | ||
let key_col = array::as_string_array(struct_col.column(0)); // Key column | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As described in #1697 the encoding of DataType::Map permits greater nullability than the type can actually contain
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#1697 is a question -- so is the solution "we will assume that the inner struct can not be nullable until we have an existence proof to the contrary?"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd say it is more, if this level is nullable the schema is inconsistent as there is no way to represent that in parquet 😅 The schema inference logic will never generate this - https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/schema/complex.rs#L350
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe worth a comment there