Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add UnionArray::into_parts #5585

Merged
merged 5 commits into from
Apr 9, 2024
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
104 changes: 104 additions & 0 deletions arrow-array/src/array/union_array.rs
Original file line number Diff line number Diff line change
Expand Up @@ -319,6 +319,39 @@ impl UnionArray {
fields,
}
}

/// Deconstruct this array into its constituent parts
///
/// # Example
///
/// ```
/// # use arrow_array::types::Int32Type;
/// # use arrow_array::builder::UnionBuilder;
/// # fn main() -> Result<(), arrow_schema::ArrowError> {
/// let mut builder = UnionBuilder::new_dense();
/// builder.append::<Int32Type>("a", 1).unwrap();
/// let union_array = builder.build()?;
/// let (data_type, type_ids, offsets, fields) = union_array.into_parts();
/// # Ok(())
/// # }
/// ```
#[allow(clippy::type_complexity)]
pub fn into_parts(
self,
) -> (
DataType,
ScalarBuffer<i8>,
Option<ScalarBuffer<i32>>,
Vec<Option<ArrayRef>>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is exposing what is effectively an implementation detail of UnionArray, that it uses a sparse vec to encode type ids. We may want to revisit this in future, e.g. to better handle large type ids

) {
let Self {
data_type,
type_ids,
offsets,
fields,
} = self;
(data_type, type_ids, offsets, fields)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems PrimitiveArray is the only other array type that returns a DataType as part of its into_parts()

Maybe just return the fields, akin to how StructArray does it? (Though unsure about the Sparse/Dense enum)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion. In that case I think we should also return UnionMode (instead of users having to check if the offsets buffer for Some or None to figure that out):

(UnionFields, UnionMode, ScalarBuffer<i8>, Option<ScalarBuffer<i32>>, Vec<Option<ArrayRef>>)

}

impl From<ArrayData> for UnionArray {
Expand Down Expand Up @@ -505,6 +538,7 @@ impl std::fmt::Debug for UnionArray {
mod tests {
use super::*;

use crate::array::Int8Type;
use crate::builder::UnionBuilder;
use crate::cast::AsArray;
use crate::types::{Float32Type, Float64Type, Int32Type, Int64Type};
Expand Down Expand Up @@ -1201,4 +1235,74 @@ mod tests {
assert_eq!(v.len(), 1);
assert_eq!(v.as_string::<i32>().value(0), "baz");
}

#[test]
fn into_parts() {
let mut builder = UnionBuilder::new_dense();
builder.append::<Int32Type>("a", 1).unwrap();
builder.append::<Int8Type>("b", 2).unwrap();
builder.append::<Int32Type>("a", 3).unwrap();
let dense_union = builder.build().unwrap();

let field = [
Field::new("a", DataType::Int32, false),
Field::new("b", DataType::Int8, false),
];
let field_type_ids = [0, 1];
let (data_type, type_ids, offsets, fields) = dense_union.into_parts();
assert_eq!(
data_type,
DataType::Union(
UnionFields::new(field_type_ids, field.clone()),
UnionMode::Dense
)
);
assert_eq!(type_ids, [0, 1, 0]);
assert!(offsets.is_some());
assert_eq!(offsets.as_ref().unwrap(), &[0, 0, 1]);
assert_eq!(fields.len(), 2);

let result = UnionArray::try_new(
&[0, 1],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth showing this works with type ids of [6, 2] or something

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

type_ids.into_inner(),
offsets.map(ScalarBuffer::into_inner),
field
.clone()
.into_iter()
.zip(fields.into_iter().flatten())
.collect(),
);
assert!(result.is_ok());
assert_eq!(result.unwrap().len(), 3);

let mut builder = UnionBuilder::new_sparse();
builder.append::<Int32Type>("a", 1).unwrap();
builder.append::<Int8Type>("b", 2).unwrap();
builder.append::<Int32Type>("a", 3).unwrap();
let sparse_union = builder.build().unwrap();

let (data_type, type_ids, offsets, fields) = sparse_union.into_parts();
assert_eq!(
data_type,
DataType::Union(
UnionFields::new(field_type_ids, field.clone()),
UnionMode::Sparse
)
);
assert_eq!(type_ids, [0, 1, 0]);
assert!(offsets.is_none());
assert_eq!(fields.len(), 2);

let result = UnionArray::try_new(
&[0, 1],
type_ids.into_inner(),
offsets.map(ScalarBuffer::into_inner),
field
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this flattening should probably be an implementation detail of into_parts

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the fields argument of into_parts could be changed to impl Iterator<Item = (i8, ArrayRef)> to hide implementation details?

Copy link
Contributor

@tustvold tustvold Apr 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should probably reconstruct the arrays in the order expected by the constructor and return these as a Vec

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. The constructor expects the order (of the child arrays) to match the provided fields ids, however, if unknown, one could get these type_ids from the UnionFields iterator, which has no order guarantee:

/// Returns an iterator over the fields and type ids in this [`UnionFields`]
///
/// Note: the iteration order is not guaranteed
pub fn iter(&self) -> impl Iterator<Item = (i8, &FieldRef)> + '_ {
self.0.iter().map(|(id, f)| (*id, f))
}

.into_iter()
.zip(fields.into_iter().flatten())
.collect(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is correct if the type ids in the signature are out of order

);
assert!(result.is_ok());
assert_eq!(result.unwrap().len(), 3);
}
}
Loading