-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: support LargeList
in make_array
and array_length
#8121
Conversation
fn return_large_array() -> ArrayRef { | ||
// Returns: [1, 2, 3, 4] | ||
let capacity = i32::MAX as usize + 10; | ||
let args = vec![Arc::new(Int64Array::from(vec![Some(1)])) as ArrayRef; capacity]; | ||
|
||
println!("args.len() = {}", args.len()); | ||
|
||
make_array(&args).expect("failed to initialize function array") | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know how to efficiently create a LargeList
using make_array
as it runs out of memory on my end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe you should use below constructor to construct LargeList
let list_array = Arc::new(LargeListArray::from_iter_primitive::<Int32Type, _, _>(
vec![
Some(vec![Some(0), Some(1), Some(2)]),
None,
Some(vec![Some(3), None, Some(5)]),
],
)) as ArrayRef
91612a4
to
b5f9e68
Compare
fn return_large_array() -> ArrayRef { | ||
// Returns: [1, 2, 3, 4] | ||
let args = [ | ||
Arc::new(Int64Array::from(vec![Some(1)])) as ArrayRef, | ||
Arc::new(Int64Array::from(vec![Some(2)])) as ArrayRef, | ||
Arc::new(Int64Array::from(vec![Some(3)])) as ArrayRef, | ||
Arc::new(Int64Array::from(vec![Some(4)])) as ArrayRef, | ||
]; | ||
let data_type = DataType::Int64; | ||
array_array::<i64>(&args, data_type).expect("failed to initialize function array") | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hack here to create a LargeList
to avoid spending too much memory for testing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the hack?
f7bdb8f
to
a06fbf8
Compare
a06fbf8
to
d340ff7
Compare
@alamb @jayzhan211 @Veeupup |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the contribution @Weijun-H -- the basic structure of most of this PR looks good to me -- thank you. I left some comments, but I think it is pretty close.
Perhaps @jayzhan211 has some additional thoughts to share
datafusion/common/src/scalar.rs
Outdated
@@ -1716,7 +1716,11 @@ impl ScalarValue { | |||
} else { | |||
Self::iter_to_array(values.iter().cloned()).unwrap() | |||
}; | |||
Arc::new(array_into_list_array(values)) | |||
if values.len() <= i32::MAX as usize { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the ScalarValue type needs to match the underyling array type -- so with this change, a ScalarValue::List
might return LargeListArray
or ListArray
. This mismatch will likely cause issues downstream
I think the standard pattern would be to add a new function ScalarValue::new_large_list
and ScalarValue::LargeList
if they don't already exist
fn return_large_array() -> ArrayRef { | ||
// Returns: [1, 2, 3, 4] | ||
let args = [ | ||
Arc::new(Int64Array::from(vec![Some(1)])) as ArrayRef, | ||
Arc::new(Int64Array::from(vec![Some(2)])) as ArrayRef, | ||
Arc::new(Int64Array::from(vec![Some(3)])) as ArrayRef, | ||
Arc::new(Int64Array::from(vec![Some(4)])) as ArrayRef, | ||
]; | ||
let data_type = DataType::Int64; | ||
array_array::<i64>(&args, data_type).expect("failed to initialize function array") | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the hack?
@@ -2939,6 +3011,43 @@ mod tests { | |||
as_uint64_array(&arr).expect("failed to initialize function array_length"); | |||
|
|||
assert_eq!(result, &UInt64Array::from(vec![None])); | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than extending this module with more rust tests, can you perhaps add SQL level tests using sqllogictest?
That will ensure the functions are usable end to end and that all the connections are in place
Perhaps https://github.com/apache/arrow-datafusion/blob/main/datafusion/sqllogictest/test_files/array.slt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how to create an large list in sql file. I think it's why he done the test here? @Weijun-H
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I do not know how we could construct the LargeList
in sqllogictest, instead of an array with more i32:MAX
by make_array
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current idea is to use arrow_cast()
to create a LargeList instead of make_array
. This pr is stalled by #8290 @alamb @jayzhan211
offsets.push(0); | ||
|
||
let mut offsets: Vec<O> = Vec::with_capacity(total_len); | ||
offsets.push(O::from_usize(0).unwrap()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
usize_as
Arc::new(Field::new("item", data_type, true)), | ||
OffsetBuffer::new(offsets.into()), | ||
OffsetBuffer::new(ScalarBuffer::from(offsets)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need this for usize_as
match &args[0].data_type() { | ||
DataType::List(_) => _array_length_list::<i32>(args), | ||
DataType::LargeList(_) => _array_length_list::<i64>(args), | ||
_ => Err(DataFusionError::Internal(format!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
macro
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unresolved
} | ||
|
||
/// array_length for List and LargeList | ||
fn _array_length_list<O: OffsetSizeTrait>(args: &[ArrayRef]) -> Result<ArrayRef> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any other name without _? In rust _xxx usually means unused.
818685c
to
6839aaa
Compare
# array_length scalar function #4 | ||
query II | ||
select array_length(array_repeat(array_repeat(array_repeat(3, 5), 2), 3), 1), array_length(array_repeat(array_repeat(array_repeat(3, 5), 2), 3), 2); | ||
---- | ||
3 2 | ||
|
||
query II | ||
select array_length(arrow_cast(array_repeat(array_repeat(array_repeat(3, 5), 2), 3), 'LargeList(List(List(Int64)))'), 1), array_length(arrow_cast(array_repeat(array_repeat(array_repeat(3, 5), 2), 3), 'LargeList(List(List(Int64)))'), 2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can not test LargeList(LargeList)
because arrow-rs does not support it yet. #8305
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could it be possible to add a test that tries to cast LargeList(List)
as a follow on PR? It would, of course, error initially, but then when we upgraded arrow to a version that did support that cast, we would have the test coverage
6839aaa
to
5d5d0da
Compare
@@ -355,11 +364,11 @@ fn array_array(args: &[ArrayRef], data_type: DataType) -> Result<ArrayRef> { | |||
mutable.extend_nulls(1); | |||
} | |||
} | |||
offsets.push(mutable.len() as i32); | |||
offsets.push(O::from_usize(mutable.len()).unwrap()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we dont need unwrap if casting via usize_as
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM
Thank you for reviewing 👍 @jayzhan211 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
0e58ca8
to
5d678a0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this PR @Weijun-H -- I think it is very close
match data_type { | ||
// Either an empty array or all nulls: | ||
DataType::Null => { | ||
let array = new_null_array(&DataType::Null, arrays.len()); | ||
Ok(Arc::new(array_into_list_array(array))) | ||
if len <= i32::MAX as usize { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't think that make_array
should ever return NullArray
(aka DataType::Null
) and instead it should always return ListArray
(possibly with a null value)
However, this PR doesn't make this situation worse, so 👍
array_length
95a8396
to
05e365a
Compare
0b0e475
to
6caf58f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me -- thank you @Weijun-H
# array_length scalar function #4 | ||
query II | ||
select array_length(array_repeat(array_repeat(array_repeat(3, 5), 2), 3), 1), array_length(array_repeat(array_repeat(array_repeat(3, 5), 2), 3), 2); | ||
---- | ||
3 2 | ||
|
||
query II | ||
select array_length(arrow_cast(array_repeat(array_repeat(array_repeat(3, 5), 2), 3), 'LargeList(List(List(Int64)))'), 1), array_length(arrow_cast(array_repeat(array_repeat(array_repeat(3, 5), 2), 3), 'LargeList(List(List(Int64)))'), 2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could it be possible to add a test that tries to cast LargeList(List)
as a follow on PR? It would, of course, error initially, but then when we upgraded arrow to a version that did support that cast, we would have the test coverage
THanks for sticking with these PRs -- I know it has taken time, but now that we have the patterns down I feel like the code is really improving at a nice rate. This is what I think successful software development and incremental improvement looks like! |
…8121) * feat: support LargeList in make_array and array_length * chore: add tests * fix: update tests for nested array * use usise_as * add new_large_list * refactor array_length * add comment * update test in sqllogictest * fix ci * fix macro * use usize_as * update comment * return based on data_type in make_array
Which issue does this PR close?
Parts #8084
Parts #8185
Rationale for this change
The current implementation of the
make_array
andarray_length
functions lack support forLargeList
. Therefore, we need to enhance these functions by incorporating theOffsizeTrait
to enable them to work seamlessly withLargeList
. This will ensure that the functions can handle a large number of elements in the list, thus improving their overall efficiency.What changes are included in this PR?
make_array
can beList
andLargeList
depending on the array lengthLargeList
inarray_lenght
array_array
andarray!
can considerOffsizeTrait
Are these changes tested?
Are there any user-facing changes?