-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Make array_reverse faster for List and FixedSizeList #18500
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
cc @Jefffrey 👀 |
| // Materialize values from underlying array with take | ||
| let indices_array: ArrayRef = if O::IS_LARGE { | ||
| Arc::new(UInt64Array::from( | ||
| indices | ||
| .iter() | ||
| .map(|i| i.as_usize() as u64) | ||
| .collect::<Vec<_>>(), | ||
| )) | ||
| } else { | ||
| Arc::new(UInt32Array::from( | ||
| indices | ||
| .iter() | ||
| .map(|i| i.as_usize() as u32) | ||
| .collect::<Vec<_>>(), | ||
| )) | ||
| }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is duplicated for ListView. I figured twice was not enough to extract to a function, but if we find a nicer way to do it, we can improve both.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice if we figured out a nicer way of doing this but I still can't figure it out 😅
Went a bit crazy with generics in attempting once but didn't pan out; but it works and the if branch isn't a big deal, though I wonder if it's more efficient to just have Int32/Int64 arrays instead of their unsigned variants to avoid needing a map -> collect 🤔
7f84c0e to
38af34a
Compare
Jefffrey
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A note regarding benchmarks, we might need to add a bit of randomness + null spread for more accurate benchmarking perhaps. As they currently are I believe they always create arrays of fixed offsets for every child list.
| // Materialize values from underlying array with take | ||
| let indices_array: ArrayRef = if O::IS_LARGE { | ||
| Arc::new(UInt64Array::from( | ||
| indices | ||
| .iter() | ||
| .map(|i| i.as_usize() as u64) | ||
| .collect::<Vec<_>>(), | ||
| )) | ||
| } else { | ||
| Arc::new(UInt32Array::from( | ||
| indices | ||
| .iter() | ||
| .map(|i| i.as_usize() as u32) | ||
| .collect::<Vec<_>>(), | ||
| )) | ||
| }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice if we figured out a nicer way of doing this but I still can't figure it out 😅
Went a bit crazy with generics in attempting once but didn't pan out; but it works and the if branch isn't a big deal, though I wonder if it's more efficient to just have Int32/Int64 arrays instead of their unsigned variants to avoid needing a map -> collect 🤔
| let mut mutable = | ||
| MutableArrayData::with_capacities(vec![&original_data], false, capacity); | ||
| let value_length = array.value_length() as usize; | ||
| let mut indices: Vec<u64> = Vec::with_capacity(values.len()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do wonder for FSL's if we can take advantage of knowing its a fixed size list to do it in a more efficient way? For example we know a FSL like so:
FSL with fixed length 3:
[[a, b, c], [d, e, f]]
Would always have reverse offsets like so:
[2, 1, 0, 5, 4, 3]
Without even needing to iterate the FSL array itself 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very elegant! Included in 41e1044
I've started on this but didn't finish -- I couldn't figure out how to properly set a random seed. Any pointers? 🤔 |
I don't know exactly what you mean, but maybe you can use the map benchmark as a reference? datafusion/datafusion/functions-nested/benches/map.rs Lines 57 to 59 in 7591919
They don't seem to set the seed. |
Thanks for the pointer! I thought I would need to set the seed because I was getting very variable results. |
|
Updated benchmark results after reworking those cc @Jefffrey Baseline is main implementation of reverse but new benchmark. I didn't end up using rng but just setting variable array sizes and nullability (see 388ab48). I reckon that's fine! Reversing the indexes directly for FixedSizeList was ~10% better (from 41e1044) than the previous thing I had, so thanks @Jefffrey! |
|
Thanks! |
Rationale for this change
Noticed while doing #18424 that the list types
ListandFixedSizeListusesMutableDatato build the reverse array. Usingtaketurns out to be a lot faster, ~70% for bothListandFixedSizeList. This PR also reworks the benchmark added in #18425, and these are the results on that compared to the implementation on main:Are these changes tested?
Covered by existing sqllogic tests, and one new test for
FixedSizeList.