-
Notifications
You must be signed in to change notification settings - Fork 849
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(arrow-select): concat
kernel will merge dictionary values for list of dictionaries
#6893
feat(arrow-select): concat
kernel will merge dictionary values for list of dictionaries
#6893
Conversation
TODO: - [ ] Add support to nested lists - [ ] Add more tests - [ ] Fix failing test
concat
kernel will merge dictionary values for list of dictionaries
arrow-select/src/concat.rs
Outdated
@@ -129,12 +130,106 @@ fn concat_dictionaries<K: ArrowDictionaryKeyType>( | |||
Ok(Arc::new(array)) | |||
} | |||
|
|||
fn concat_list_of_dictionaries<OffsetSize: OffsetSizeTrait, K: ArrowDictionaryKeyType>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would significantly reduce the codegen to instead split concatenating the dictionary values, from re-encoding the offsets. As an added bonus this could also be done recursively.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated, please let me know it that was what you meant, I think it will be slower this way, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've taken a quick look and left some suggestions on how to simplify this, in particular separating the concatenation of the list values and the list offsets will result in less code, require less codegen, and likely perform better.
i.e. something like (not tested)
DataType::List(f) => {
let values: Vec<_> = arrays.iter().map(|x| x.as_list::<i32>().values()).collect();
let new_values = concat(&values)?;
let offsets = OffsetBuffer::from_lengths(
arrays.iter().flat_map(|x| x.as_list::<i32>().offsets()
.windows(2).map(|w| w[1].as_usize() - w[0].as_usize()))
);
let nulls = ...
ListArray::new(f, offsets, new_values, nulls)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking very promising, just a small comment r.e. avoiding ArrayData
I'll try to find some time for a more thorough review in the next few days if nobody else gets there first
…list of dictionaries (apache#6893) * feat(arrow-select): make list of dictionary merge dictionary keys TODO: - [ ] Add support to nested lists - [ ] Add more tests - [ ] Fix failing test * fix concat lists of dictionaries * format * remove unused import * improve test helper * feat: add merge offset buffers into one * format * add reproduction tst * recommit * fix clippy * fix clippy * fix clippy * improve offsets code according to code review * use concat dictionaries * add specialize code to concat lists to be able to use the concat dictionary logic * remove the use of ArrayData
…list of dictionaries (apache#6893) * feat(arrow-select): make list of dictionary merge dictionary keys TODO: - [ ] Add support to nested lists - [ ] Add more tests - [ ] Fix failing test * fix concat lists of dictionaries * format * remove unused import * improve test helper * feat: add merge offset buffers into one * format * add reproduction tst * recommit * fix clippy * fix clippy * fix clippy * improve offsets code according to code review * use concat dictionaries * add specialize code to concat lists to be able to use the concat dictionary logic * remove the use of ArrayData
…list of dictionaries (apache#6893) * feat(arrow-select): make list of dictionary merge dictionary keys TODO: - [ ] Add support to nested lists - [ ] Add more tests - [ ] Fix failing test * fix concat lists of dictionaries * format * remove unused import * improve test helper * feat: add merge offset buffers into one * format * add reproduction tst * recommit * fix clippy * fix clippy * fix clippy * improve offsets code according to code review * use concat dictionaries * add specialize code to concat lists to be able to use the concat dictionary logic * remove the use of ArrayData
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have been testing the arrow 54.1.0 downstream in DataFusion and it appears this PR has introduced a regression:
I will try and look at it tomorrow if no one beats me to it.
Bug is related to concatenating sliced list arrays |
|
||
let values: Vec<&dyn Array> = lists | ||
.iter() | ||
.map(|x| x.values().as_ref()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the bug is here, it needs to slice values() by the first offset
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe you are correct (also needs to slice values to get the last offset)
Proposed fix here:
Which issue does this PR close?
Closes #6888
What changes are included in this PR?
merge
function inBufferOffset
for merging multipleBuffersOffset
into one (which needed for the concat merge)Are there any user-facing changes?
Yes, the
merge
function in theBufferOffset
struct