-
Notifications
You must be signed in to change notification settings - Fork 821
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use Arc<[Buffer]>
instead of raw Vec<Buffer>
in GenericByteViewArray
for faster slice
#6427
base: main
Are you sure you want to change the base?
Conversation
…rray` for faster `slice`
Arc<[Buffer]>
instead of raw Vec<Buffer>
in `GenericByteViewA…Arc<[Buffer]>
instead of raw Vec<Buffer>
in GenericByteViewArray
for faster slice
Arc<[Buffer]>
instead of raw Vec<Buffer>
in GenericByteViewArray
for faster sliceArc<[Buffer]>
instead of raw Vec<Buffer>
in GenericByteViewArray
for faster slice
Unfortunately the use of impl is still a breaking change as it could impact type inference, e.g if collecting an interator into the argument |
@@ -234,7 +234,7 @@ impl<T: ByteViewType + ?Sized> GenericByteViewArray<T> { | |||
} | |||
|
|||
/// Deconstruct this array into its constituent parts | |||
pub fn into_parts(self) -> (ScalarBuffer<u128>, Vec<Buffer>, Option<NullBuffer>) { | |||
pub fn into_parts(self) -> (ScalarBuffer<u128>, Arc<[Buffer]>, Option<NullBuffer>) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is also a breaking change
FWIW the reason "breaking change" is important is that it restricts when we can merge this PR: https://github.com/apache/arrow-rs/blob/master/CONTRIBUTING.md#breaking-changes |
@@ -114,7 +114,7 @@ use super::ByteArrayType; | |||
pub struct GenericByteViewArray<T: ByteViewType + ?Sized> { | |||
data_type: DataType, | |||
views: ScalarBuffer<u128>, | |||
buffers: Vec<Buffer>, | |||
buffers: Arc<[Buffer]>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the rationale for Arc<[Buffer]> vs Vec<Arc>?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cloning an Arc
is relatively cheap (no allocation), cloning a Vec
isn't.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i get it. However, if i understand correctly, Arc<[Buffer]>
means the buffers can be passed around and shared only when they are within single slice, which can be limiting. For example, Can i merge two arrays, combining their Arc<Buffer>
s without moving or cloning the buffers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can i merge two arrays, combining their Arc s without moving or cloning the buffers?
No -- you would have to create a new Vec<Buffer
> (or some other way to get Arc<[Buffer]>
)
So while there are some cases where new allocations are required, slicing / cloning is faster
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you would have to create a new
Vec<Buffer
>
but that would prevent buffer sharing between two arrays, right?
slicing / cloning is faster
cloning yes
slicing -- i didn't see it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the observation is that during StringViewArray::slice
, the slice
actually happens on the view
s -- the buffers (that the views can point at) must be copied
Here is the clone of buffers
: https://docs.rs/arrow-array/53.1.0/src/arrow_array/array/byte_view_array.rs.html#385
Coming back to this now, I think we could merge this although I am curious if there is something funny going on here, which is why this is showing up in benchmarks. Someone popped up on discord the other day reporting a StringViewArray with ~10k buffers, this would suggest something is likely off somewhere. I wonder if kernels are blindly concatenating identical buffers together, instead of using something like Buffer::ptr_eq to avoid a new entry for the exact same buffer allocation? |
What was happening in DataFusion was we had a Filter --> Coalesce chain and thus basically calling However, to your point, it may well be the case that the input RecordBatches shared the same underlying buffer so maybe the same buffer was being appended multiple times @XiangpengHao do you remember if you looked for this ? (related to "Section 3.5: Buffer size tuning |
I think so: arrow-rs/arrow-data/src/transform/mod.rs Lines 630 to 637 in def94a8
I have experienced this when loading string view from Parquet. If the parquet data has 10k buffers of string data, the string view will just hold them. Typically we should run a filter and then gc it. This is handled in DF but if used out side DF users might need to do something similar like this: https://github.com/apache/datafusion/blob/c0ca4b4e449e07c3bcd6f3593fa31dd31ed5e0c5/datafusion/physical-plan/src/coalesce/mod.rs#L201-L221 In other words, if StringViewArray is constructed by us, it's very unlikely to have 10k buffers as we will exponentially grow the buffers ultil 2MB; so 10k buffers means ~20GB data |
Good call -- @onursatici has tracked down a related (perhaps the same) issue: |
All the discussion above not withstanding, I think the fact that this PR improves |
@@ -178,7 +178,7 @@ impl<T: ByteViewType + ?Sized> GenericByteViewArray<T> { | |||
Ok(Self { | |||
data_type: T::DATA_TYPE, | |||
views, | |||
buffers, | |||
buffers: buffers.into(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should take impl Into<Arc<[Buffer]>>
in this method as well
Agreed What do people think about extracting a newtype |
I think it would be a great idea. @ShiKaiWi is that something you would be willing to try? |
Sorry for the late reply (I was a bit busy with work in the last few months). I'd love to try it on this weekend. |
Thank @ShiKaiWi -- I don't think there is any huge rush to get this feature in, so it would be great if you could do so ❤️ |
Converting to draft as I think there is some new planned work here. |
Which issue does this PR close?
Close #6408
Rationale for this change
In the
GenericByteViewArray
, thebuffers
field is a raw vector, leading to heap allocation when some methods are called, e.g.clone
,slice
. UsingArc<[Buffer]>
instead of the rawVec<Buffer>
can avoid such heap allocation.And the newly-add benchmark cases about
slice
shows the improvement:What changes are included in this PR?
Use
Arc<[Buffer]>
instead of the rawVec<Buffer>
as the type ofbuffers
field ofGenericByteViewArray
.Are there any user-facing changes?
The signature of the method
GenericByteViewArray::new_unchecked
is changed from:to
However, any usage of this method before this PR should still work without any modification.