Use Arc<[Buffer]> instead of raw Vec<Buffer> in GenericByteViewArray for faster slice #6427

Draft
wants to merge 2 commits into
base: main
2 changes: 1 addition & 1 deletion arrow-array/Cargo.toml
@@ -65,7 +65,7 @@ name = "occupancy"
harness = false

[[bench]]
-name = "gc_view_types"
+name = "view_types"
harness = false

[[bench]]
Original file line number Diff line number Diff line change
@@ -42,6 +42,12 @@ fn criterion_benchmark(c: &mut Criterion) {
black_box(sliced.gc());
});
});

+c.bench_function("view types slice", |b| {
+b.iter(|| {
+black_box(array.slice(0, 100_000 / 2));
+});
+});
}

criterion_group!(benches, criterion_benchmark);
28 changes: 18 additions & 10 deletions arrow-array/src/array/byte_view_array.rs
@@ -114,7 +114,7 @@ use super::ByteArrayType;
pub struct GenericByteViewArray<T: ByteViewType + ?Sized> {
data_type: DataType,
views: ScalarBuffer<u128>,
-buffers: Vec<Buffer>,
+buffers: Arc<[Buffer]>,
Member

What's the rationale for Arc<[Buffer]> vs Vec<Arc<Buffer>>?

Contributor

Cloning an Arc is relatively cheap (no allocation), cloning a Vec isn't.
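
A minimal sketch of that difference, for illustration only. `Buffer` here is a hypothetical stand-in (`Arc<Vec<u8>>`), not arrow's real `Buffer` type, and `sample_buffers` is an invented helper:

```rust
use std::sync::Arc;

// Hypothetical stand-in for arrow's `Buffer`: a cheaply clonable handle to bytes.
type Buffer = Arc<Vec<u8>>;

// Invented helper that builds a small buffer list for the demonstration.
fn sample_buffers() -> Vec<Buffer> {
    vec![Arc::new(vec![1, 2, 3]), Arc::new(vec![4, 5])]
}

fn main() {
    let as_vec = sample_buffers();
    let as_arc: Arc<[Buffer]> = sample_buffers().into();

    // Cloning the Vec allocates a new backing array and clones every handle in it.
    let vec_clone = as_vec.clone();
    assert_ne!(as_vec.as_ptr(), vec_clone.as_ptr());

    // Cloning the Arc only bumps a reference count; both handles share one slice.
    let arc_clone = Arc::clone(&as_arc);
    assert!(Arc::ptr_eq(&as_arc, &arc_clone));
    assert_eq!(Arc::strong_count(&as_arc), 2);
}
```

The cost of the `Vec` clone grows with the number of buffers, while the `Arc` clone stays constant.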

Member

I get it. However, if I understand correctly, Arc<[Buffer]> means the buffers can be passed around and shared only as a single slice, which can be limiting. For example, can I merge two arrays, combining their Arc<Buffer>s without moving or cloning the buffers?

Contributor

@alamb Oct 15, 2024

> Can I merge two arrays, combining their Arc<Buffer>s without moving or cloning the buffers?

No -- you would have to create a new Vec<Buffer> (or some other way to get Arc<[Buffer]>)

So while there are some cases where new allocations are required, slicing / cloning is faster
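
A rough sketch of what such a merge could look like. `Buffer` is again a hypothetical `Arc<Vec<u8>>` stand-in and `merge_buffers` is an invented helper, not an arrow API. The list itself is newly allocated, but each entry is only a cloned handle, so the underlying bytes are not copied:

```rust
use std::sync::Arc;

// Hypothetical stand-in for arrow's `Buffer`.
type Buffer = Arc<Vec<u8>>;

/// Invented helper: build a new list holding the buffers of `a` followed by those of `b`.
fn merge_buffers(a: &Arc<[Buffer]>, b: &Arc<[Buffer]>) -> Arc<[Buffer]> {
    // `Arc<[T]>` implements `FromIterator<T>`, so the iterator collects directly.
    a.iter().chain(b.iter()).cloned().collect()
}

fn main() {
    let left: Arc<[Buffer]> = vec![Arc::new(vec![1u8, 2])].into();
    let right: Arc<[Buffer]> = vec![Arc::new(vec![3u8])].into();

    let merged = merge_buffers(&left, &right);
    assert_eq!(merged.len(), 2);

    // The list is new, but every entry still points at the original bytes.
    assert!(Arc::ptr_eq(&merged[0], &left[0]));
    assert!(Arc::ptr_eq(&merged[1], &right[0]));
}
```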

Member

> you would have to create a new Vec<Buffer>

But that would prevent buffer sharing between two arrays, right?

> slicing / cloning is faster

Cloning, yes.

Slicing -- I didn't see it.

Contributor

I think the observation is that during StringViewArray::slice, the slice itself happens on the views, but the Vec of buffers (which the views can point at) must still be cloned.

Here is the clone of buffers: https://docs.rs/arrow-array/53.1.0/src/arrow_array/array/byte_view_array.rs.html#385
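
To make the slicing cost concrete, here is a simplified sketch. `ViewArray` and all of its fields and methods are hypothetical and much smaller than the real `GenericByteViewArray`, and `Buffer` is an `Arc<Vec<u8>>` stand-in:

```rust
use std::sync::Arc;

// Hypothetical stand-in for arrow's `Buffer`.
type Buffer = Arc<Vec<u8>>;

/// Hypothetical, simplified view array; not the real `GenericByteViewArray`.
struct ViewArray {
    views: Arc<[u128]>, // stand-in for `ScalarBuffer<u128>`
    offset: usize,
    len: usize,
    buffers: Arc<[Buffer]>,
}

impl ViewArray {
    fn new(views: Vec<u128>, buffers: Vec<Buffer>) -> Self {
        let len = views.len();
        Self { views: views.into(), offset: 0, len, buffers: buffers.into() }
    }

    /// Slice by narrowing the window over `views` and sharing `buffers`.
    fn slice(&self, offset: usize, len: usize) -> Self {
        assert!(offset + len <= self.len, "slice out of bounds");
        Self {
            views: Arc::clone(&self.views),
            offset: self.offset + offset,
            len,
            // With a `Vec<Buffer>` field this would be `self.buffers.clone()`,
            // which allocates a new Vec and clones every buffer handle.
            buffers: Arc::clone(&self.buffers),
        }
    }

    fn visible_views(&self) -> &[u128] {
        &self.views[self.offset..self.offset + self.len]
    }
}

fn main() {
    let array = ViewArray::new(vec![10, 20, 30, 40], vec![Arc::new(vec![0u8; 8])]);
    let sliced = array.slice(1, 2);
    assert_eq!(sliced.visible_views(), &[20u128, 30]);
    // The slice shares the parent's buffer list instead of copying it.
    assert!(Arc::ptr_eq(&array.buffers, &sliced.buffers));
}
```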

phantom: PhantomData<T>,
nulls: Option<NullBuffer>,
}
@@ -178,7 +178,7 @@ impl<T: ByteViewType + ?Sized> GenericByteViewArray<T> {
Ok(Self {
data_type: T::DATA_TYPE,
views,
-buffers,
+buffers: buffers.into(),
Contributor

I think we should take impl Into<Arc<[Buffer]>> in this method as well
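
A small sketch of why that signature is convenient for callers. `ViewArray::new` is a hypothetical constructor and `Buffer` an `Arc<Vec<u8>>` stand-in. Existing callers can keep passing a `Vec<Buffer>`, while callers that already hold an `Arc<[Buffer]>` pass it through without a new allocation:

```rust
use std::sync::Arc;

// Hypothetical stand-in for arrow's `Buffer`.
type Buffer = Arc<Vec<u8>>;

/// Hypothetical, simplified array holding only the buffer list.
struct ViewArray {
    buffers: Arc<[Buffer]>,
}

impl ViewArray {
    // Accepts anything convertible into `Arc<[Buffer]>`.
    fn new(buffers: impl Into<Arc<[Buffer]>>) -> Self {
        Self { buffers: buffers.into() }
    }
}

fn main() {
    let b: Buffer = Arc::new(vec![1, 2, 3]);

    // A Vec is converted, which allocates the shared slice once.
    let from_vec = ViewArray::new(vec![b.clone()]);
    assert_eq!(from_vec.buffers.len(), 1);

    // An existing Arc<[Buffer]> is moved in as-is (identity conversion).
    let shared: Arc<[Buffer]> = vec![b.clone()].into();
    let from_arc = ViewArray::new(Arc::clone(&shared));
    assert!(Arc::ptr_eq(&from_arc.buffers, &shared));
}
```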

nulls,
phantom: Default::default(),
})
@@ -191,14 +191,14 @@ impl<T: ByteViewType + ?Sized> GenericByteViewArray<T> {
/// Safe if [`Self::try_new`] would not error
pub unsafe fn new_unchecked(
views: ScalarBuffer<u128>,
-buffers: Vec<Buffer>,
+buffers: impl Into<Arc<[Buffer]>>,
nulls: Option<NullBuffer>,
) -> Self {
Self {
data_type: T::DATA_TYPE,
phantom: Default::default(),
views,
-buffers,
+buffers: buffers.into(),
nulls,
}
}
@@ -208,7 +208,7 @@ impl<T: ByteViewType + ?Sized> GenericByteViewArray<T> {
Self {
data_type: T::DATA_TYPE,
views: vec![0; len].into(),
-buffers: vec![],
+buffers: vec![].into(),
nulls: Some(NullBuffer::new_null(len)),
phantom: Default::default(),
}
@@ -234,7 +234,7 @@ impl<T: ByteViewType + ?Sized> GenericByteViewArray<T> {
}

/// Deconstruct this array into its constituent parts
-pub fn into_parts(self) -> (ScalarBuffer<u128>, Vec<Buffer>, Option<NullBuffer>) {
+pub fn into_parts(self) -> (ScalarBuffer<u128>, Arc<[Buffer]>, Option<NullBuffer>) {
Contributor

This is also a breaking change
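
For downstream code that still needs an owned `Vec<Buffer>` after such a change, one possible migration is sketched below. `buffers_to_vec` is an invented helper and `Buffer` an `Arc<Vec<u8>>` stand-in. The conversion allocates a new `Vec` and clones each handle, so it gives back the cost the PR tries to avoid, but it does not copy the underlying bytes:

```rust
use std::sync::Arc;

// Hypothetical stand-in for arrow's `Buffer`.
type Buffer = Arc<Vec<u8>>;

/// Invented helper: recover an owned `Vec<Buffer>` from a shared buffer list.
fn buffers_to_vec(buffers: Arc<[Buffer]>) -> Vec<Buffer> {
    // `to_vec` clones each element handle into a freshly allocated Vec.
    buffers.to_vec()
}

fn main() {
    let shared: Arc<[Buffer]> = vec![Arc::new(vec![1u8, 2]), Arc::new(vec![3u8])].into();
    let owned = buffers_to_vec(Arc::clone(&shared));

    assert_eq!(owned.len(), 2);
    // Each handle in the Vec still refers to the same bytes as the shared list.
    assert!(Arc::ptr_eq(&owned[0], &shared[0]));
    assert!(Arc::ptr_eq(&owned[1], &shared[1]));
}
```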

(self.views, self.buffers, self.nulls)
}

@@ -516,7 +516,7 @@ impl<T: ByteViewType + ?Sized> From<ArrayData> for GenericByteViewArray<T> {
Self {
data_type: T::DATA_TYPE,
views,
-buffers,
+buffers: buffers.into(),
nulls: value.nulls().cloned(),
phantom: Default::default(),
}
@@ -569,12 +569,20 @@ where
}

impl<T: ByteViewType + ?Sized> From<GenericByteViewArray<T>> for ArrayData {
-fn from(mut array: GenericByteViewArray<T>) -> Self {
+fn from(array: GenericByteViewArray<T>) -> Self {
let len = array.len();
-array.buffers.insert(0, array.views.into_inner());
+let new_buffers = {
+let mut buffers = Vec::with_capacity(array.buffers.len() + 1);
+buffers.push(array.views.into_inner());
+for buffer in array.buffers.iter() {
+buffers.push(buffer.clone());
+}
+buffers
+};
+
let builder = ArrayDataBuilder::new(T::DATA_TYPE)
.len(len)
-.buffers(array.buffers)
+.buffers(new_buffers)
.nulls(array.nulls);

unsafe { builder.build_unchecked() }
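
The buffer-list construction in this hunk (views buffer first, then every data buffer) can be sketched in isolation as follows. `prepend_views` is an invented helper, `Buffer` is an `Arc<Vec<u8>>` stand-in, and using `extend` in place of the explicit loop is only a stylistic alternative, not what the PR does:

```rust
use std::sync::Arc;

// Hypothetical stand-in for arrow's `Buffer`.
type Buffer = Arc<Vec<u8>>;

/// Invented helper: place the views buffer first, followed by clones of the data buffers.
fn prepend_views(views: Buffer, data: &[Buffer]) -> Vec<Buffer> {
    let mut out = Vec::with_capacity(data.len() + 1);
    out.push(views);
    // Equivalent to pushing `buffer.clone()` for each buffer in a loop.
    out.extend(data.iter().cloned());
    out
}

fn main() {
    let views: Buffer = Arc::new(vec![0u8; 16]);
    let data: Arc<[Buffer]> = vec![Arc::new(vec![1u8]), Arc::new(vec![2u8])].into();

    let all = prepend_views(Arc::clone(&views), &data);
    assert_eq!(all.len(), 3);
    assert!(Arc::ptr_eq(&all[0], &views));
    assert!(Arc::ptr_eq(&all[1], &data[0]));
    assert!(Arc::ptr_eq(&all[2], &data[1]));
}
```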