Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove From impls that copy data into arrow_buffer::Buffer #6033

Closed
alamb opened this issue Jul 10, 2024 · 5 comments · Fixed by #6039
Closed

Remove From impls that copy data into arrow_buffer::Buffer #6033

alamb opened this issue Jul 10, 2024 · 5 comments · Fixed by #6039
Assignees
Labels
api-change Changes to the arrow API enhancement Any new improvement worthy of a entry in the changelog

Comments

@alamb
Copy link
Contributor

alamb commented Jul 10, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

@XiangpengHao found this on #6031 and I am breaking it into its own ticket

let block_id = output.append_block(self.buf.clone().into());

I thought the into() will convert the Bytes into array_buffer::Buffer() without copying the data.

However, self.buf is bytes::Bytes, not arrow_buffer::Bytes (they confusingly having the same name). The consequence is that the code above will use this From impl

impl<T: AsRef<[u8]>> From<T> for Buffer {
fn from(p: T) -> Self {
// allocate aligned memory buffer
let slice = p.as_ref();
let len = slice.len();
let mut buffer = MutableBuffer::new(len);
buffer.extend_from_slice(slice);
buffer.into()
}
}

Which results in a copy of data

To avoid the copy, the code is changed to

        // Here we convert `bytes::Bytes` into `arrow_buffer::Bytes`, which is zero copy
        // Then we convert `arrow_buffer::Bytes` into `arrow_buffer:Buffer`, which is also zero copy
        let bytes = arrow_buffer::Buffer::from_bytes(self.data.clone().into());
        let block_id = output.append_block(bytes);

Describe the solution you'd like
We should do something to prevent future very subtle mistakes (and possibly remove other instances of such mistakes in arrow-rs)

Describe alternatives you've considered
One thing we could do is to remove any From impls for Buffer that copy data and instead use an explicit function name. This would make it more explicit when copying was occuring, providing more control and making mistakes such as fixed in #6031 less likely

Something like this, for example

impl Buffer { 
  /// Creating a `Buffer` instance by copying the memory from a `AsRef<[u8]>` into a newly
  /// allocated memory region.
  pub fn new_from_slice<T: AsRef<[u8]>>(data: T) -> Self { 
   ...
}

Additional context

@alamb alamb added the enhancement Any new improvement worthy of a entry in the changelog label Jul 10, 2024
@tustvold
Copy link
Contributor

1000% this, the current hacks of using Buffer::from_vec, etc... are counter-intuitive and surprising.

I had hoped that we would get specialization and be able to fix it this way, but sadly it looks like that is stuck in limbo. As such a breaking change to remove the blanket impl makes sense to me.

@tustvold tustvold added the api-change Changes to the arrow API label Jul 10, 2024
@XiangpengHao
Copy link
Contributor

Take

@alamb
Copy link
Contributor Author

alamb commented Jul 10, 2024

Take

🤔 we should probably bring the take workflow into this repo from datafusion: https://github.com/apache/datafusion/blob/main/.github/workflows/take.yml

@tustvold
Copy link
Contributor

Reopening as #6039 just updates some APIs, #6043 contains the breaking change

@alamb
Copy link
Contributor Author

alamb commented Jan 1, 2025

I believe we have completed this issue now #6043 is merged

@alamb alamb closed this as completed Jan 1, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants