-
Notifications
You must be signed in to change notification settings - Fork 847
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ByteView::try_new #5735
Add ByteView::try_new #5735
Conversation
/// | ||
/// If `v` instead contains the binary data inline, returns an `Err` containing it | ||
#[inline] | ||
pub fn try_new(v: &u128) -> Result<Self, &[u8]> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea would then be that we would remove the From<u128>
implementation, which would be a breaking change but the next release is going to be breaking anyway (and I suspect few people are relying on this API yet)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what most confuses me about ByteView as a struct is that it doesn't represent in Rust anywhere the different layouts of the u128
s in ByteViewArrays
For example, if you look at the rust ByteView
struct without consulting the arrow spec, you may come to the conclusion that the u128
s in a ByteViewArray have this format, which is not the case and thus you need to
- Know to check "is length less than 12" and if so handle things specially (this API helps here by encapsulating that check for certain cases)
- Know how to construct a view from bytes (aka how much of the prefix to copy and where.
This API seems to improve things (though I think if we went this way we should expand its docstring to explain the difference between the two types of byte views)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
represent in Rust anywhere the different layouts of the u128s in ByteViewArrays
Correct, it represents the non-inlined case where you have a view, and not just a short inlined byte array.
This is consistent with the terminology used in the docs - https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-view-layout
And with the terminology for ListView and LargeListView, where a view represents a view into a separate buffer of data
this API helps here by encapsulating that check for certain cases
What cases does it not encapsulate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct, it represents the non-inlined case where you have a view, and not just a short inlined byte array.
I see -- in my mind the combination of (length, inlined data)
is also a "view" but I can see how you have a different interpretation (perhaps if you view the types as (length, inline)
or (length, view)
🤔 )
The diference I am thinking about is two layouts shown here (described as "view structures")
However, there is a single ByteView
rust struct (that corresponds to "Long strings")
What cases does it not encapsulate?
One case is creating the u128
initially (e.g. if we should copy 4 bytes or up to 12)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One case is creating the u128 initially (e.g. if we should copy 4 bytes or up to 12)
Right, but does this need to be its own type, or could it be a free function? I don't know the answer to this, yet, but I would always take no abstraction over a bad abstraction
I updated one call site to show how it works, I believe it achieves the same end as the linked PR whilst being significantly simpler |
let data = self.buffers.get_unchecked(view.buffer_index as usize); | ||
let offset = view.offset as usize; | ||
data.get_unchecked(offset..offset + len as usize) | ||
let b = match ByteView::try_new(v) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can see here how try_new encapsulates the logic for interpreting the u128
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this PR @tustvold
It seems to me the key difference between the existing ByteView
approach (that this PR extends) and the approach in #5619 is an explicy Rust API for manipulating / accessing the inline variant of the u128
vies.
I think this PR improves the usability of ByteView
but I still think #5619 (or another approach that models the two types of views as separate Rust structs) is easier to understand
/// | ||
/// If `v` instead contains the binary data inline, returns an `Err` containing it | ||
#[inline] | ||
pub fn try_new(v: &u128) -> Result<Self, &[u8]> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what most confuses me about ByteView as a struct is that it doesn't represent in Rust anywhere the different layouts of the u128
s in ByteViewArrays
For example, if you look at the rust ByteView
struct without consulting the arrow spec, you may come to the conclusion that the u128
s in a ByteViewArray have this format, which is not the case and thus you need to
- Know to check "is length less than 12" and if so handle things specially (this API helps here by encapsulating that check for certain cases)
- Know how to construct a view from bytes (aka how much of the prefix to copy and where.
This API seems to improve things (though I think if we went this way we should expand its docstring to explain the difference between the two types of byte views)
superceded by #5796 |
Which issue does this PR close?
Closes #.
Rationale for this change
Potentially simpler version of #5619
What changes are included in this PR?
Are there any user-facing changes?