Encapsulate `View` logic for `GenericByteViewArray` #5619

alamb · 2024-04-09T19:26:48Z

~~Draft PR as I would love to know if anyone has any comments about this idea / approach~~

Note this looks like a large PR but it is mostly comments and tests

Which issue does this PR close?

Part of #5374

Rationale for this change

While reviewing #5557 from @ariesdevil, it seemed like the 12-byte view length was getting hard-coded in many places, and the same code was repeated over and over, which makes the code harder to work with.

While encoding the number 12 isn't a huge deal, a larger deal is the cognitive overhead of working with ByteViewArrays while having to remember the layout of what the u128 represents. Adding an abstraction makes this easier, in my opinion.

What changes are included in this PR?

Encapsulate view manipulation in typesafe wrappers over u128

Add a View enum that captures which view is used
Add an OwnedView and OwnedViewBuilder for creating Views
Add InlineView and OffsetView
Rewrite some of the code to use the new View

Are there any user-facing changes?

Several new types, but all existing tests still pass

arrow-array/src/builder/generic_bytes_view_builder.rs

arrow-data/src/transform/mod.rs

alamb · 2024-05-06T13:08:54Z

arrow-array/src/array/byte_view_array.rs

 ///
-/// ```text


moved to View

alamb · 2024-05-06T13:13:31Z

arrow-array/src/builder/generic_bytes_view_builder.rs

-            if !flushed.is_empty() {
-                assert!(self.completed.len() < u32::MAX as usize);
-                self.completed.push(flushed.into());
+        let view: u128 = match OwnedView::from(v) {


I think the construction is now much clearer and encapsulated -- the bit manipulation is handled by OwnedView and OffsetViewBuilder
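For readers unfamiliar with the bit manipulation being encapsulated here, a minimal sketch of building the raw `u128` from a byte slice follows. It assumes the layout implied by the `byte_view_equal` code quoted later in this thread (length in the low 32 bits, then either up to 12 inline bytes, or a 4-byte prefix followed by a buffer index and an offset). The name `make_view` is hypothetical and is not the `OwnedView` API from this PR.

```rust
/// Hypothetical helper (not part of arrow-rs): build the raw `u128` view for `data`.
/// `buffer_index` and `offset` are only used when `data` does not fit inline.
fn make_view(data: &[u8], buffer_index: u32, offset: u32) -> u128 {
    assert!(data.len() <= u32::MAX as usize);
    let len = data.len() as u32;

    let mut bytes = [0u8; 16];
    // Bytes 0..4: the length, little-endian.
    bytes[0..4].copy_from_slice(&len.to_le_bytes());
    if data.len() <= 12 {
        // Inline variant: the data itself, zero-padded to 12 bytes.
        bytes[4..4 + data.len()].copy_from_slice(data);
    } else {
        // Offset variant: 4-byte prefix, then buffer index, then offset.
        bytes[4..8].copy_from_slice(&data[..4]);
        bytes[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        bytes[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    u128::from_le_bytes(bytes)
}
```

Callers never need to know the number 12 or the field positions; they pass the bytes and, for long values, where those bytes were appended.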

alamb · 2024-05-06T13:14:47Z

arrow-data/src/byte_view.rs

+}
+impl<'a> From<&'a u128> for View<'a> {
+    #[inline(always)]
+    fn from(v: &'a u128) -> Self {


this is the key API for a View -- dispatching to the correct variant
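A rough sketch of what such a dispatch could look like is shown below. It is a simplification under stated assumptions: it takes the `u128` by value rather than by reference (the PR uses `From<&'a u128>`), and the variant fields are illustrative rather than the PR's `InlineView` / `OffsetView` types.

```rust
/// Views with at most this many bytes store their data inline.
const MAX_INLINE_LEN: u32 = 12;

/// Hypothetical decoded form of a raw `u128` view (field names are assumptions).
#[derive(Debug, Clone, Copy, PartialEq)]
enum View {
    /// Length <= 12: the bytes live inside the view itself (zero-padded).
    Inline { len: u32, data: [u8; 12] },
    /// Length > 12: a 4-byte prefix plus the location of the full data.
    Offset { len: u32, prefix: [u8; 4], buffer_index: u32, offset: u32 },
}

impl From<u128> for View {
    fn from(raw: u128) -> Self {
        // The low 32 bits hold the length; this is the only place that checks it.
        let len = raw as u32;
        if len <= MAX_INLINE_LEN {
            let mut data = [0u8; 12];
            data.copy_from_slice(&raw.to_le_bytes()[4..16]);
            View::Inline { len, data }
        } else {
            View::Offset {
                len,
                prefix: ((raw >> 32) as u32).to_le_bytes(),
                buffer_index: (raw >> 64) as u32,
                offset: (raw >> 96) as u32,
            }
        }
    }
}
```

With a type like this, downstream code can `match` on the variant instead of repeating the `len <= 12` test and the shifts.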

alamb · 2024-05-06T13:20:31Z

arrow-data/src/byte_view.rs

+    }
+}
+
+/// A view for data where the variable length data has 12 or fewer bytes. See


If this PR is merged, I will make a follow-on PR to remove the remaining uses of ByteView and deprecate this struct

alamb · 2024-05-06T13:21:02Z

arrow-data/src/transform/mod.rs

@@ -178,13 +178,17 @@ fn build_extend_view(array: &ArrayData, buffer_offset: u32) -> Extend {
            mutable
                .buffer1
                .extend(views[start..start + len].iter().map(|v| {


Here is another example of how the magic 12 number gets extracted out

ariesdevil · 2024-05-06T13:40:18Z

arrow-data/src/byte_view.rs

+/// # Notes
+/// Equality is based on the bitwise value of the view, not the data it logically points to
+#[derive(Debug, Copy, Clone, PartialEq)]
+pub enum View<'a> {


Do we need an into_owned that returns OwnedView?

Good idea, will add it

in fe510ef

alamb · 2024-05-06T13:59:58Z

CI is failing due to #5725 and #5719

But otherwise I think this PR is ready for review

ariesdevil

Clear and neat

alamb · 2024-05-06T15:17:40Z

Thanks for the review @ariesdevil

tustvold

I had a brief look at this and I think this could be made a fair bit simpler, in particular I don't really understand the need for separate borrowed/owning variants when u128 is copy.

However, that does lead to my biggest concern, that the formulation using an enumeration forces a branch where it isn't strictly necessary. For example, prefix comparison can be done without needing to check the length at all. This makes me wonder if just a simple ByteView(u128) would suffice to encapsulate the short-string logic, or something...
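To illustrate the branch-free prefix comparison mentioned above, here is a minimal sketch. It assumes that inline views are zero-padded, so the low 64 bits of any view always hold the length followed by the first four data bytes. The function name is hypothetical.

```rust
/// Hypothetical helper: compare the length and the first four bytes of two
/// views without checking which variant (inline or offset) either one is.
fn len_and_prefix_eq(l: u128, r: u128) -> bool {
    // Truncating to u64 keeps bits 0..32 (length) and 32..64 (4-byte prefix).
    l as u64 == r as u64
}
```

If this returns `false` the two values cannot be equal, and no length check was needed to find that out.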

I personally would feel more comfortable introducing such abstractions after we have a decent set of kernels with accompanying benchmarks, so we can empirically reason about the performance implications of such a change, along with having examples to inform the API design. As it stands this feels like a touch premature, given it really just avoids encoding the number 12 in various places 😅

Edit: I created #5735 which might provide a simpler way to achieve the same end

tustvold · 2024-05-08T06:12:44Z

arrow-data/src/byte_view.rs

+/// # Notes
+/// Equality is based on the bitwise value of the view, not the data it logically points to
+#[derive(PartialEq)]
+pub enum OwnedView {


What is this type adding, it feels like it isn't entirely necessary?

Its use case is to create a view from &str / &[u8], copy the relevant prefix bytes, and remember which variant (inline or offset) the new view was during creation

tustvold · 2024-05-08T06:13:06Z

arrow-data/src/byte_view.rs

+/// # Notes
+/// Equality is based on the bitwise value of the view, not the data it logically points to
+#[derive(Debug, Copy, Clone, PartialEq)]
+pub enum View<'a> {


This should probably be ByteView to avoid confusion with the list view types

There is already another struct named ByteView in this file, so in order to avoid an API change I didn't reuse the name. If we don't care about API changes we could remove the existing ByteView

tustvold · 2024-05-08T06:13:51Z

arrow-data/src/byte_view.rs

+    pub fn as_u128(self) -> &'a u128 {
+        self.0


Suggested change

-    pub fn as_u128(self) -> &'a u128 {
-        self.0
+    pub fn as_u128(self) -> u128 {
+        *self.0

tustvold · 2024-05-08T06:14:34Z

arrow-data/src/byte_view.rs

+/// Equality is based on the bitwise value of the view, not the data it
+/// logically points to
+#[derive(Copy, Clone, PartialEq)]
+pub struct InlineView<'a>(&'a u128);


I wonder if we even need this borrow, u128 is copy and removing the indirection might help LLVM not be stupid
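As a sketch of the by-value alternative suggested here, an inline view could wrap the `u128` directly. The type below reuses the name `InlineView` for illustration only; its methods (`try_new`, `len`, `to_bytes`) are assumptions and not the PR's API.

```rust
/// Hypothetical by-value wrapper: `u128` is `Copy`, so no borrow is needed.
#[derive(Debug, Clone, Copy, PartialEq)]
struct InlineView(u128);

impl InlineView {
    /// Returns `None` if the view's length says the data is not stored inline.
    fn try_new(raw: u128) -> Option<Self> {
        if raw as u32 <= 12 {
            Some(Self(raw))
        } else {
            None
        }
    }

    /// The raw view, returned by value rather than by reference.
    fn as_u128(self) -> u128 {
        self.0
    }

    /// Number of data bytes stored in the view.
    fn len(self) -> usize {
        self.0 as u32 as usize
    }

    /// Copies the inline data out. A by-value wrapper cannot hand out a
    /// `&[u8]` tied to the original array, so this sketch allocates.
    fn to_bytes(self) -> Vec<u8> {
        self.0.to_le_bytes()[4..4 + self.len()].to_vec()
    }
}
```

The trade-off is visible in `to_bytes`: dropping the borrow removes an indirection, but accessors that return slices need a different shape.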

alamb

> I had a brief look at this and I think this could be made a fair bit simpler, in particular I don't really understand the need for separate borrowed/owning variants when u128 is copy.

That was to encapsulate copying the inlined bytes and retain the information about which type it was (without having to check the length again)

> However, that does lead to my biggest concern, that the formulation using an enumeration forces a branch where it isn't strictly necessary. For example, prefix comparison can be done without needing to check the length at all.

I don't understand this assertion. In my mind the whole point of the View abstraction is to encapsulate the check for length, as the two different variants need different handling. For special cases like prefix comparison we could certainly add specialized functions like View::compare_prefix.

> I personally would feel more comfortable introducing such abstractions after we have a decent set of kernels with accompanying benchmarks, so we can empirically reason about the performance implications of such a change, along with having examples to inform the API design. As it stands this feels like a touch premature, given it really just avoids encoding the number 12 in various places 😅

I think the kernels and usage patterns we have are sufficient. There are several other uses of ByteView that I didn't port over in this PR to keep it smaller (such as

arrow-rs/arrow-data/src/equal/byte_view.rs

Lines 20 to 70 in 4045fb5

    
pub(super) fn byte_view_equal(
    lhs: &ArrayData,
    rhs: &ArrayData,
    lhs_start: usize,
    rhs_start: usize,
    len: usize,
) -> bool {
    let lhs_views = &lhs.buffer::<u128>(0)[lhs_start..lhs_start + len];
    let lhs_buffers = &lhs.buffers()[1..];
    let rhs_views = &rhs.buffer::<u128>(0)[rhs_start..rhs_start + len];
    let rhs_buffers = &rhs.buffers()[1..];

    for (idx, (l, r)) in lhs_views.iter().zip(rhs_views).enumerate() {
        // Only checking one null mask here because by the time the control flow reaches
        // this point, the equality of the two masks would have already been verified.
        if lhs.is_null(idx) {
            continue;
        }

        let l_len_prefix = *l as u64;
        let r_len_prefix = *r as u64;
        // short-circuit, check length and prefix
        if l_len_prefix != r_len_prefix {
            return false;
        }

        let len = l_len_prefix as u32;
        // for inline storage, only need check view
        if len <= 12 {
            if l != r {
                return false;
            }
            continue;
        }

        // check buffers
        let l_view = ByteView::from(*l);
        let r_view = ByteView::from(*r);

        let l_buffer = &lhs_buffers[l_view.buffer_index as usize];
        let r_buffer = &rhs_buffers[r_view.buffer_index as usize];

        // prefixes are already known to be equal; skip checking them
        let len = len as usize - 4;
        let l_offset = l_view.offset as usize + 4;
        let r_offset = r_view.offset as usize + 4;
        if l_buffer[l_offset..l_offset + len] != r_buffer[r_offset..r_offset + len] {
            return false;
        }
    }
    true

) but I can do so to show how it would work
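The per-element logic of the function quoted above can be restated as a standalone sketch, which may help when reading the discussion. This is not code from the PR: `view_eq` is a hypothetical name, the data buffers are modelled as plain `Vec<u8>` rather than Arrow `Buffer`s, and null handling is left out.

```rust
/// Hypothetical stand-in for the per-element comparison in `byte_view_equal`.
/// Each side passes its raw view plus the data buffers that view may point into.
fn view_eq(l: u128, l_buffers: &[Vec<u8>], r: u128, r_buffers: &[Vec<u8>]) -> bool {
    // Low 64 bits: length (32 bits) followed by the first four data bytes.
    if l as u64 != r as u64 {
        return false;
    }
    let len = l as u32 as usize;
    if len <= 12 {
        // Inline storage: the whole value is in the view, so compare the views.
        return l == r;
    }
    // Offset storage: bits 64..96 are the buffer index, bits 96..128 the offset.
    let l_data = &l_buffers[(l >> 64) as u32 as usize][(l >> 96) as u32 as usize..][..len];
    let r_data = &r_buffers[(r >> 64) as u32 as usize][(r >> 96) as u32 as usize..][..len];
    // The first four bytes were already compared as part of the prefix.
    l_data[4..] == r_data[4..]
}
```

Two long views can be logically equal even when their raw `u128` values differ, because the same bytes may live at different buffer indexes and offsets.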

> Edit: I created #5735 which might provide a simpler way to achieve the same end

I don't understand how this solves the same problem but I probably don't fully understand it

alamb · 2024-05-08T11:27:03Z

arrow-data/src/byte_view.rs

+/// # Notes
+/// Equality is based on the bitwise value of the view, not the data it logically points to
+#[derive(Debug, Copy, Clone, PartialEq)]
+pub enum View<'a> {


There is already another struct named ByteView in this file, so in order to avoid an API change I didn't reuse the name. If we don't care about API changes we could remove the existing ByteView

alamb · 2024-05-08T11:29:05Z

arrow-data/src/byte_view.rs

+/// # Notes
+/// Equality is based on the bitwise value of the view, not the data it logically points to
+#[derive(PartialEq)]
+pub enum OwnedView {


Its use case is to create a view from &str / &[u8], copy the relevant prefix bytes, and remember which variant (inline or offset) the new view was during creation

alamb · 2024-05-08T11:41:49Z

In order to move this PR forward, I plan to do the following:

Run (and create if necessary) benchmarks for operations
Remove the remaining uses of ByteView in favor of this new View abstraction to see what it looks like

alamb · 2024-05-08T12:09:14Z

> As it stands this feels like a touch premature, given it really just avoids encoding the number 12 in various places 😅

I updated the description of this PR to make the rationale clearer.

While encoding the number 12 isn't a huge deal in my mind, the core rationale for this PR is to reduce the cognitive overhead of working with ByteViewArrays. Without an abstraction such as View, you have to remember what the u128 represents and both its variants.

While some people have the necessary time to invest in understanding the lowest level representation, I think it is a barrier to contribution (as well as using the library). I have seen evidence of this challenge in a few examples such as #5557 to build up the u128 from the parquet data pages as well as #5707 (comment) in #5707.

While there might be other explanations, I think an abstraction like this will make ByteViewArrays much easier to work with

tustvold · 2024-05-16T10:28:39Z

Marking as a draft pending #5736

alamb · 2024-05-29T18:19:29Z

I believe we are taking a different approach -- see #5736 and #5796

github-actions bot added the arrow Changes to the arrow crate label Apr 9, 2024

alamb changed the title ~~WIP: Encapsulate View logic more~~ WIP: Encapsulate Short/Long View logic in StringViewArray Apr 9, 2024

alamb mentioned this pull request Apr 9, 2024

feat: support reading and writing StringView and BinaryView in parquet (part 2) #5557

Closed

alamb mentioned this pull request Apr 26, 2024

Support casting StringArray/BinaryArray --> StringView / BinaryView #5686

Merged

alamb force-pushed the alamb/views_struct branch from dd93198 to 653065b Compare May 6, 2024 13:02

alamb changed the title ~~WIP: Encapsulate Short/Long View logic in StringViewArray~~ Encapsulate View logic for GenericByteViewArray May 6, 2024

ariesdevil reviewed May 6, 2024

Encapsulate View manipulation

ffbf53b

alamb force-pushed the alamb/views_struct branch from 8656ea8 to ffbf53b Compare May 6, 2024 13:44

Add View::to_owned()

fe510ef

alamb marked this pull request as ready for review May 6, 2024 13:59

alamb mentioned this pull request May 6, 2024

[EPIC] Implement StringViewArray and BinaryViewArray #5374

Closed

31 tasks

ariesdevil approved these changes May 6, 2024

alamb mentioned this pull request May 6, 2024

DataFusion weekly project plan (Andrew Lamb) - May 6, 2024 apache/datafusion#10395

Closed

7 tasks

tustvold reviewed May 8, 2024

tustvold mentioned this pull request May 8, 2024

Add ByteView::try_new #5735

Closed

tustvold mentioned this pull request May 8, 2024

Structured ByteView Access (underlying StringView/BinaryView representation) #5736

Closed

alamb mentioned this pull request May 13, 2024

DataFusion weekly project plan (Andrew Lamb) - May 13, 2024 apache/datafusion#10482

Closed

8 tasks

tustvold marked this pull request as draft May 16, 2024 10:28

alamb closed this May 29, 2024
