feat: Implement LIST_VIEW and LARGE_LIST_VIEW support #710

WillAyd · 2025-02-04T01:23:01Z

No description provided.

WillAyd · 2025-02-04T01:24:03Z

src/nanoarrow/common/inline_types.h

@@ -630,7 +636,8 @@ enum ArrowBufferType {
  NANOARROW_BUFFER_TYPE_DATA_OFFSET,
  NANOARROW_BUFFER_TYPE_DATA,
  NANOARROW_BUFFER_TYPE_VARIADIC_DATA,
-  NANOARROW_BUFFER_TYPE_VARIADIC_SIZE
+  NANOARROW_BUFFER_TYPE_VARIADIC_SIZE,
+  NANOARROW_BUFFER_TYPE_SIZE,


Not sure if we wanted to add a new buffer type for this or not - is there an official document upstream we follow for this?

Nope...this is nanoarrow native! I think a new buffer type is a good idea (for the view offsets as well!)

WillAyd · 2025-02-04T01:29:16Z

src/nanoarrow/common/inline_array.h

      child_length = array->children[0]->length;
      NANOARROW_RETURN_NOT_OK(
          ArrowBufferAppendInt64(ArrowArrayBuffer(array, 1), child_length));
+      if (private_data->storage_type == NANOARROW_TYPE_LARGE_LIST_VIEW) {


Generally not sure how much we want the implementation of the list views tied to their respective plain list types. Here I've kept the cases group together and specialize within the branch, but also happy to move the view types to their own case statements

WillAyd · 2025-02-04T01:30:41Z

src/nanoarrow/common/inline_array.h

+      case NANOARROW_BUFFER_TYPE_SIZE:
+        NANOARROW_RETURN_NOT_OK(ArrowBufferAppendFill(buffer, 0, size_bytes * n));
+        continue;
+      case NANOARROW_BUFFER_TYPE_DATA_OFFSET: {


The existing offset type is tightly bound to the idea that you have n+1 elements in the buffer to provide all open and closed intervals. The list views do not require that, so I've specialized this here. But we could alternatively make a new buffer enum member for offsets that aren't double ended (?)

I think we want a new buffer type for this one because it has different rules than the existing DATA_OFFSET buffer (maybe NANOARROW_BUFFER_TYPE_VIEW_OFFSET?)

WillAyd · 2025-02-04T01:31:38Z

src/nanoarrow/common/array_test.cc

+               "Expected 1 child of list_view array but found 0 child arrays");
+  array.n_children = 1;
+
+  // Make sure final child size is checked at finish


This was just copied from the existing list types, where it is trivial to check the last offsets against the size of the array. The list view check I don't think is as simplistic - maybe we should just remove this from the validation?

I think we probably want the check, but because the check requires iterating along every item it should be checked in the ValidateFull rather than the ValidateDefault. That check is important for IPC because it prevents (if done correctly) out-of-bounds buffer access!

WillAyd · 2025-02-04T01:32:12Z

src/nanoarrow/common/array_test.cc

@@ -1555,6 +1555,149 @@ TEST(ArrayTest, ArrayTestAppendToLargeListArray) {
 #endif
 }

+TEST(ArrayTest, ArrayTestAppendToListViewArray) {


These tests were copy/paste from their list counterparts; in a follow up commit or PR I'd like to try and parametrize these

WillAyd · 2025-02-04T01:37:12Z

The current implementation here is really simplistic, and just continually appends children elements to the end of the array rather than trying to consolidate them in any way

paleolimbot

Thank you for taking this on! On the next round I will give this a check-out + run through 🙂

paleolimbot · 2025-02-06T20:16:28Z

src/nanoarrow/common/array_test.cc

+               "Expected 1 child of list_view array but found 0 child arrays");
+  array.n_children = 1;
+
+  // Make sure final child size is checked at finish


I think we probably want the check, but because the check requires iterating along every item it should be checked in the ValidateFull rather than the ValidateDefault. That check is important for IPC because it prevents (if done correctly) out-of-bounds buffer access!

paleolimbot · 2025-02-06T20:17:46Z

src/nanoarrow/common/inline_array.h

+      case NANOARROW_BUFFER_TYPE_SIZE:
+        NANOARROW_RETURN_NOT_OK(ArrowBufferAppendFill(buffer, 0, size_bytes * n));
+        continue;
+      case NANOARROW_BUFFER_TYPE_DATA_OFFSET: {


I think we want a new buffer type for this one because it has different rules than the existing DATA_OFFSET buffer (maybe NANOARROW_BUFFER_TYPE_VIEW_OFFSET?)

paleolimbot · 2025-02-06T20:21:42Z

src/nanoarrow/common/inline_array.h

+        offsets_buf = *ArrowArrayBuffer(array, 1);
+        sizes_buf = *ArrowArrayBuffer(array, 2);


I don't think you want to dereference a buffer pointer here (that will mean that the changes that you make to offsets_buf won't be syncrhonized back to the array!). I think struct ArrowBuffer* offsets_buf will work here?

Happy to change these but there shouldn't be any updates to the original data through these. Can also add const

Got it! I think it's still a good change for future code readers (the ArrowBuffer should really never be copied by value!)

paleolimbot · 2025-02-06T20:22:01Z

src/nanoarrow/common/inline_array.h

+        offsets_buf = *ArrowArrayBuffer(array, 1);
+        sizes_buf = *ArrowArrayBuffer(array, 2);


I think these should be pointers as well

paleolimbot · 2025-02-06T20:22:46Z

src/nanoarrow/common/inline_types.h

@@ -630,7 +636,8 @@ enum ArrowBufferType {
  NANOARROW_BUFFER_TYPE_DATA_OFFSET,
  NANOARROW_BUFFER_TYPE_DATA,
  NANOARROW_BUFFER_TYPE_VARIADIC_DATA,
-  NANOARROW_BUFFER_TYPE_VARIADIC_SIZE
+  NANOARROW_BUFFER_TYPE_VARIADIC_SIZE,
+  NANOARROW_BUFFER_TYPE_SIZE,


Nope...this is nanoarrow native! I think a new buffer type is a good idea (for the view offsets as well!)

paleolimbot · 2025-02-06T20:23:33Z

src/nanoarrow/common/inline_types.h

+  // The current offset used to build list views
+  int64_t list_view_offset;


Would it be possible to just use the length of the child array instead of add a parameter here?

What we need to keep track of somehow is the last valid offset preceding the value currently being appended, so we can calculate the current_child_length - previous_child_length to get the size of the current element.

The private data is necessary in the current implementation because we aren't keeping track of the most recently valid offset when Null/Empty data is being appended. Those types of data are currently inserting an offset and size of 0 for simplicity.

We can certainly change that append to keep repeating the most recently valid offset value with a size of 0 for empty elements, but I think that would either still require us to use private data OR backtrack to the last element contained within a NANOARROW_BUFFER_TYPE_VIEW_OFFSET during _ArrowArrayAppendEmptyInternal

paleolimbot · 2025-02-14T15:49:36Z

Just a note that I did see this and it's great! I'll try to review ASAP but it might be tomorrow or Sunday 😞

WillAyd · 2025-02-14T16:15:16Z

Not a problem. There is literally no rush on this at all. Doesn't need to be over the weekened

paleolimbot

Apologies for the turnaround here...great work and thank you for working on this!

A few comments here, and I think there are a few more test categories that need implementing for the new type:

ArrayView checks:

arrow-nanoarrow/src/nanoarrow/common/array_test.cc

Line 2804 in d368292

TEST(ArrayTest, ArrayViewTestList) {

SchemaInit checks:

arrow-nanoarrow/src/nanoarrow/common/schema_test.cc

Line 130 in d368292

TEST(SchemaTest, SchemaTestInitNestedList) {

SchemaViewInit checks:

arrow-nanoarrow/src/nanoarrow/common/schema_test.cc

Line 1309 in d368292

TEST(SchemaViewTest, SchemaViewInitNestedList) {

paleolimbot · 2025-02-22T15:38:48Z

src/nanoarrow/common/array.c

+
+      if ((array_view->storage_type == NANOARROW_TYPE_LIST_VIEW) &&


I think you can skip any check in default validation...the full validation check is the one that I would personally expect to run for the "check all elements" case and the first/last item size check that's here isn't something we do for lists or strings (unless I'm missing something or the full check somehow depends on this!)

I might be misunderstanding. To clarify further what I was thinking, lists and strings check the first/last offset in default validation, but have no size checking, given they do not have a dedicated size buffer. The list view repeats that same check for the offset buffer and extends it over to the size buffer.

So do you want to keep the first/last check of offsets for the list view in the default validation and move the size check to full validation? Or should we eliminate the size check altogether?

I also might be misunderstanding 🙂

You do validate both the offsets and the sizes in the full validation check, and my point is that here the first and last offsets and/or sizes are not more or less important than any of the others so I think it makes sense to do them all once, in the same place (the full validation).

You do validate both the offsets and the sizes in the full validation check

I think you meant to say default validation check here, in which case understood and happy to move those checks to full. The only nuance there again is that it probably will make the test parametrization more complicated, as the list types keep the first/last offset checks in their default validation implementation and test

Got it! I think the current diff looks great, thanks for bearing with me on that one!

paleolimbot · 2025-02-22T15:48:07Z

src/nanoarrow/common/inline_array.h

+          const int64_t zero_val = 0;
+          for (int64_t j = 0; j < n; j++) {
+            ArrowBufferAppendUnsafe(buffer, &zero_val, size_bytes);
+          }


ArrowBufferFill() like above?

paleolimbot · 2025-02-22T15:52:22Z

src/nanoarrow/common/inline_array.h

+          for (int64_t j = 0; j < n; j++) {
+            ArrowBufferAppendUnsafe(
+                buffer, buffer->data + size_bytes * (array->length + j - 1), size_bytes);
+          }


Can this just be zero? (If I remember correctly, the view offsets don't have to be in order, and this is an empty element?).

Yes it can be. The only advantage I think to doing it this way is it aligns better with how the current LIST types work, possibly making the tests easier to parametrize.

We can cross that bridge when we get there though; for now setting to 0 should work fine

paleolimbot · 2025-02-22T16:00:32Z

src/nanoarrow/common/inline_array.h

+          ArrowBufferAppendInt32(ArrowArrayBuffer(array, 1), last_valid_offset));
+      NANOARROW_RETURN_NOT_OK(ArrowBufferAppendInt32(
+          ArrowArrayBuffer(array, 2), (int32_t)child_length - last_valid_offset));
+      private_data->list_view_offset = (int32_t)child_length;


Suggested change

private_data->list_view_offset = (int32_t)child_length;

private_data->list_view_offset = child_length;

codecov-commenter · 2025-02-26T19:22:36Z

Codecov Report

Attention: Patch coverage is 34.14634% with 27 lines in your changes missing coverage. Please review.

Project coverage is 78.92%. Comparing base (18652fc) to head (c8e2add).
Report is 10 commits behind head on main.

Files with missing lines	Patch %	Lines
src/nanoarrow/testing/testing.cc	0.00%	14 Missing ⚠️
src/nanoarrow/common/inline_array.h	60.86%	1 Missing and 8 partials ⚠️
src/nanoarrow/common/inline_types.h	0.00%	4 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #710      +/-   ##
==========================================
- Coverage   79.24%   78.92%   -0.33%     
==========================================
  Files         105      105              
  Lines       15065    15230     +165     
  Branches     1712     1732      +20     
==========================================
+ Hits        11939    12020      +81     
- Misses       2044     2120      +76     
- Partials     1082     1090       +8

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

paleolimbot

Apologies for the turnaround again, but thank you!

WillAyd force-pushed the list-view-impl branch from efce682 to 332b591 Compare February 4, 2025 01:28

WillAyd commented Feb 4, 2025

View reviewed changes

WillAyd force-pushed the list-view-impl branch from 332b591 to b99e912 Compare February 4, 2025 01:35

WillAyd force-pushed the list-view-impl branch 7 times, most recently from c31955e to f1b41de Compare February 4, 2025 18:17

WillAyd marked this pull request as ready for review February 4, 2025 21:24

feat: Implement LIST_VIEW and LARGE_LIST_VIEW support

e78b618

WillAyd force-pushed the list-view-impl branch from f1b41de to e78b618 Compare February 5, 2025 14:35

paleolimbot reviewed Feb 6, 2025

View reviewed changes

WillAyd added 4 commits February 7, 2025 16:38

Dewey feedback

6324899

Clean up list_view_offset usage

b29ab56

Added full test validation

75ddddd

Simplifications

63bccb8

Remove extraneous check

69aa592

paleolimbot reviewed Feb 22, 2025

View reviewed changes

WillAyd added 6 commits February 25, 2025 15:13

Dewey feedback

359d578

Refactor and ArrayView tests

675490c

ListView Schema Tests

0bb461c

Move ListView validations to FULL

de5ef69

Warning cleanup

d455f99

Diff reduction

c8e2add

paleolimbot approved these changes Feb 28, 2025

View reviewed changes

WillAyd merged commit e77a656 into apache:main Mar 4, 2025
36 checks passed

WillAyd deleted the list-view-impl branch March 4, 2025 21:44

WillAyd mentioned this pull request Mar 6, 2025

Add support for list view type in C library #616

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Implement LIST_VIEW and LARGE_LIST_VIEW support #710

feat: Implement LIST_VIEW and LARGE_LIST_VIEW support #710

WillAyd commented Feb 4, 2025

WillAyd Feb 4, 2025

paleolimbot Feb 6, 2025

WillAyd Feb 4, 2025

WillAyd Feb 4, 2025

paleolimbot Feb 6, 2025

WillAyd Feb 4, 2025

paleolimbot Feb 6, 2025

WillAyd Feb 4, 2025

WillAyd commented Feb 4, 2025

paleolimbot left a comment

paleolimbot Feb 6, 2025

paleolimbot Feb 6, 2025

paleolimbot Feb 6, 2025

WillAyd Feb 7, 2025

paleolimbot Feb 8, 2025

paleolimbot Feb 6, 2025

paleolimbot Feb 6, 2025

paleolimbot Feb 6, 2025

WillAyd Feb 12, 2025 •

edited

Loading

paleolimbot commented Feb 14, 2025

WillAyd commented Feb 14, 2025

paleolimbot left a comment

paleolimbot Feb 22, 2025

WillAyd Feb 25, 2025

paleolimbot Feb 26, 2025 •

edited

Loading

WillAyd Feb 26, 2025

paleolimbot Feb 28, 2025

paleolimbot Feb 22, 2025

paleolimbot Feb 22, 2025

WillAyd Feb 25, 2025

paleolimbot Feb 22, 2025

codecov-commenter commented Feb 26, 2025 •

edited

Loading

paleolimbot left a comment

		offsets_buf = *ArrowArrayBuffer(array, 1);
		sizes_buf = *ArrowArrayBuffer(array, 2);

		// The current offset used to build list views
		int64_t list_view_offset;


		if ((array_view->storage_type == NANOARROW_TYPE_LIST_VIEW) &&

	private_data->list_view_offset = (int32_t)child_length;
	private_data->list_view_offset = child_length;

feat: Implement LIST_VIEW and LARGE_LIST_VIEW support #710

feat: Implement LIST_VIEW and LARGE_LIST_VIEW support #710

Conversation

WillAyd commented Feb 4, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

WillAyd commented Feb 4, 2025

paleolimbot left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

WillAyd Feb 12, 2025 • edited Loading

Choose a reason for hiding this comment

paleolimbot commented Feb 14, 2025

WillAyd commented Feb 14, 2025

paleolimbot left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

paleolimbot Feb 26, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

codecov-commenter commented Feb 26, 2025 • edited Loading

Codecov Report

paleolimbot left a comment

Choose a reason for hiding this comment

WillAyd Feb 12, 2025 •

edited

Loading

paleolimbot Feb 26, 2025 •

edited

Loading

codecov-commenter commented Feb 26, 2025 •

edited

Loading