Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Implement LIST_VIEW and LARGE_LIST_VIEW support #710

Merged
merged 12 commits into from
Mar 4, 2025

Conversation

WillAyd
Copy link
Contributor

@WillAyd WillAyd commented Feb 4, 2025

No description provided.

@@ -630,7 +636,8 @@ enum ArrowBufferType {
NANOARROW_BUFFER_TYPE_DATA_OFFSET,
NANOARROW_BUFFER_TYPE_DATA,
NANOARROW_BUFFER_TYPE_VARIADIC_DATA,
NANOARROW_BUFFER_TYPE_VARIADIC_SIZE
NANOARROW_BUFFER_TYPE_VARIADIC_SIZE,
NANOARROW_BUFFER_TYPE_SIZE,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if we wanted to add a new buffer type for this or not - is there an official document upstream we follow for this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope...this is nanoarrow native! I think a new buffer type is a good idea (for the view offsets as well!)

child_length = array->children[0]->length;
NANOARROW_RETURN_NOT_OK(
ArrowBufferAppendInt64(ArrowArrayBuffer(array, 1), child_length));
if (private_data->storage_type == NANOARROW_TYPE_LARGE_LIST_VIEW) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally not sure how much we want the implementation of the list views tied to their respective plain list types. Here I've kept the cases group together and specialize within the branch, but also happy to move the view types to their own case statements

case NANOARROW_BUFFER_TYPE_SIZE:
NANOARROW_RETURN_NOT_OK(ArrowBufferAppendFill(buffer, 0, size_bytes * n));
continue;
case NANOARROW_BUFFER_TYPE_DATA_OFFSET: {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The existing offset type is tightly bound to the idea that you have n+1 elements in the buffer to provide all open and closed intervals. The list views do not require that, so I've specialized this here. But we could alternatively make a new buffer enum member for offsets that aren't double ended (?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want a new buffer type for this one because it has different rules than the existing DATA_OFFSET buffer (maybe NANOARROW_BUFFER_TYPE_VIEW_OFFSET?)

"Expected 1 child of list_view array but found 0 child arrays");
array.n_children = 1;

// Make sure final child size is checked at finish
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was just copied from the existing list types, where it is trivial to check the last offsets against the size of the array. The list view check I don't think is as simplistic - maybe we should just remove this from the validation?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we probably want the check, but because the check requires iterating along every item it should be checked in the ValidateFull rather than the ValidateDefault. That check is important for IPC because it prevents (if done correctly) out-of-bounds buffer access!

@@ -1555,6 +1555,149 @@ TEST(ArrayTest, ArrayTestAppendToLargeListArray) {
#endif
}

TEST(ArrayTest, ArrayTestAppendToListViewArray) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests were copy/paste from their list counterparts; in a follow up commit or PR I'd like to try and parametrize these

@WillAyd
Copy link
Contributor Author

WillAyd commented Feb 4, 2025

The current implementation here is really simplistic, and just continually appends children elements to the end of the array rather than trying to consolidate them in any way

@WillAyd WillAyd force-pushed the list-view-impl branch 7 times, most recently from c31955e to f1b41de Compare February 4, 2025 18:17
@WillAyd WillAyd marked this pull request as ready for review February 4, 2025 21:24
Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for taking this on! On the next round I will give this a check-out + run through 🙂

"Expected 1 child of list_view array but found 0 child arrays");
array.n_children = 1;

// Make sure final child size is checked at finish
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we probably want the check, but because the check requires iterating along every item it should be checked in the ValidateFull rather than the ValidateDefault. That check is important for IPC because it prevents (if done correctly) out-of-bounds buffer access!

case NANOARROW_BUFFER_TYPE_SIZE:
NANOARROW_RETURN_NOT_OK(ArrowBufferAppendFill(buffer, 0, size_bytes * n));
continue;
case NANOARROW_BUFFER_TYPE_DATA_OFFSET: {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want a new buffer type for this one because it has different rules than the existing DATA_OFFSET buffer (maybe NANOARROW_BUFFER_TYPE_VIEW_OFFSET?)

Comment on lines 802 to 803
offsets_buf = *ArrowArrayBuffer(array, 1);
sizes_buf = *ArrowArrayBuffer(array, 2);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you want to dereference a buffer pointer here (that will mean that the changes that you make to offsets_buf won't be syncrhonized back to the array!). I think struct ArrowBuffer* offsets_buf will work here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to change these but there shouldn't be any updates to the original data through these. Can also add const

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it! I think it's still a good change for future code readers (the ArrowBuffer should really never be copied by value!)

Comment on lines 826 to 827
offsets_buf = *ArrowArrayBuffer(array, 1);
sizes_buf = *ArrowArrayBuffer(array, 2);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these should be pointers as well

@@ -630,7 +636,8 @@ enum ArrowBufferType {
NANOARROW_BUFFER_TYPE_DATA_OFFSET,
NANOARROW_BUFFER_TYPE_DATA,
NANOARROW_BUFFER_TYPE_VARIADIC_DATA,
NANOARROW_BUFFER_TYPE_VARIADIC_SIZE
NANOARROW_BUFFER_TYPE_VARIADIC_SIZE,
NANOARROW_BUFFER_TYPE_SIZE,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope...this is nanoarrow native! I think a new buffer type is a good idea (for the view offsets as well!)

Comment on lines +875 to +876
// The current offset used to build list views
int64_t list_view_offset;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to just use the length of the child array instead of add a parameter here?

Copy link
Contributor Author

@WillAyd WillAyd Feb 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What we need to keep track of somehow is the last valid offset preceding the value currently being appended, so we can calculate the current_child_length - previous_child_length to get the size of the current element.

The private data is necessary in the current implementation because we aren't keeping track of the most recently valid offset when Null/Empty data is being appended. Those types of data are currently inserting an offset and size of 0 for simplicity.

We can certainly change that append to keep repeating the most recently valid offset value with a size of 0 for empty elements, but I think that would either still require us to use private data OR backtrack to the last element contained within a NANOARROW_BUFFER_TYPE_VIEW_OFFSET during _ArrowArrayAppendEmptyInternal

@paleolimbot
Copy link
Member

Just a note that I did see this and it's great! I'll try to review ASAP but it might be tomorrow or Sunday 😞

@WillAyd
Copy link
Contributor Author

WillAyd commented Feb 14, 2025

Not a problem. There is literally no rush on this at all. Doesn't need to be over the weekened

Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the turnaround here...great work and thank you for working on this!

A few comments here, and I think there are a few more test categories that need implementing for the new type:

ArrayView checks:

TEST(ArrayTest, ArrayViewTestList) {

SchemaInit checks:

TEST(SchemaTest, SchemaTestInitNestedList) {

SchemaViewInit checks:

TEST(SchemaViewTest, SchemaViewInitNestedList) {

Comment on lines 1178 to 1179

if ((array_view->storage_type == NANOARROW_TYPE_LIST_VIEW) &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can skip any check in default validation...the full validation check is the one that I would personally expect to run for the "check all elements" case and the first/last item size check that's here isn't something we do for lists or strings (unless I'm missing something or the full check somehow depends on this!)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be misunderstanding. To clarify further what I was thinking, lists and strings check the first/last offset in default validation, but have no size checking, given they do not have a dedicated size buffer. The list view repeats that same check for the offset buffer and extends it over to the size buffer.

So do you want to keep the first/last check of offsets for the list view in the default validation and move the size check to full validation? Or should we eliminate the size check altogether?

Copy link
Member

@paleolimbot paleolimbot Feb 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also might be misunderstanding 🙂

You do validate both the offsets and the sizes in the full validation check, and my point is that here the first and last offsets and/or sizes are not more or less important than any of the others so I think it makes sense to do them all once, in the same place (the full validation).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You do validate both the offsets and the sizes in the full validation check

I think you meant to say default validation check here, in which case understood and happy to move those checks to full. The only nuance there again is that it probably will make the test parametrization more complicated, as the list types keep the first/last offset checks in their default validation implementation and test

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it! I think the current diff looks great, thanks for bearing with me on that one!

Comment on lines 320 to 323
const int64_t zero_val = 0;
for (int64_t j = 0; j < n; j++) {
ArrowBufferAppendUnsafe(buffer, &zero_val, size_bytes);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ArrowBufferFill() like above?

Comment on lines 325 to 328
for (int64_t j = 0; j < n; j++) {
ArrowBufferAppendUnsafe(
buffer, buffer->data + size_bytes * (array->length + j - 1), size_bytes);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this just be zero? (If I remember correctly, the view offsets don't have to be in order, and this is an empty element?).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it can be. The only advantage I think to doing it this way is it aligns better with how the current LIST types work, possibly making the tests easier to parametrize.

We can cross that bridge when we get there though; for now setting to 0 should work fine

ArrowBufferAppendInt32(ArrowArrayBuffer(array, 1), last_valid_offset));
NANOARROW_RETURN_NOT_OK(ArrowBufferAppendInt32(
ArrowArrayBuffer(array, 2), (int32_t)child_length - last_valid_offset));
private_data->list_view_offset = (int32_t)child_length;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
private_data->list_view_offset = (int32_t)child_length;
private_data->list_view_offset = child_length;

@codecov-commenter
Copy link

codecov-commenter commented Feb 26, 2025

Codecov Report

Attention: Patch coverage is 34.14634% with 27 lines in your changes missing coverage. Please review.

Project coverage is 78.92%. Comparing base (18652fc) to head (c8e2add).
Report is 10 commits behind head on main.

Files with missing lines Patch % Lines
src/nanoarrow/testing/testing.cc 0.00% 14 Missing ⚠️
src/nanoarrow/common/inline_array.h 60.86% 1 Missing and 8 partials ⚠️
src/nanoarrow/common/inline_types.h 0.00% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #710      +/-   ##
==========================================
- Coverage   79.24%   78.92%   -0.33%     
==========================================
  Files         105      105              
  Lines       15065    15230     +165     
  Branches     1712     1732      +20     
==========================================
+ Hits        11939    12020      +81     
- Misses       2044     2120      +76     
- Partials     1082     1090       +8     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the turnaround again, but thank you!

@WillAyd WillAyd merged commit e77a656 into apache:main Mar 4, 2025
36 checks passed
@WillAyd WillAyd deleted the list-view-impl branch March 4, 2025 21:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants