Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-37437: [C++] Fix MakeArrayOfNull for list array with large string values type #37467

Merged
merged 5 commits into from
Oct 5, 2023

Conversation

jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Aug 30, 2023

Rationale for this change

MakeArrayOfNull for list type was assuming that the values child field didn't need to be considered, but those values could also require a minimum buffer size (eg for offsets) and which could be of greater size than the list offsets if those are int32 offsets.

Are these changes tested?

Yes

Comment on lines 715 to 722
// An empty list array can have 0 offsets
const auto required_offsets = (data.length > 0) ? data.length + data.offset + 1 : 0;
const auto offsets_byte_size = data.buffers[1]->size();
const auto required_offsets = ((data.length > 0) || (offsets_byte_size > 0))
? data.length + data.offset + 1
: 0;
if (offsets_byte_size / static_cast<int32_t>(sizeof(offset_type)) <
required_offsets) {
return Status::Invalid("Offsets buffer size (bytes): ", offsets_byte_size,
Copy link
Member Author

@jorisvandenbossche jorisvandenbossche Aug 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pitrou do you know if this "empty list array can have 0 offsets" also applies to other types that have offsets, i.e. string types? (this was added in #6230)

In the current version of the PR, I am relying on ValidateFull to catch the error of creating a length-0 LargeString array with an offsets buffer of size 4, while it should be 8.
At the moment the offsets buffer size is not checked if the length of the array is 0, but I changed this to actually check it if the offset buffer has some length (so assuming that if there is an offsets buffer with a non-zero length, it should have a correct length, also if the array itself has length 0)

But if it's difficult to make guarantees about this / to change ValidateFull, I can also write a dedicated test for this specific case that doesn't rely on calling ValidateFull.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pitrou do you know if this "empty list array can have 0 offsets" also applies to other types that have offsets, i.e. string types? (this was added in #6230)

I don't think so. In any case, the allowance for list types is IMO historical.

At the moment the offsets buffer size is not checked if the length of the array is 0, but I changed this to actually check it if the offset buffer has some length (so assuming that if there is an offsets buffer with a non-zero length, it should have a correct length, also if the array itself has length 0)

That sounds like a good approach.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Aug 30, 2023
@@ -388,6 +388,7 @@ static std::vector<std::shared_ptr<DataType>> TestArrayUtilitiesAgainstTheseType
large_utf8(),
list(utf8()),
list(int64()), // NOTE: Regression case for ARROW-9071/MakeArrayOfNull
list(large_utf8()),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also add some nested list types into the mix?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a few more nested combinations

@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Sep 29, 2023
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Oct 5, 2023
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, thank you!

@pitrou pitrou merged commit 1b262a2 into apache:main Oct 5, 2023
34 checks passed
@pitrou pitrou removed the awaiting change review Awaiting change review label Oct 5, 2023
@github-actions github-actions bot added the awaiting committer review Awaiting committer review label Oct 5, 2023
@jorisvandenbossche jorisvandenbossche deleted the gh-37437-makearrayofnulls branch October 5, 2023 15:31
@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 1b262a2.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them.

JerAguilon pushed a commit to JerAguilon/arrow that referenced this pull request Oct 23, 2023
…tring values type (apache#37467)

### Rationale for this change

`MakeArrayOfNull` for list type was assuming that the values child field didn't need to be considered, but those values could also require a minimum buffer size (eg for offsets) and which could be of greater size than the list offsets if those are int32 offsets.

### Are these changes tested?

Yes

* Closes: apache#37437

Authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…tring values type (apache#37467)

### Rationale for this change

`MakeArrayOfNull` for list type was assuming that the values child field didn't need to be considered, but those values could also require a minimum buffer size (eg for offsets) and which could be of greater size than the list offsets if those are int32 offsets.

### Are these changes tested?

Yes

* Closes: apache#37437

Authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…tring values type (apache#37467)

### Rationale for this change

`MakeArrayOfNull` for list type was assuming that the values child field didn't need to be considered, but those values could also require a minimum buffer size (eg for offsets) and which could be of greater size than the list offsets if those are int32 offsets.

### Are these changes tested?

Yes

* Closes: apache#37437

Authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
2 participants