-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-37437: [C++] Fix MakeArrayOfNull for list array with large string values type #37467
GH-37437: [C++] Fix MakeArrayOfNull for list array with large string values type #37467
Conversation
…tring values type
// An empty list array can have 0 offsets | ||
const auto required_offsets = (data.length > 0) ? data.length + data.offset + 1 : 0; | ||
const auto offsets_byte_size = data.buffers[1]->size(); | ||
const auto required_offsets = ((data.length > 0) || (offsets_byte_size > 0)) | ||
? data.length + data.offset + 1 | ||
: 0; | ||
if (offsets_byte_size / static_cast<int32_t>(sizeof(offset_type)) < | ||
required_offsets) { | ||
return Status::Invalid("Offsets buffer size (bytes): ", offsets_byte_size, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pitrou do you know if this "empty list array can have 0 offsets" also applies to other types that have offsets, i.e. string types? (this was added in #6230)
In the current version of the PR, I am relying on ValidateFull to catch the error of creating a length-0 LargeString array with an offsets buffer of size 4, while it should be 8.
At the moment the offsets buffer size is not checked if the length of the array is 0, but I changed this to actually check it if the offset buffer has some length (so assuming that if there is an offsets buffer with a non-zero length, it should have a correct length, also if the array itself has length 0)
But if it's difficult to make guarantees about this / to change ValidateFull, I can also write a dedicated test for this specific case that doesn't rely on calling ValidateFull.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pitrou do you know if this "empty list array can have 0 offsets" also applies to other types that have offsets, i.e. string types? (this was added in #6230)
I don't think so. In any case, the allowance for list types is IMO historical.
At the moment the offsets buffer size is not checked if the length of the array is 0, but I changed this to actually check it if the offset buffer has some length (so assuming that if there is an offsets buffer with a non-zero length, it should have a correct length, also if the array itself has length 0)
That sounds like a good approach.
@@ -388,6 +388,7 @@ static std::vector<std::shared_ptr<DataType>> TestArrayUtilitiesAgainstTheseType | |||
large_utf8(), | |||
list(utf8()), | |||
list(int64()), // NOTE: Regression case for ARROW-9071/MakeArrayOfNull | |||
list(large_utf8()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also add some nested list types into the mix?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a few more nested combinations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, thank you!
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 1b262a2. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them. |
…tring values type (apache#37467) ### Rationale for this change `MakeArrayOfNull` for list type was assuming that the values child field didn't need to be considered, but those values could also require a minimum buffer size (eg for offsets) and which could be of greater size than the list offsets if those are int32 offsets. ### Are these changes tested? Yes * Closes: apache#37437 Authored-by: Joris Van den Bossche <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
…tring values type (apache#37467) ### Rationale for this change `MakeArrayOfNull` for list type was assuming that the values child field didn't need to be considered, but those values could also require a minimum buffer size (eg for offsets) and which could be of greater size than the list offsets if those are int32 offsets. ### Are these changes tested? Yes * Closes: apache#37437 Authored-by: Joris Van den Bossche <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
…tring values type (apache#37467) ### Rationale for this change `MakeArrayOfNull` for list type was assuming that the values child field didn't need to be considered, but those values could also require a minimum buffer size (eg for offsets) and which could be of greater size than the list offsets if those are int32 offsets. ### Are these changes tested? Yes * Closes: apache#37437 Authored-by: Joris Van den Bossche <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
Rationale for this change
MakeArrayOfNull
for list type was assuming that the values child field didn't need to be considered, but those values could also require a minimum buffer size (eg for offsets) and which could be of greater size than the list offsets if those are int32 offsets.Are these changes tested?
Yes