GH-44446: [C++][Python] Add mask creation helper #44447
base: main
f763a31 to d14836f
@pitrou Do you have time to review? I think this is more or less done
@@ -538,6 +538,51 @@ def repeat(value, size, MemoryPool memory_pool=None):
    return pyarrow_wrap_array(c_array)


def mask(indices, length, MemoryPool memory_pool=None):
@jorisvandenbossche Does this API look ok?
Late response, and the general API looks good. My one suggestion would be to use a slightly more descriptive name. mask could also be interpreted in the active sense ("mask those values"), and e.g. pandas has a method with that name that works in that sense. Some more explicit options: create_mask, make_mask, mask_from_indices, ...
This is essentially the counterpart for indices_nonzero, I think? (which converts a mask into indices)
Could maybe mention that in a "See Also" section in the docstring.
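To make the "counterpart" relationship concrete, here is a minimal pure-Python sketch of the two directions. The names mask_from_indices and indices_from_mask are hypothetical stand-ins chosen for illustration (they are not the functions added by this PR, and they use plain lists rather than Arrow arrays), and the bounds check is an assumption about desirable behaviour rather than something this thread specifies.

```python
from typing import Iterable, List


def mask_from_indices(indices: Iterable[int], length: int) -> List[bool]:
    """Return a list of `length` booleans, True at each position in `indices`.

    Hypothetical stand-in for the proposed helper, not the PR's code.
    """
    mask = [False] * length
    for index in indices:
        # Assumption: negative or out-of-range indices are rejected.
        if not 0 <= index < length:
            raise IndexError(f"index {index} out of bounds for length {length}")
        mask[index] = True
    return mask


def indices_from_mask(mask: Iterable[bool]) -> List[int]:
    """Opposite direction: positions of the True entries, in ascending order."""
    return [position for position, flag in enumerate(mask) if flag]


example = mask_from_indices([1, 4], 6)
# example == [False, True, False, False, True, False]
round_trip = indices_from_mask(example)
# round_trip == [1, 4]
```

The round trip only returns the original indices when they are unique and sorted; duplicates collapse into a single True entry.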
@@ -915,6 +916,29 @@ Result<std::shared_ptr<Array>> MakeEmptyArray(std::shared_ptr<DataType> type,
  return builder->Finish();
}

Result<std::shared_ptr<Array>> MakeMaskArray(const std::vector<int64_t>& indices,
This std::vector could be a span<int64_t> for more flexibility. The selection vector could come from anywhere.
I think the two failures are unrelated to this branch.
And for R it seems related to datetimes
When you have the chance I'd appreciate if you could re-review this one @pitrou 🙂
  ARROW_ASSIGN_OR_RAISE(auto buffer, AllocateBitmap(length, pool));
  bit_util::SetBitsTo(buffer->mutable_data(), 0, length, false);
  for (int64_t i = 0; i < indices->length(); ++i) {
    int64_t index = indices->Value(i);
The value could be null and nulls must be skipped.
The outer MakeMaskArray function already prevents values from being null, that's why it's not done in the Impl again.
  if (indices->null_count() > 0) {
    return Status::Invalid("Indices array must not contain null values");
  }
Hmm. If it takes an Arrow array of indices it should be able to handle nulls. The loop can be specialized based on the result of indices->MayHaveNulls() so the common case doesn't have to check the validity bitmap on every iteration.
I don't see a reason to accept null values. The function checks for nulls and rejects the array if it finds any, because there doesn't seem to be a case where nulls in an array of indices would make sense.
If the user has an array containing nulls, they can pass it through compute.drop_null before calling MakeMaskArray.
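For readers comparing the two positions in this thread, the following pure-Python sketch puts both policies side by side: rejecting nulls up front (what the PR does) versus skipping them in a loop specialized on whether nulls are present (what the reviewer suggests). Everything here is illustrative: make_mask and its skip_nulls flag are hypothetical names, None stands in for an Arrow null, and the any(...) scan replaces what would be a cheap metadata check on a real Arrow array.

```python
from typing import List, Optional, Sequence


def make_mask(indices: Sequence[Optional[int]], length: int,
              skip_nulls: bool = False) -> List[bool]:
    """Build a boolean mask; None entries play the role of Arrow nulls.

    Hypothetical sketch, not the PR's implementation. No bounds checking
    is done here, so negative indices would wrap as usual for Python lists.
    """
    # Stand-in for a null-count / MayHaveNulls style check.
    has_nulls = any(index is None for index in indices)
    if has_nulls and not skip_nulls:
        # Policy used in the PR: refuse index arrays that contain nulls.
        raise ValueError("Indices array must not contain null values")

    mask = [False] * length
    if has_nulls:
        # Slow path: test validity of every element and skip the nulls.
        for index in indices:
            if index is not None:
                mask[index] = True
    else:
        # Fast path for the common case: no per-element validity test.
        for index in indices:
            mask[index] = True
    return mask


strict = make_mask([0, 2], 4)
# strict == [True, False, True, False]
lenient = make_mask([0, None, 3], 4, skip_nulls=True)
# lenient == [True, False, False, True]
# Caller-side alternative: drop the nulls first, then use the strict policy.
dropped = make_mask([i for i in [0, None, 3] if i is not None], 4)
# dropped == [True, False, False, True]
```

With the strict policy the caller decides explicitly what a null index means; with the lenient one that decision is made inside the helper.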
/// \param[in] pool the memory pool to allocate memory from
/// \return the resulting Array
ARROW_EXPORT
Result<std::shared_ptr<Array>> MakeMaskArray(const std::vector<int64_t>& indices,
As this function doesn't need to take ownership of indices, it can accept a span<int64_t> instead of a std::vector<int64_t>&, so either a vector or any contiguous buffer of int64_t can be passed.
Note that it would be a span<const int64_t> then :)
I thought Arrow was constrained to C++17, isn't span a C++20 addition?
Yes, that's why we have a util::span backport.
Co-authored-by: Felipe Oliveira Carvalho <[email protected]>
Rationale for this change
Implement a convenience method to create boolean masks where only some rows are set to true
What changes are included in this PR?
C++ MakeMaskArray factory function and pyarrow.mask Python factory function
Are these changes tested?
Yes, both C++ and Python
Are there any user-facing changes?
Yes, new API has been added