Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-44446: [C++][Python] Add mask creation helper #44447

Open
wants to merge 16 commits into
base: main
Choose a base branch
from

Conversation

amol-
Copy link
Member

@amol- amol- commented Oct 16, 2024

Rationale for this change

Implement a convenience method to create boolean masks where only some rows are set to true

What changes are included in this PR?

C++ MakeMaskArray factory function and pyarrow.mask Python factory function

Are these changes tested?

Yes both C++ and Python

Are there any user-facing changes?

Yes, new API has been added

@amol- amol- changed the title [GH-44446] Add mask creation helper GH-44446: Add mask creation helper Oct 16, 2024
@apache apache deleted a comment from github-actions bot Oct 16, 2024
Copy link

⚠️ GitHub issue #44446 has been automatically assigned in GitHub to PR creator.

@amol- amol- force-pushed the GH-44446 branch 2 times, most recently from f763a31 to d14836f Compare October 16, 2024 22:55
@kou kou changed the title GH-44446: Add mask creation helper GH-44446: [C++][Python] Add mask creation helper Oct 17, 2024
@amol-
Copy link
Member Author

amol- commented Oct 17, 2024

@pitrou Do you have time to review? I think this is more or less done

cpp/src/arrow/array/util.cc Outdated Show resolved Hide resolved
@@ -538,6 +538,51 @@ def repeat(value, size, MemoryPool memory_pool=None):
return pyarrow_wrap_array(c_array)


def mask(indices, length, MemoryPool memory_pool=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche Does this API look ok?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Late response, and the general API looks good, my one suggestion would be to use a bit more descriptive name. mask could also be interpret in the active sense ("mask those values"), and eg pandas has a method with that name that work in that sense. Some more explicit options: create_mask, make_mask, mask_from_indices, ..

This is essentially the counterpart for indices_nonzero I think? (which converts a mask into indices)
Could maybe mention that in a "See Also" section in the docstring

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Oct 17, 2024
python/pyarrow/array.pxi Outdated Show resolved Hide resolved
@@ -915,6 +916,29 @@ Result<std::shared_ptr<Array>> MakeEmptyArray(std::shared_ptr<DataType> type,
return builder->Finish();
}

Result<std::shared_ptr<Array>> MakeMaskArray(const std::vector<int64_t>& indices,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This std::vector could be a span<int64_t> for more flexibility. The selection vector could come from anywhere.

python/pyarrow/array.pxi Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Oct 21, 2024
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Nov 27, 2024
@amol-
Copy link
Member Author

amol- commented Dec 2, 2024

I think the two failures are unrelated to this branch.
For Java it seems to be a memory leak in FlightSQL

Error:    TestFlightSqlStreams.tearDown:238 » IllegalState Memory was leaked by query. Memory leaked: (81920)

And for R it seems related to datetimes

── Failure ('test-dplyr-funcs-datetime.R:302:5'): timestamp round trip correctly via strftime and strptime ──

@amol-
Copy link
Member Author

amol- commented Dec 3, 2024

When you have the chance I'd appreciate if you could re-review this one @pitrou 🙂

cpp/src/arrow/array/util.cc Outdated Show resolved Hide resolved
ARROW_ASSIGN_OR_RAISE(auto buffer, AllocateBitmap(length, pool));
bit_util::SetBitsTo(buffer->mutable_data(), 0, length, false);
for (int64_t i = 0; i < indices->length(); ++i) {
int64_t index = indices->Value(i);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The value could be null and nulls must be skipped.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The outer MakeMaskArray function already prevents values from being null, that's why it's not done in the Impl again.

Comment on lines +951 to +953
if (indices->null_count() > 0) {
return Status::Invalid("Indices array must not contain null values");
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. If it takes an Arrow array of indices it should be able to handle nulls. The loop can be specialized based on the result of indices->MayHaveNulls() so the common case doesn't have to check the validity bitmap of every iteration.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a reason why it would make sense to accept null values, it verifies if there are null values and rejects the array in case there are because there doesn't seem to be a case where it would make sense to have nulls in an array of indices.
If the user has an array containing nulls it can be used by passing it to compute.drop_null before using MakeMaskArray.

/// \param[in] pool the memory pool to allocate memory from
/// \return the resulting Array
ARROW_EXPORT
Result<std::shared_ptr<Array>> MakeMaskArray(const std::vector<int64_t>& indices,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this function doesn't need to take ownership of indices, it can accept a span<int64_t> instead of a std::vector<int64_t> & so both vectors or any contiguous buffer if int64_t can be passed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that it would be a span<const int64_t> then :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought Arrow was constrained to C++17, isn't span a C++20 addition?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's why we have a util::span backport.

cpp/src/arrow/array/util.cc Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Dec 6, 2024
Co-authored-by: Felipe Oliveira Carvalho <[email protected]>
@github-actions github-actions bot removed the awaiting changes Awaiting changes label Dec 9, 2024
Co-authored-by: Felipe Oliveira Carvalho <[email protected]>
@github-actions github-actions bot added awaiting change review Awaiting change review awaiting committer review Awaiting committer review awaiting review Awaiting review awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review awaiting change review Awaiting change review awaiting review Awaiting review labels Dec 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants