GH-17682: [C++][Python] Bool8 Extension Type Implementation #43488
C++ part LGTM.
Just one question, but looks good otherwise.
python/pyarrow/array.pxi
Outdated
def to_numpy(self, zero_copy_only=True, writable=False):
    try:
        return self.storage.to_numpy().view(np.bool_)
    except ArrowInvalid as e:
        if zero_copy_only:
            raise e

    return _pc().not_equal(self.storage, 0).to_numpy(zero_copy_only=zero_copy_only, writable=writable)
I'm a little confused by _pc().not_equal(self.storage, 0). Isn't this creating a copy? Wasn't the purpose of bool8 to allow zero-copy with numpy?
Hi @westonpace. Yes, the default path for the to_numpy() method is to enforce zero-copy behavior, which is achieved by the line return self.storage.to_numpy().view(np.bool_). The zero_copy_only kwarg can optionally be set to False, which relaxes this requirement.
The line you indicated does create a copy, but it will only be reached if zero_copy_only is False AND the original attempt at a zero-copy view failed.
And in practice, this code path gets reached if there are missing values?
Yes, correct. The outcomes of taking the various paths are demonstrated in this test.
This also matches the existing semantics of converting a normal boolean array to numpy, which currently performs a copy to an array of dtype=np.object_ if there are any missing values.
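For readers following the thread, the three outcomes described above can be sketched with plain numpy. This is only an illustrative stand-in, not the pyarrow implementation: the function name storage_to_numpy, the valid mask argument, and the ValueError are assumptions made for this sketch (the PR code raises ArrowInvalid and works on Arrow arrays).

```python
import numpy as np


def storage_to_numpy(storage, valid=None, zero_copy_only=True):
    """Hypothetical stand-in for the bool8 to_numpy() logic, using numpy only.

    storage: int8 array holding the raw bool8 values.
    valid:   optional bool mask; False marks a missing value (assumption).
    """
    if valid is None or valid.all():
        # Fast path: reinterpret the same bytes as bool, no copy involved.
        return storage.view(np.bool_)
    if zero_copy_only:
        # A plain bool view has no way to represent missing values.
        raise ValueError("zero-copy conversion is not possible with missing values")
    # Fallback: copy into an object array so missing slots can hold None.
    out = (storage != 0).astype(object)
    out[~valid] = None
    return out


storage = np.array([1, 0, 1], dtype=np.int8)
mask = np.array([True, True, False])

view = storage_to_numpy(storage)                            # zero-copy view
copied = storage_to_numpy(storage, mask, zero_copy_only=False)  # object copy
```

Calling storage_to_numpy(storage, mask) with the default zero_copy_only=True raises instead of silently copying, which mirrors the behavior discussed in this thread.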
Got it. Thanks for the explanation!
Thank you for this, this is such an excellent addition ❤️
@joellubi added some quick comments, but generally looking good! Still need to check the tests
python/pyarrow/array.pxi
Outdated
buf = foreign_buffer(obj.ctypes.data, obj.size)
return Array.from_buffers(bool8(), obj.size, [None, buf])
This would lose track of the buffer owner (the numpy array obj), so you would need to pass that to the foreign_buffer function as the base argument.
However, I think we could also simplify this by first creating a pyarrow storage array of int8, and then using self.from_storage() instead of using from_buffers()?
I gave this a try and it works if the numpy array has dtype=np.int8:
np_arr = np.array([1, 0, 1], dtype=np.int8)
pa_storage_arr = pa.array(np_arr, type=pa.int8())
pa_bool8_arr = pa.ExtensionArray.from_storage(pa.bool8(), pa_storage_arr)
This does not produce any copies. The existing approach of using foreign_buffer also works with np_arr = np.array([True, False, True], dtype=np.bool_) without making a copy.
However, using the pa.array() constructor currently does make a copy when going bool -> int8. I think this would require a zero-copy casting kernel to be added to C++. That seems like it would be a better approach, I just have to wrap my head around that part of the code.
CC: @felipecrv does this sound right ^?
Actually now that I think about it I don't think a casting kernel is what's needed in this specific scenario since that goes between Arrow types and we're not trying to convert Arrow Boolean to Arrow Int8. I think what we need is to reinterpret the numpy bool as a numpy int8, then continue the same way as above for the int8 arrow array. I'll give that a try now.
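A minimal sketch of the reinterpretation step described here, using numpy only. The variable names are illustrative; the resulting int8 view would then be handed to pa.array(..., type=pa.int8()) and from_storage() as in the earlier snippet, which this sketch does not run.

```python
import numpy as np

# A numpy bool array, as a caller might pass in.
np_bool = np.array([True, False, True], dtype=np.bool_)

# Reinterpret the same bytes as int8. view() changes only the dtype,
# so no data is copied and both arrays share one buffer.
np_int8 = np_bool.view(np.int8)

assert np_int8.dtype == np.int8
assert np.shares_memory(np_bool, np_int8)
```

Because both arrays share one buffer, the original numpy array still needs to stay alive for as long as anything built on the view is in use.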
Ok I pushed up the change, let me know what you think.
Yes, that looks good!
@pitrou I'll update that table in a follow-up PR. I made edits to it in #43679, so the addition will be easier once that PR has merged.
@pitrou @jorisvandenbossche Any more comments on the C++ or Python sides respectively, or does this look ok to merge?
  return ss.str();
}

std::string Bool8Type::Serialize() const { return ""; }
Emm why is this?
This is what's specified in "description of the serialization" for Bool8.
This method is generally used to encode type parameters, but for bool8 there are no parameters. The type is fully defined by its name and storage type.
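To illustrate the point about parameterless types, here is a toy model in plain Python. It is not the Arrow C++ or pyarrow API: the class ParameterlessExtensionType and its serialize/deserialize methods are hypothetical, and rejecting non-empty metadata on deserialization is an assumption of this sketch.

```python
class ParameterlessExtensionType:
    """Hypothetical toy model of an extension type with no parameters."""

    extension_name = "arrow.bool8"
    storage_type = "int8"

    def serialize(self) -> bytes:
        # Nothing to encode: the name and storage type fully define the type.
        return b""

    @classmethod
    def deserialize(cls, storage_type: str, serialized: bytes):
        # Assumption for this sketch: anything other than int8 storage
        # with empty metadata is rejected.
        if storage_type != cls.storage_type:
            raise ValueError(f"expected storage type {cls.storage_type}")
        if serialized != b"":
            raise ValueError("expected empty serialized metadata")
        return cls()


# Round trip: the empty payload is enough to reconstruct the type.
restored = ParameterlessExtensionType.deserialize(
    "int8", ParameterlessExtensionType().serialize()
)
```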
Looks good!
I added a bunch more comments, but they are all just minor formatting / testing nits
python/pyarrow/types.pxi
Outdated
unknown_col: [[True, False, True, True, null]]
unknown_col: [[-1,0,1,2,null]]
Sidenote: this is a good illustration of why we should ideally have a way to let the extension type control this string representation.
This is a great point and certainly something I would have liked to have when going through this implementation. I'll open an issue for it.
We already have #36648 covering that I think
def test_bool8_scalar():
    assert pa.ExtensionScalar.from_storage(pa.bool8(), -1).as_py()
Something I didn't think about in the previous round, but it might be better to test the value explicitly in this case, instead of relying on python's general truthiness:
assert pa.ExtensionScalar.from_storage(pa.bool8(), -1).as_py()
assert pa.ExtensionScalar.from_storage(pa.bool8(), -1).as_py() is True
Because otherwise this test doesn't actually ensure that the result is True or False. If we were still returning the underlying storage of 0, 1, 2 etc, those tests would also pass in their current form.
(same for the lines below)
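The difference between the two assertion styles can be seen in plain Python, without pyarrow. The variable names below are illustrative only.

```python
# A raw storage value, as as_py() would return it if no conversion happened.
storage_value = -1

# A truthiness-based assertion passes for any non-zero integer,
# so it cannot tell -1 apart from a real True.
assert storage_value

# An identity check distinguishes the two: -1 is truthy but is not True.
raw_is_true = storage_value is True

# After an explicit conversion to bool, the identity check holds.
converted = bool(storage_value)
assert converted is True
```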
Good idea, it reads a lot clearer now too.
def test_bool8_scalar():
    assert not pa.ExtensionScalar.from_storage(pa.bool8(), 0).as_py()
Thanks for adding that support!
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 5258819. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 26 possible false positives for unstable benchmarks that are known to sometimes produce them.
Rationale for this change
C++ and Python implementations of #43234
What changes are included in this PR?
Bool8Type, Bool8Array, Bool8Scalar, and tests
Are these changes tested?
Yes
Are there any user-facing changes?
Bool8 extension type will be available in C++ and Python libraries