feat: add __subclass__ parameter for high-precedence non-record nominal types #2540
Custom strings will still need to have behaviour overloads for things like I think this is sensible. Another approach would be to have fallbacks, so that the array behaviours are used if the name doesn't define them. Doing that would mean that ufuncs operating upon a mixture of custom and default strings would work out of the box. I think that is undesirable; mixing types is generally not well advised. There are more options here, but these are the main two that I can think of. |
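To make the two options in this comment concrete, here is a minimal, library-independent sketch. The registry layout and the names `overloads`, `base_of`, `resolve_strict`, and `resolve_with_fallback` are hypothetical and are not Awkward Array's actual overload machinery. The sketch only assumes that overloads are keyed by an operation name plus the nominal type names of both operands.

```python
# Hypothetical registry: (operation, left type name, right type name) -> implementation.
overloads = {
    ("equal", "string", "string"): lambda a, b: a == b,
}

# Hypothetical mapping from a custom nominal name to the built-in type it is based on.
base_of = {"IPAddress": "string"}


def resolve_strict(op, left, right):
    """Option 1: a custom name must register its own overloads; no fallback."""
    return overloads.get((op, left, right))


def resolve_with_fallback(op, left, right):
    """Option 2: fall back to the built-in overload when the custom name has none."""
    found = overloads.get((op, left, right))
    if found is None:
        # Replace each custom name by its base type and try again.
        found = overloads.get((op, base_of.get(left, left), base_of.get(right, right)))
    return found
```

With the fallback, a mixed ("IPAddress", "string") comparison silently resolves to the plain string overload, which is the out-of-the-box mixing the comment describes as undesirable. The strict variant returns nothing for the same request.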
The only thing I'm hesitant about is the name __name__. This will become an API forever, and if I were an outsider, I might think that __name__ is a display kind of thing, not a functional kind of thing. ("What would you like to call this array?") That is, a kind of documentation.
I suggested some alternatives like __behavior__, __overload__, and __override__. If it's one of the latter two, it will be hard to remember which, since they're usually synonymous. __behavior__ may be too general, since __record__ also affects the behavior.
__array_behavior__ would narrow into exactly what this is about (though it begs the question as to why __record__ is not __record_behavior__).
Maybe __derivation__ or __derived__? Or __subclass__ because it specifically puts subclasses of ak.Array onto the data and it's about making customized strings/bytes and customized categoricals.[1]
A long, long, long time ago, the word for this was going to be a "dressed" type. As opposed to undressed types. That seemed a little weird, though.
Before this goes into the wild, let's make sure we're not going to regret the name. If we pick "__subclass__", swapping all the "__name__" for "__subclass__" in this PR will be a quick find-and-replace.
Footnotes
[1] __subclass__ is my current favorite.
Yes, this is unfortunate. I think the constraint at fault here is the existing use of
One thing to account for is that there is a reason to use If we chose instead to make |
Title changed from "__name__ parameter for high-precedence non-record nominal types" to "__subclass__ parameter for high-precedence non-record nominal types"
I've changed the PR to use That said, we don't currently prevent users from setting |
In fact, you switched mid-sentence between Oh! Setting awkward/src/awkward/contents/content.py Line 95 in 5543a01
This wouldn't be the first long-range invariant: strings need to have |
Yes, we will be rendering this parameter like
This is the part that I want to make sure we're on the same page about. I originally wanted to make the change Given that we're not making this change, however, in my mind I'm most fond of an |
This is how I've been understanding the intended future behavior (with some newfound clarity, having written and rewritten this comment many times):
Therefore, if we have the following,

```python
import awkward as ak
import numpy as np


class IPAddress(bytes):  # okay, this has to subclass from bytes, not ak.Array
    def do_an_ip_address_thing(self, arg):
        # compute something for a single IP address, like b'\xc0\x80\x01\x00' (192.128.1.0)
        ...

    def __repr__(self):
        return ".".join(map(str, np.frombuffer(self, "u1")))


class IPAddressArray(ak.Array):
    def do_an_ip_address_thing(self, arg):
        # compute something for an array of IP addresses (vectorized version of the above)
        ...


ak.behavior["IPAddress"] = IPAddress
ak.behavior["*", "IPAddress"] = IPAddressArray
```

on a listnode-containing-numpynode-of-uint8 layout with Now that I think of it, there's no reason for Old:
New:
For overloads, we lose the In the example above, I included ak.behavior["*", "IPAddress"] = IPAddressArray which I just realized is possible, furthering the symmetry between Also, the class assigned through Footnotes
|
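The symmetry between the scalar key and the ("*", name) array key in the IPAddress example can be tried out without Awkward Array. The following is a rough sketch only. The plain `behavior` dict, the `list`-based `IPAddressArray`, and the `wrap` helper are hypothetical stand-ins, not how `ak.behavior` or `ak.Array` actually resolve classes.

```python
class IPAddress(bytes):
    """Scalar view of one IPv4 address stored as four raw bytes."""

    def __repr__(self):
        # Iterating over bytes yields integers in 0..255.
        return ".".join(str(octet) for octet in self)


class IPAddressArray(list):  # stand-in for an ak.Array subclass (assumption)
    def first_octets(self):
        # "Array-level" method: operates on every address at once.
        return [address[0] for address in self]


# Plain dict standing in for ak.behavior (assumption).
behavior = {}
behavior["IPAddress"] = IPAddress            # class for single items
behavior["*", "IPAddress"] = IPAddressArray  # class for arrays of those items


def wrap(name, chunks, behavior):
    """Hypothetical helper: pick scalar and array classes by name, with plain defaults."""
    scalar_cls = behavior.get(name, bytes)
    array_cls = behavior.get(("*", name), list)
    return array_cls(scalar_cls(chunk) for chunk in chunks)


addresses = wrap("IPAddress", [b"\xc0\x80\x01\x00", b"\x7f\x00\x00\x01"], behavior)
```

Here `addresses` is an `IPAddressArray` whose items are `IPAddress` objects, so array-level and item-level methods come from two separately registered classes under the same name.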
Purelist array classes
Currently the
OK, this aligns with my general feeling! This relaxation means we can speak about |
Dropping auxiliary behavior classes
A problem with having Here are the methods for
There is a pattern that is increasingly becoming clear to me: we're trying to move away from the behavior system as a mechanism for implementing core features. Instead, I think we're reframing behaviors as intended to override built-in features, not to implement them directly? Moving methods to
|
Purelist array classes
Yes, ak.behavior["*", "IPAddress"] = IPAddressArray would be a new thing to make |
It's easier to split these things into multiple comments... Though the above is subject to change (you might not like the direction!), I'm going to summarise the history / where my head is at with all of this.
(5) requires that we add built-in support to our existing overload mechanisms to recognise char/byte. I'm writing all of this in a vacuum, though. We currently allow users to define ufunc overloads for unnamed strings. I'm assuming that we can find data to show that this is not important enough to protect. |
Would they need this? I think it would be better if |
Remember that those

```
% fgrep -r StringBehavior src/awkward --include="*.py"
src/awkward/behaviors/string.py:class ByteStringBehavior(Array):
src/awkward/behaviors/string.py:class StringBehavior(Array):
```

They're defined, but not ever instantiated. #2528 was not a mistake. String representations, like

```python
>>> ak.Array(
...     ak.contents.ListOffsetArray(
...         ak.index.Index64([0, 3, 6]),
...         ak.contents.NumpyArray(
...             np.array([72, 65, 76, 73, 66, 77], "u1"),
...             parameters={"__array__": "char"},
...         ),
...         parameters={"__array__": "string"},
...     )
... )
<Array ['HAL', 'IBM'] type='2 * string'>
```

are generated in a hard-coded way, not through the awkward/src/awkward/highlevel.py Lines 950 to 958 in 5543a01
and in awkward/src/awkward/_prettyprint.py Lines 43 to 49 in 5543a01
(because the pretty-printer bypasses So there aren't methods currently defined in |
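To illustrate what "hard-coded" string generation amounts to for the layout shown above, here is a small pure-Python sketch. It is not the code in highlevel.py or _prettyprint.py. The function name `decode_strings` is hypothetical, and the sketch assumes UTF-8 "char" data delimited by list offsets.

```python
def decode_strings(offsets, chars):
    """Slice a flat uint8 buffer by list offsets and decode each slice as UTF-8."""
    buffer = bytes(chars)
    return [
        buffer[start:stop].decode("utf-8")
        for start, stop in zip(offsets[:-1], offsets[1:])
    ]


# Same offsets and character codes as the ListOffsetArray example above.
strings = decode_strings([0, 3, 6], [72, 65, 76, 73, 66, 77])
print(strings)  # ['HAL', 'IBM']
```

Nothing in this path consults a behavior class: the string interpretation follows directly from the layout's parameters, which is the point being made about the unused StringBehavior classes.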
|
Right, but the same is not true of
I agree. However, if we have |
Yes, we are on the same page about the 3 points above.
Number 4 is a new point, and I think we're in agreement that it's a good idea. I don't understand the comment about needing to make our existing overload mechanisms recognize strings. The overload mechanisms will (future) involve only
I had to re-read this to understand it, and I think we're on the same page about number 5 as well.
I understand that this would be breaking any current uses of the |
I see that

```
% fgrep -r CharBehavior src/awkward --include="*.py"
src/awkward/behaviors/string.py:class CharBehavior(Array):
src/awkward/behaviors/string.py: if isinstance(other, (str, CharBehavior)):
src/awkward/behaviors/string.py: if isinstance(other, (str, CharBehavior)):
src/awkward/behaviors/string.py: if isinstance(other, (str, CharBehavior)):
src/awkward/behaviors/string.py: behavior["char"] = CharBehavior
% fgrep -r ByteBehavior src/awkward --include="*.py"
src/awkward/behaviors/string.py:class ByteBehavior(Array):
src/awkward/behaviors/string.py: if isinstance(other, (bytes, ByteBehavior)):
src/awkward/behaviors/string.py: if isinstance(other, (bytes, ByteBehavior)):
src/awkward/behaviors/string.py: if isinstance(other, (bytes, ByteBehavior)):
src/awkward/behaviors/string.py: behavior["byte"] = ByteBehavior
```

If On the other hand, ak.to_layout("hello") is the other direction: it must return a char array that can be coerced to string, but isn't a string. |
I should say "char" / "byte". We implement certain features via behavior overloads (ufuncs), which would need to be folded into the core ufunc machinery if we don't treat "char" as a nominal type. |
Following from our meeting, we decided to introduce a new If we change the content rules such that |
Closing in favour of a new PR. |
This PR closes #2432 by making it possible to define custom types for strings (and categoricals). Now, __array__ represents both nominal type and implementation type (string, categorical), and a new __subclass__ parameter represents a higher-precedence nominal type.
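The precedence rule in this description can be sketched as a lookup over a node's parameters. This is only an illustration under stated assumptions. `nominal_type` and `implementation_type` are hypothetical helper names, and since this PR was closed in favour of a new one, the sketch does not describe behavior that shipped in Awkward Array.

```python
def implementation_type(parameters):
    """The implementation type (e.g. "string", "categorical") always comes from __array__."""
    return (parameters or {}).get("__array__")


def nominal_type(parameters):
    """The nominal type: __subclass__ takes precedence, otherwise fall back to __array__."""
    parameters = parameters or {}
    subclass = parameters.get("__subclass__")
    if subclass is not None:
        return subclass
    return parameters.get("__array__")


plain = {"__array__": "string"}
custom = {"__array__": "string", "__subclass__": "IPAddress"}
```

For `plain`, both helpers return "string". For `custom`, the implementation type is still "string" while the nominal type used for class lookup becomes "IPAddress".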