Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add __subclass__ parameter for high-precedence non-record nominal types #2540

Closed
wants to merge 11 commits into from

Conversation

agoose77
Copy link
Collaborator

@agoose77 agoose77 commented Jun 23, 2023

This PR closes #2432 by making it possible to define custom types for strings (and categoricals). Now, __array__ represents both nominal type and implementation type (string, categorical), and a new __subclass__ parameter represents a higher-precedence nominal type.

@codecov
Copy link

codecov bot commented Jun 23, 2023

Codecov Report

Merging #2540 (7e079c2) into main (7b22fd1) will decrease coverage by 0.01%.
The diff coverage is 89.15%.

Additional details and impacted files
Impacted Files Coverage Δ
src/awkward/_parameters.py 82.95% <57.14%> (+0.19%) ⬆️
src/awkward/contents/content.py 76.11% <60.00%> (+0.08%) ⬆️
src/awkward/_behavior.py 93.22% <96.55%> (-0.33%) ⬇️
src/awkward/_connect/numpy.py 94.44% <100.00%> (-0.10%) ⬇️
src/awkward/contents/recordarray.py 85.15% <100.00%> (ø)
src/awkward/highlevel.py 76.82% <100.00%> (ø)

... and 1 file with indirect coverage changes

@agoose77 agoose77 temporarily deployed to docs-preview June 23, 2023 12:34 — with GitHub Actions Inactive
@agoose77 agoose77 marked this pull request as ready for review June 23, 2023 16:29
@agoose77 agoose77 requested a review from jpivarski June 23, 2023 16:29
@agoose77
Copy link
Collaborator Author

agoose77 commented Jun 23, 2023

Custom strings will still need to have behaviour overloads for things like np.equal, but the layout-level logic for handling strings should work with custom and default strings.

I think this is sensible. Another approach would be to have fallbacks, so that the array behaviours are used if the name doesn't define them. Doing that would mean that ufuncs operating upon a mixture of custom and default strings would work out of the box. I think that is undesirable - mixing types is general not well advised.

There are more options here, but these are the main two that I can think of.

Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only thing I'm hesitant about is the name __name__. This will become an API forever, and if I were an outsider, I might think that __name__ is a display kind of thing, not a functional kind of thing. ("What would you like to call this array?") That is, a kind of documentation.

I suggested some alternatives like __behavior__, __overload__, and __override__. If it's one of the latter two, it will be hard to remember which, since they're usually synonymous. __behavior__ may be too general, since __record__ also affects the behavior.

__array_behavior__ would narrow into exactly what this is about (though it begs the question as to why __record__ is not __record_behavior__).

Maybe __derivation__ or __derived__? Or __subclass__ because it specifically puts subclasses of ak.Array onto the data and it's about making customized strings/bytes and customized categoricals.1

A long, long, long time ago, the word for this was going to be a "dressed" type. As opposed to undressed types. That seemed a little weird, though.

Before this goes into the wild, let's make sure we're not going to regret the name. If we pick "__subclass__", swapping all the "__name__" for "__subclass__" in this PR will be a quick find-and-replace.

Footnotes

  1. __subclass__ is my current favorite.

@agoose77
Copy link
Collaborator Author

agoose77 commented Jun 24, 2023

This will become an API forever, and if I were an outsider, I might think that name is a display kind of thing, not a functional kind of thing. (

Yes, this is unfortunate. I think the constraint at fault here is the existing use of __array__; it is already a good name for what we want to implement.

Maybe derivation or derived? Or subclass because it specifically puts subclasses of ak.Array onto the data and it's about making customized strings/bytes and customized categoricals.

One thing to account for is that there is a reason to use __name__ for non-string/categoricals. This PR reorients things so that __array__ is only used for strings / categoricals. The motivation for that was to avoid the '__array__:' 'string' property differing between built-in and custom strings i.e to keep existing string-aware logic invariant.

If we chose instead to make __name__ meaningful only for strings/categoricals, then users/library authors would need to do more work before successfully adding a nominal type to an arbitrary layout, i.e. check whether it's a string/categorical, and act accordingly.

@agoose77 agoose77 temporarily deployed to docs-preview June 26, 2023 11:54 — with GitHub Actions Inactive
@agoose77 agoose77 changed the title feat: add __name__ parameter for high-precedence non-record nominal types feat: add __subclass__ parameter for high-precedence non-record nominal types Jun 26, 2023
@agoose77
Copy link
Collaborator Author

agoose77 commented Jun 26, 2023

I've changed the PR to use __subclass__. I think my preferred solution is to make the existing usages of __array__ a different parameter like __kind__, and seeing as that's not on the table, I have no strong feelings on the __subtype__ __subclass__ name!

That said, we don't currently prevent users from setting __subtype__ __subclass__ if __array__ is not set. This might cause some confusion / bad practices.

@jpivarski
Copy link
Member

In fact, you switched mid-sentence between __subclass__ and __subtype__. But since this points to a class that inherits from ak.Array, using "class" in the name is perhaps better.

Oh! __array_class__? It doesn't matter that it's long; these are entirely targeted at library developers, not at all at casual users. This spelling ties in the fact that it needs to be coordinated with __array__.

Setting __array_class__ without __array__ doesn't make sense, and we can check for it in

def _init(self, parameters: dict[str, Any] | None, backend: Backend):

This wouldn't be the first long-range invariant: strings need to have __array__: "string" and their content's __array__: "char".

@agoose77
Copy link
Collaborator Author

Oh! array_class? It doesn't matter that it's long;

Yes, we will be rendering this parameter like __array__ is currently, i.e. var * NAME.

Setting __array_class__ without __array__ doesn't make sense, and we can check for it in

This is the part that I want to make sure we're on the same page about.

I originally wanted to make the change __array____kind__ to make it clear that __kind__ is fundamentally doing a different thing to __array__. In future, strings will drop the string behavior classes, meaning that __kind__ can be a non-behavior-class parameter. __kind__ would still have affected ufuncs etc, but that can all be explained in a table.

Given that we're not making this change, however, in my mind __array__ and __array_class__ are semantically identical apart from the places where we're asking about string-ness or categorical-ness; these places only look at __array__. For everything that isn't string/categorical specific, they're just higher-lower precedence names for __array__. Therefore, we prohibit setting __array_class__ without __array__.

I'm most fond of an __array_XXX__ name; in fact, I also proposed something like this at one point, but I can't see where... I've rewritten these PRs a few times now :P

@jpivarski
Copy link
Member

jpivarski commented Jun 26, 2023

This is how I've been understanding the intended future behavior (with some newfound clarity, having written and rewritten this comment many times):

__array__ has only 6 meaningful values: "string", "bytestring", "char", "byte", "categorical", and "sorted_map", which are all of the values that Awkward and Uproot have ever used for this parameter. Whenever the value of __array__ has some consequences for how the array works (in __getitem__, to_list, broadcasting, Arrow-conversion, validity-checking, etc.), those consequences will be hard-coded, not configurable through ak.behavior. Therefore, __array__ does not determine what Python class gets applied to a layout when that layout is at the top of the tree, because everything that it does is hard-coded.

__array_class__ determines what Python class gets applied to a layout when that layout is at the top of the tree. Therefore, __array_class__ is "the array/non-record equivalent of __record__."

Therefore, if we have the following,

class IPAddress(bytes):   # okay, this has to subclass from bytes, not ak.Array
    def do_an_ip_address_thing(self, arg):
        # compute something for a single IP address, like b'\xc0\x80\x01\x00' (192.128.1.0)
    def __repr__(self):
        return ".".join(map(str, np.frombuffer(self, "u1")))

class IPAddressArray(ak.Array):
    def do_an_ip_address_thing(self, arg):
        # compute something for an array of IP addresses (vectorized version of the above)

ak.behavior["IPAddress"] = IPAddress
ak.behavior["*", "IPAddress"] = IPAddressArray

on a listnode-containing-numpynode-of-uint8 layout with __array__: "bytestring" and __array_class__: "IPAddress" would make the array act like an array of bytestrings with the do_an_ip_address_thing method and repr. IP addresses are an Awkward subclass of bytestrings because __array__ gives it the hard-coded bytestring behaviors and __array_class__ gives library developers a way to define new methods.

Now that I think of it, there's no reason for __array_class__ to be limited to nodes that also define __array__. Being string-like/categorical-like/map-like though hard-coded, built-in code is orthogonal to what Python class gets overlaid on nodes that are at the top of the layout. It used to be that __array__ did both, but now we want to split up these two actions into two parameters.

Old:

  • __array__ has hard-coded consequences for some of its values.
  • __array__ also determines which Python class to to wrap a layout with.

New:

  • __array__ has hard-coded consequences for some of its values.
  • __array_class__ determines which Python class to to wrap a layout with.

For overloads, we lose the __array__/__record__ symmetry in favor of a less elegant __array_class__/__record__ pairing1, but the hard-coded behaviors that have become established in __array__ do not need to change.


In the example above, I included

ak.behavior["*", "IPAddress"] = IPAddressArray

which I just realized is possible, furthering the symmetry between __array_class__ and __record__ in a useful way. (If we had an IPAddress override, we would definitely want to vectorize its methods.)

Also, the class assigned through __array_class__ would have to be a subclass of str if __array__ is "string" or "char", it would have to be a subclass of bytes if __array__ is "bytestring" or "byte", and it would have to be a subclass of ak.Array otherwise. That's a long-range consequence of string/bytestring's hard-coded __getitem__ behavior, but not something that needs to be explicitly checked. (Library developers either get it right or they get an error.)

Footnotes

  1. If we really want to, we can introduce __record_class__ as the new spelling of __record__ and very slowly or never deprecate __record__. The deprecation period would have to be a year, and we shouldn't introduce warnings until after the summer is over.

@agoose77
Copy link
Collaborator Author

Purelist array classes

In the example above, I included

ak.behavior["*", "IPAddress"] = IPAddressArray

Currently the "*", NAME syntax applies only to arrays-of-records. Are you proposing that we increase the symmetry, and resolve __array__ at any list depth? I think this would be useful.

Now that I think of it, there's no reason for __array_class__ to be limited to nodes that also define __array__

OK, this aligns with my general feeling! This relaxation means we can speak about __array__ in a limited set of circumstances, and user/library code can safely set __array_class__.

@agoose77
Copy link
Collaborator Author

Dropping auxiliary behavior classes

A problem with having __array__ not affect the choice of array class is that byte and char currently have behavior classes that do something useful. (Incidentally, it appears I've made a mistake in #2528, because the byte/char classes are marked as deprecated).

Here are the methods for chat/byte:

  • __bytes__
  • __str__
  • __repr__
  • __iter__
  • __eq__
  • __neq__
  • __add__
  • __radd__

There is a pattern that is increasingly becoming clear to me; we're trying to move away from the behavior system as a mechanism for implementing core features. Instead, I think we're reframing behaviors as intended to override built-in features, not implement it directly?
If so, I think we want to remove the need for these ByteBehavior and CharBehavior classes:

Moving methods to ak.Array?

To solve this, we could extend ak.Array to define byte/char-aware __bytes__, __str__, and __repr__ (although I actually think __repr__ should not be defined specially for char/byte; it should be true to the actual (array) object).

Moving methods to ak.behavior?

We can also define ufunc overloads for the various ufuncs. That might be tricky; right now, Python str objects are cast (via __cast__) to "string", not "char". This means that my_array_of_char == "word" which would throw an exception. If we change this to produce "char", then the same is true for my_array_of_strings == "word". I'll give this problem some more thought, as all of these things are somewhat interconnected.

Don't use ak.behavior?

I might be ad-nauseam in my explanation here, but the one point I really want to keep alive is that __array_class__ and __record__ don't just set __class__, they also (orthogonally) determine ufunc overloads etc. We can use one of those features without the other.

But, should we in this case? This PR is moving in the direction that strings are a fundamental part of the Awkward type system. In fact, if we had no technical debt we could argue that string_type should be an attribute of ListOffsetArray and ListOffsetForm, rather than a parameter. However, that ship has long sailed. Stopping array-class resolution from considering __array__ implies to me that the same should hold true for ufuncs and other features. In other words, I'm suggesting that our __array__ no longer be a nominal type in any sense.

@jpivarski
Copy link
Member

Purelist array classes

Yes,

ak.behavior["*", "IPAddress"] = IPAddressArray

would be a new thing to make __array_class__ and __record__ more symmetric. Afterward, it may be that ak.mixin_class could be extended to apply to both record and array behaviors, but the fact that some array behavior classes would need to extend str and bytes makes that tricky.

@agoose77
Copy link
Collaborator Author

agoose77 commented Jun 26, 2023

It's easier to split these things into multiple comments...

Though the above is subject to change (you might not like the direction!), I'm going to summarise the history / where my head is at with all of this.

  1. Strings are complex enough that we decided to make them first-party citizens rather than add-ons that users could have created (feat!: drop string broadcasting overloading #2474)
  2. We decided that users should be able to define custom strings that obey intrinsic string rules (like broadcasting), but allow additional behaviors and ufunc/reducer overloads (Simplify customisation that is used to implement strings #2432)
  3. Here, we decided that __array__, the mechanism that supports strings/categoricals, should not be used for type-class resolution. We added a new __array_class__ that has much greater symmetry with __record__
  4. We propose that existing purelist array-class lookup also support names from __array_class__, not just __record__.
  5. Here, I propose that __array__ also not be considered for any nominal lookups, finalising the move towards a built-in type

(5) requires that we add built-in support to our existing overload mechanisms to recognise char/byte.

I'm writing all of this in a vacuum, though. We currently allow users to define ufunc overloads for unnamed strings. I'm assuming that we can find data to show that this is not important enough to protect.

@agoose77
Copy link
Collaborator Author

agoose77 commented Jun 26, 2023

but the fact that some array behavior classes would need to extend str and bytes makes that tricky.

Would they need this? I think it would be better if char and byte classes were ak.Arrays, but implemented __bytes__ and __str__ like we currently do

@jpivarski
Copy link
Member

Remember that those StringBehavior and ByteStringBehavior classes are not used anywhere:

% fgrep -r StringBehavior src/awkward --include="*.py"
src/awkward/behaviors/string.py:class ByteStringBehavior(Array):
src/awkward/behaviors/string.py:class StringBehavior(Array):

They're defined, but not ever instantiated. #2528 was not a mistake.

The generation of string representations, like 'HAL' and 'IBM' in

>>> ak.Array(
...     ak.contents.ListOffsetArray(
...         ak.index.Index64([0, 3, 6]),
...         ak.contents.NumpyArray(
...             np.array([72, 65, 76, 73, 66, 77], "u1"),
...             parameters={"__array__": "char"},
...         ),
...         parameters={"__array__": "string"},
...     )
... )
<Array ['HAL', 'IBM'] type='2 * string'>

are generated in a hard-coded way, not through the StringBehavior. It happens in ak.Array.__getitem__ here:

out = self._layout[where]
if isinstance(out, ak.contents.NumpyArray):
array_param = out.parameter("__array__")
if array_param == "byte":
return ak._util.tobytes(out._raw(numpy))
elif array_param == "char":
return ak._util.tobytes(out._raw(numpy)).decode(
errors="surrogateescape"
)

and in ak._prettyprint.get_at here:

out = data._layout._getitem_at(index)
if isinstance(out, ak.contents.NumpyArray):
array_param = out.parameter("__array__")
if array_param == "byte":
return ak._util.tobytes(out._raw(numpy))
elif array_param == "char":
return ak._util.tobytes(out._raw(numpy)).decode(errors="surrogateescape")

(because the pretty-printer bypasses ak.Array.__getitem__).

So there aren't methods currently defined in StringBehavior and ByteStringBehavior that need to be taken over; they were already taken over.

@jpivarski
Copy link
Member

I think it would be better if char and byte classes were ak.Arrays, but implemented __bytes__ and __str__

ak.Array.__getitem__ returns str and bytes for a single string or a single bytestring because it was requested: #873. Similarly, it returns None for missing values instead of an ak.masked.

@agoose77
Copy link
Collaborator Author

Remember that those StringBehavior and ByteStringBehavior classes are not used anywhere:

Right, but the same is not true of CharBehavior or ByteBehavior

I think it would be better if char and byte classes were ak.Arrays, but implemented __bytes__ and __str__

ak.Array.__getitem__ returns str and bytes for a single string or a single bytestring because it was requested: #873. Similarly, it returns None for missing values instead of an ak.masked.

I agree. However, if we have __array__: "char" layouts, then I think they should not be subclasses of byte/char. If you do ak.to_layout("hello"), it should be a char array that can be coerced to string, but isn't a string.

@jpivarski
Copy link
Member

Though the above is subject to change (you might not like the direction!), I'm going to summarise the history / where my head is at with all of this.

  1. Strings are complex enough that we decided to make them first-party citizens rather than add-ons that users could have created (feat!: drop string broadcasting overloading #2474)
  2. We decided that users should be able to define custom strings that obey intrinsic string rules (like broadcasting), but allow additional behaviors and ufunc/reducer overloads (Simplify customisation that is used to implement strings #2432)
  3. Here, we decided that __array__, the mechanism that supports strings/categoricals, should not be used for type-class resolution. We added a new __array_class__ that has much greater symmetry with __record__

Yes, we are on the same page about the 3 points above.

  1. We propose that existing purelist array-class lookup also support names from __array_class__, not just __record__.
    (4) requires that we add built-in support to our existing overload mechanisms to recognise strings.

Number 4 is a new point, and I think we're in agreement that it's a good idea. I don't understand the comment about needing to make our existing overload mechanisms recognize strings. The overload mechanisms will (future) involve only __array_class__, and the string-handling (now) involves only __array__. These two concerns have become independent.

  1. Here, I propose that __array__ also not be considered for any nominal lookups, finalising the move towards a built-in type

I had to re-read this to understand it, and I think we're on the same page about number 5 as well. __array__ will also not be considered for any nominal lookups; __array__ will be used exclusively for built-in behaviors.

I'm writing all of this in a vacuum, though. We currently allow users to define ufunc overloads for unnamed strings. I'm assuming that we can find data to show that this is not important enough to protect.

I understand that this would be breaking any current uses of the __array__ parameter, and I know that you're an example of one user who used it (though you were a power-user, even before you joined the team!). In my scan of API usage, I didn't even find any examples of record-overloading outside of Coffea and Vector, and __array__ (now __array_class__) overloading is more difficult with more constrained use-cases. It's also impossible to put into a deprecation cycle (the way that __record____record_class__ could be, if we want). So this breaking change appears to be safe and we can't do it in a non-breaking way, anyway.

@jpivarski
Copy link
Member

I see that CharBehavior and ByteBehavior haven't been extracted yet, but I think they should be.

% fgrep -r CharBehavior src/awkward --include="*.py"
src/awkward/behaviors/string.py:class CharBehavior(Array):
src/awkward/behaviors/string.py:        if isinstance(other, (str, CharBehavior)):
src/awkward/behaviors/string.py:        if isinstance(other, (str, CharBehavior)):
src/awkward/behaviors/string.py:        if isinstance(other, (str, CharBehavior)):
src/awkward/behaviors/string.py:    behavior["char"] = CharBehavior
% fgrep -r ByteBehavior src/awkward --include="*.py"
src/awkward/behaviors/string.py:class ByteBehavior(Array):
src/awkward/behaviors/string.py:        if isinstance(other, (bytes, ByteBehavior)):
src/awkward/behaviors/string.py:        if isinstance(other, (bytes, ByteBehavior)):
src/awkward/behaviors/string.py:        if isinstance(other, (bytes, ByteBehavior)):
src/awkward/behaviors/string.py:    behavior["byte"] = ByteBehavior

If array[100] returns an arbitrary-length str, then I'd strongly expect array[100, 5] to return a length-1 str. In particular, I'd expect array[100][5] and array[100, 5] to return the same thing (not something that can be coerced to the same thing).

On the other hand,

ak.to_layout("hello")

is the other direction: it must return a char array that can be coerced to string, but isn't a string. ak.Array.__getitem__ takes things out of the Awkward system and ak.to_layout puts them into it.

@agoose77
Copy link
Collaborator Author

I don't understand the comment about needing to make our existing overload mechanisms recognize strings. The overload mechanisms will (future) involve only array_class, and the string-handling (now) involves only array. These two concerns have become independent.

I should say "char" / "byte". We implement certain features via behavior overloads (ufuncs), which would need to be folded into the core ufunc machinery if we don't treat "char" as a nominal type.

@agoose77 agoose77 marked this pull request as draft June 28, 2023 15:11
@agoose77
Copy link
Collaborator Author

agoose77 commented Jun 28, 2023

Following from our meeting, we decided to introduce a new __list__ name instead of __subclass__. This will be scoped only to list types, and __record__ should be scoped only to records. Additionally, ("*", "name") should resolve lists to any depth (stopping at records!).

If we change the content rules such that __record__ and __list__ can only be added to their respective types, then we can create a new any_purelist_parameter function that finds the outermost of several parameters. This would be used to resolve the array class.

@agoose77
Copy link
Collaborator Author

Closing in favour of a new PR.

@agoose77 agoose77 closed this Jun 29, 2023
@agoose77 agoose77 deleted the agoose77/feat-array-name branch September 17, 2023 16:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Simplify customisation that is used to implement strings
2 participants