Awkward Array dtypes and mutating data in-place (in and out of Numba) #530

HDembinski · 2020-11-12T14:12:15Z

HDembinski
Nov 12, 2020
Maintainer

When writing generic transforms from awkward arrays to numpy arrays, it is necessary to get the dtype of the awkward array, in order to construct a matching numpy array. I could not find out how to get the dtype for an existing awkward array. I also need a way to get the dtype in Numba-compiled Python.

jpivarski · 2020-11-12T15:11:44Z

jpivarski
Nov 12, 2020
Maintainer

I'm also in a discussion about that here: data-apis/consortium-feedback#6 (the common array API development).

For arrays that can be converted to NumPy (i.e. ak.to_numpy does not raise an exception), it would be possible to talk about its dtype and shape, but not in the general case. I don't like the idea of a property that can raise exceptions—that would look like a bug, though it wouldn't be one in this case.

Since conversion to NumPy is zero-copy when it's not an error, np.asarray(awkward_array).dtype wouldn't be a performance hit.

I'm not sure what would be the right interface for this. Maybe ak.to_dtype(awkward_array.type)? Then such a function, ak.to_dtype could also raise exceptions, so it's not clearly better than np.asarray(awkward_array).dtype.

HDembinski · 2020-11-12T18:25:25Z

HDembinski
Nov 12, 2020
Maintainer Author

ak.to_dtype is fine if you cannot have a property on the array (I don't understand, but I trust that you have a better overview, awkward1 is so general, it is difficult to reason about from my layman's perspective). np.asarray(awkward_array).dtype seems very inelegant. Would it work in numba?

HDembinski · 2020-11-12T18:30:35Z

HDembinski
Nov 12, 2020
Maintainer Author

data-apis/consortium-feedback#6 seems to suggest that .dtype should be present when the represented data structure has a dtype.

jpivarski · 2020-11-12T19:40:29Z

jpivarski
Nov 12, 2020
Maintainer

We're having that conversation. The array API depends crucially on .dtype and .shape properties, but it's not obvious that Awkward arrays should have them. There are several interface possibilities:

1. Keep Awkward Array out of the array API effort.
2. Define a subset of the array API that doesn't need .dtype and .shape, though this cuts deep into functionality.
3. Add .dtype and .shape to Awkward arrays and let them raise exceptions when they don't make sense. I'm afraid that asking for a property and getting an exception would look like a bug, even though it's intended to be important feedback to the user about the structure of their data.
4. Add .dtype and .shape to Awkward arrays and have them mean what the array would look like if forced into a rectilinear array, either by padding or dtype="O". This would require corresponding changes to ak.to_numpy (a.k.a. __array__) to perform that forcing. I was surprised to learn that's what pyarrow does; I think it's very unsafe. Not only are dtype="O" arrays slow, but they lack all the slicing features. I don't want users to end up mixing them and getting confused when things don't work. (That happened a lot with Uproot 3 ObjectArrays.)

So at the moment, I'm not sure what the best course of action is.
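
The concern about dtype="O" can be reproduced with plain NumPy, without involving Awkward Array at all. The following is only an illustrative sketch (the variable names are made up for this example): a rectilinear array factorizes into a shape and a numeric dtype, while the ragged one collapses into a one-dimensional array of Python lists that no longer supports multidimensional slicing.

```python
import numpy as np

# A rectilinear array has a meaningful shape and a numeric dtype.
rectangular = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int64)

# Forcing ragged data into NumPy requires dtype=object: the result is a
# 1D array whose items are ordinary Python lists.
ragged = np.array([[1, 2, 3], [], [4, 5]], dtype=object)

# Column slicing works on the rectilinear array...
first_column = rectangular[:, 0]

# ...but not on the object array, which has only one dimension.
try:
    ragged[:, 0]
    ragged_slicing_works = True
except IndexError:
    ragged_slicing_works = False
```

Here rectangular has shape (2, 3) and dtype int64, whereas ragged has shape (3,) and dtype object, so its dtype says nothing about the integers stored inside the lists.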

HDembinski · 2020-11-13T10:37:37Z

HDembinski
Nov 13, 2020
Maintainer Author

What I want to do is take a ListOffsetArray, make a numpy array with the same length as the total size of ListOffsetArray (by which I mean len of layout.content) and the same dtype, which I then fill with computed content from the original ListOffsetArray. I need to write array transforms like this and I cannot assume that my ListOffsetArrays are all storing double. The transform must also work with Awkward arrays that store floats, too, and they should then also produce corresponding numpy arrays with the same dtype.
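
To make the request concrete, here is a minimal sketch of such a transform in plain NumPy. It assumes the offsets and the flat content have already been extracted as NumPy arrays; the function name transform_per_list and the per-list operation (subtracting the list minimum) are hypothetical and only serve as an illustration.

```python
import numpy as np

def transform_per_list(offsets, content):
    """Return a flat array with the same length and dtype as `content`,
    filled list by list (here: each value minus its list's minimum)."""
    out = np.empty_like(content)  # inherits length and dtype from content
    for start, stop in zip(offsets[:-1], offsets[1:]):
        if stop > start:  # skip empty lists
            chunk = content[start:stop]
            out[start:stop] = chunk - chunk.min()
    return out

offsets = np.array([0, 3, 3, 5], dtype=np.int64)

# The same function handles single-precision floats and integers.
content_f32 = np.array([1.5, 2.5, 3.5, 4.0, 6.0], dtype=np.float32)
content_i64 = np.array([1, 2, 3, 4, 6], dtype=np.int64)

result_f32 = transform_per_list(offsets, content_f32)
result_i64 = transform_per_list(offsets, content_i64)
```

The point is only that the output dtype follows the input dtype (float32 in, float32 out), which is what requires knowing the dtype of the content in the first place.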

I think you need different layers of abstraction. Perhaps at the highest level of abstraction only a subset of the NumPy interface can exist, for example no dtype. But lower levels, such as ListOffsetArray, should have a dtype.

I know very well from designing Boost Histogram how difficult it is to balance an abstraction that is very general and offers a very uniform interface over a large range of specific implementations. But as pointed out above, abstractions can be layered. One example is the classes of iterators in C++: there are random-access ones, bidirectional ones, unidirectional ones, and those which only allow either reading or writing. Together they form a hierarchy. As you go up in the abstraction hierarchy, fewer and fewer common operations are supported.

jpivarski · 2020-11-13T16:34:50Z

jpivarski
Nov 13, 2020
Maintainer

We might be using words in different ways: the following ListOffsetArray can't have a dtype.

>>> array = ak.Array([[{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}], [], [{"x": 3.3, "y": [1, 2, 3]}]])
>>> print(array)
[[{x: 1.1, y: [1]}, {x: 2.2, y: [1, 2]}], [], [{x: 3.3, y: [1, 2, 3]}]]
>>> array.type
3 * var * {"x": float64, "y": var * int64}
>>> array.layout
<ListOffsetArray64>
    <offsets><Index64 i="[0 2 2 3]" offset="0" length="4" at="0x555b16c4f880"/></offsets>
    <content><RecordArray>
        <field index="0" key="x">
            <NumpyArray format="d" shape="3" data="1.1 2.2 3.3" at="0x555b16c51890"/>
        </field>
        <field index="1" key="y">
            <ListOffsetArray64>
                <offsets><Index64 i="[0 1 3 6]" offset="0" length="4" at="0x555b16c538a0"/></offsets>
                <content><NumpyArray format="l" shape="6" data="1 1 2 1 2 3" at="0x555b16c558b0"/></content>
            </ListOffsetArray64>
        </field>
    </RecordArray></content>
</ListOffsetArray64>

ListOffsetArrays can contain any other node type as their content. In this case, records. NumPy dtypes can say that "x" has float64 type and "y" has int64 type (i.e. dtype=[("x", np.float64), ("y", np.int64)]), but how do we say that "y" contains lists of integers and "x" contains non-lists of floats? Different parts of the array have different numbers of dimensions.
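
The limitation can be seen directly in NumPy. This is a sketch using only NumPy structured arrays (the names flat and nested are illustrative): a structured dtype describes the flat record case well, but as soon as "y" holds variable-length lists, the only option is an object field, which carries no information about the integers inside.

```python
import numpy as np

# Flat records: every field is a fixed-size number, so a structured dtype fits.
flat = np.array(
    [(1.1, 1), (2.2, 2)],
    dtype=[("x", np.float64), ("y", np.int64)],
)

# Records whose "y" is a variable-length list: NumPy can only store the
# lists as opaque Python objects.
nested = np.array(
    [(1.1, [1]), (2.2, [1, 2])],
    dtype=[("x", np.float64), ("y", object)],
)

flat_y_dtype = flat["y"].dtype      # int64: the element type is visible
nested_y_dtype = nested["y"].dtype  # object: the element type is hidden
```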

I think you might mean "ListOffsetArray whose content is NumpyArray," which is very much a corner case. Users posting on Uproot and Awkward's GitHub Issues frequently need more than that. If you have an application where you know that you only ever have this structure, you can build on that, but if I'm going to add dtype as a property on ak.Array, then I'll need to deal with the consequences of full generality.

Some libraries are designed as you've described: with layered abstractions as different types, such as C++'s basic iterators, random-access ones, bidirectional ones, etc. Making all of the node types visible in Awkward 0 (as well as using plain NumPy arrays when no structure was needed) was a mistake because it exposes details that are not relevant to analyzers doing data analysis. For instance, these two arrays have different layouts, but identical meaning from a data analysis point of view:

>>> one = ak.Array([[1, 2, 3], [999], [], [123, 123], [3, 4]])[[0, 2, 4]]
>>> two = ak.Array([[1, 2, 3], [], [3, 4]])
>>> one
<Array [[1, 2, 3], [], [3, 4]] type='3 * var * int64'>
>>> two
<Array [[1, 2, 3], [], [3, 4]] type='3 * var * int64'>
>>> one.layout
<ListArray64>
    <starts><Index64 i="[0 4 6]" offset="0" length="3" at="0x557adb2f1dd0"/></starts>
    <stops><Index64 i="[3 4 8]" offset="0" length="3" at="0x557adb2f1e10"/></stops>
    <content><NumpyArray format="l" shape="8" data="1 2 3 999 123 123 3 4" at="0x557adb2ed880"/></content>
</ListArray64>
>>> two.layout
<ListOffsetArray64>
    <offsets><Index64 i="[0 3 3 5]" offset="0" length="4" at="0x557adb2f20e0"/></offsets>
    <content><NumpyArray format="l" shape="5" data="1 2 3 3 4" at="0x557adb2f6110"/></content>
</ListOffsetArray64>

The only thing that happened differently is that one was sliced and two was not. In Awkward 0, users accessed these nodes (or NumPy arrays as leaf nodes) directly, and frequently made mistakes with them because these differences have nothing to do with the fact that they both represent [[1, 2, 3], [], [3, 4]]. ListArray vs ListOffsetArray is a good illustration of the abstraction-layering that you described: ListArray is more general than ListOffsetArray, with more capabilities, just like bidirectional vs unidirectional iterators.
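
As a rough illustration of why the two layouts are equivalent, the buffers printed above can be decoded by hand with plain NumPy. The helper names below are hypothetical and this is not how Awkward Array itself is implemented; it only shows that starts/stops and offsets are two encodings of the same nested lists.

```python
import numpy as np

def lists_from_starts_stops(starts, stops, content):
    """Decode a ListArray-style layout: list i is content[starts[i]:stops[i]]."""
    return [content[a:b].tolist() for a, b in zip(starts, stops)]

def lists_from_offsets(offsets, content):
    """Decode a ListOffsetArray-style layout: starts and stops are both
    taken from one offsets buffer, so lists must be contiguous and in order."""
    return lists_from_starts_stops(offsets[:-1], offsets[1:], content)

# Buffers of `one` (sliced): unused items 999 and 123 remain in the content.
one_lists = lists_from_starts_stops(
    np.array([0, 4, 6]),
    np.array([3, 4, 8]),
    np.array([1, 2, 3, 999, 123, 123, 3, 4]),
)

# Buffers of `two` (built directly): compact content.
two_lists = lists_from_offsets(
    np.array([0, 3, 3, 5]),
    np.array([1, 2, 3, 3, 4]),
)
```

Both decode to [[1, 2, 3], [], [3, 4]], even though the underlying content buffers have different lengths.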

But whereas C++ users are encouraged to think about how much generality they need for a given algorithm, Python users are not encouraged to think about that: lists, tuples, sequences, and iterables fly around among duck-typed functions that someone wrote while thinking about a very different problem than how much abstraction they need. NumPy made a good design choice in presenting only one type, np.ndarray, without making users worry about whether they're C-contiguous, Fortran-contiguous, or something else, for normal usage (without also making it impossible to get at this information, when needed). Awkward 1 is following the same model: all non-scalars are ak.Array instances, type is a generalization of dtype and shape, and layout is a generalization of shape and strides.

HDembinski · 2020-11-17T15:26:12Z

HDembinski
Nov 17, 2020
Maintainer Author

TTree has VARLEN arrays, which map to [[1, 2, 3], [4], [5, 6]]. In my LHCb analysis, we exclusively use TTrees with this data structure. So for us it is not a corner case, it is the standard case. We build our whole analysis on the simplest hierarchical TTree in which each "event" contains values and VARLEN arrays. Processing them with uproot (not uproot4) worked very well. It is a little frightening that you consider us a corner case.

HDembinski · 2020-11-17T15:37:38Z

HDembinski
Nov 17, 2020
Maintainer Author

I obviously don't understand the new awkward library as well as you do, so yes, I was not aware that there are ListOffsetArrays which can contain other ListOffsetArrays as "dtype". This nesting property is of course very nice and makes this super general.

I merely want to be able to get the dtype for a ListOffsetArray that contains a NumpyArray. My needs would be satisfied if there were an ak.dtype function that returned the appropriate dtype for this special case and returned None or raised an exception (as you see fit) if the ListOffsetArray does not contain a NumpyArray. This function should ideally work from within Numba-compiled code.

Edit: I would prefer if it returned None instead of raising exceptions.

jpivarski · 2020-11-17T16:12:38Z

jpivarski
Nov 17, 2020
Maintainer

I shouldn't have used the word "corner case" because that word implies an unimportant case. That's not what I meant. This is an important case, but it's not more important than many of the others that have come up in issues. It's definitely not "on the fringe," but it's also not so overwhelmingly more central that other cases have to work around it.

Defining dtype in such a way that it means the Numbers in an Array[List[Numbers]] would make it unclear how to define it for an Array[List[List[Numbers]]] or an Array[Record[{x: Numbers1, y: Numbers2}]] (or an Array[Union[Numbers1, Numbers2]], though unions are treated as second-class citizens—they can't be used in Numba, for instance—unions are a "corner case," partly supported).

The big problem here is that NumPy started a convention of defining the types of arrays in terms of what is like a product of two descriptors, the shape and the dtype. Rectilinear arrays can be factorized into these two descriptors, but non-rectilinear arrays are not factorizable in this way. The whole project of "applying NumPy-like idioms to JSON-like data" (Awkward Array's reason for being) requires a non-factorized way of expressing types. Datashape was invented for this purpose (for the now-defunct DyND project, which was essentially what Awkward Array is), so we use that.

Knowing only the depth of your arrays, the following should always be able to get the content dtype, and it's a metadata-only operation (O(1) in the length of the array):

>>> def get_dtype(one_level_deep):
...     return np.asarray(one_level_deep[0:0]).dtype
... 
>>> get_dtype(ak.Array([[1, 2, 3], [], [4, 5]]))
dtype('int64')
>>> get_dtype(ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]]))
dtype('float64')

The advantage of this is that it does not rely on any details of the layout (ListArray/ListOffsetArray/IndexedArray), only your knowledge that it's one level deep. Since an empty array is sliced, it will always be rectilinear and never raise exceptions. Come to think of it, this would also work on deeper arrays because slicing length zero forces all sub and sub-sub lists to also have length zero, and hence it's always rectilinear. It won't work for records, missing data, or other structures, though. If it were a function in the Awkward Array library, it would have to have qualifications to explain all of that.

The trick is getting that to work in Numba, which relies on #509 and (transitively) Numba Discourse 338. From what I currently understand, I can see how to make an explicit np.asarray(·) work, but not the implicit ones.

This will work, though it is strictly limited to arrays that are exactly one level deep and non-empty:

>>> @nb.njit
... def get_dtype(one_level_deep):
...     for nested_list in one_level_deep:
...         for item in nested_list:
...             return np.array(item).dtype
...     raise ValueError("it's empty!")
... 
>>> get_dtype(ak.Array([[1, 2, 3], [], [4, 5]]))
dtype('int64')
>>> get_dtype(ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]]))
dtype('float64')

and it can be used in other Numba functions without the entry/exit cost:

>>> @nb.njit
... def another_function(one_level_deep):
...     dt = get_dtype(one_level_deep)
...     return np.zeros(10, dt)
... 
>>> another_function(ak.Array([[1, 2, 3], [], [4, 5]]))
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
>>> another_function(ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]]))
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

It is important for get_dtype to raise an exception in the empty case because dtypes are Numba constants that can't be unified with each other. An exception has bottom type, which can be unified with anything.

Once I've implemented np.asarray(Awkward Array) in Numba, then this explicit

>>> @nb.njit
... def get_dtype(one_level_deep):
...     return np.asarray(one_level_deep).dtype

will work (it doesn't yet), and that open question on Numba's Discourse is about getting it to work implicitly.

jpivarski · 2020-11-18T03:38:32Z

jpivarski
Nov 18, 2020
Maintainer

I'm losing track of the conversation between #528, #530 (this issue), and #532, but I think the following resolves everything we've talked about: in the master branch, you can now use np.array (always copies, exception if not rectilinear) and np.asarray (always views, only applies to 1D flat arrays) on Awkward Arrays in Numba-compiled functions. The np.asarray allows for mutability.

For example,

>>> @nb.njit
... def change(array):
...     for subarray in array:
...         np_view = np.asarray(subarray)
...         np_view *= 10
... 
>>> array = ak.Array([[1, 2, 3], [], [4, 5]])
>>> change(array)
>>> array
<Array [[10, 20, 30], [], [40, 50]] type='3 * var * int64'>
>>> change(array)
>>> array
<Array [[100, 200, 300], [], [400, 500]] type='3 * var * int64'>

The cast to a NumPy array has to be explicit (I'm still working on how to do implicit casts in Numba, if it's even possible), and the new array is a writable view that does not own the data:

>>> nparray.flags
  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False
>>> nparray.base is None
True

I think it could segfault if you hold onto a reference of the NumPy array (outside of Numba; i.e. you return it) and the original Awkward Array gets garbage collected. I think that's in general true of NumPy arrays wrapping a pointer that they do not own (and don't have a convenient Python object to assign to the NumPy array's base). These should be used for short excursions—I'll explain that in the documentation.

Incidentally, the same code above works outside of Numba:

>>> array = ak.Array([[1, 2, 3], [], [4, 5]])
>>> for subarray in array:
...     np_view = np.asarray(subarray)
...     np_view *= 10
... 
>>> print(array)
[[10, 20, 30], [], [40, 50]]

But outside of Numba, I don't have the technical issue setting the base, so this array is safe.

>>> np_view.flags
  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

>>> np_view.base
<memory at 0x7fee9ef9cf40>

That memory is the Awkward NumpyArray.
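
The copy-versus-view distinction behind all of this can be sketched with plain NumPy buffers standing in for the Awkward layout (content and offsets here are hand-built stand-ins, not objects taken from Awkward Array). A view keeps a reference to its owner through base, which is what keeps the memory alive; a copy is independent.

```python
import numpy as np

# Hand-built stand-ins for the flat content and list offsets of
# [[1, 2, 3], [], [4, 5]].
content = np.array([1, 2, 3, 4, 5], dtype=np.int64)
offsets = np.array([0, 3, 3, 5], dtype=np.int64)

# np.array copies by default: modifying the copy leaves content untouched.
copied = np.array(content[0:3])
copied[:] = 0
content_after_copy = content.tolist()

# Basic slicing returns a view, and np.asarray does not copy an ndarray,
# so in-place operations write through to content.
for start, stop in zip(offsets[:-1], offsets[1:]):
    view = np.asarray(content[start:stop])
    view *= 10

view_owns_data = view.flags.owndata    # False: the view borrows its memory
view_base_is_content = view.base is content  # True: base keeps the owner alive
```

After the loop, content holds 10, 20, 30, 40, 50. In this pure-NumPy case base points at the owning array, which corresponds to the safe situation outside Numba described above; the Numba case differs precisely because base is None there.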
