Awkward Array dtypes and mutating data in-place (in and out of Numba) #530
Replies: 10 comments
-
I'm also in a discussion about that here: data-apis/consortium-feedback#6 (the common array API development). For arrays that can be converted to NumPy (i.e. ak.to_numpy does not raise an exception), it would be possible to talk about its Since conversion to NumPy is zero-copy when it's not an error, I'm not sure what would be the right interface for this. Maybe |
-
|
-
data-apis/consortium-feedback#6 seems to suggest that |
-
We're having that conversation. The array API depends crucially on
So at the moment, I'm not sure what the best course of action is. |
-
What I want to do is take a ListOffsetArray, make a NumPy array with the same length as the total size of the ListOffsetArray (by which I mean the len of layout.content) and the same dtype, and then fill it with content computed from the original ListOffsetArray. I need to write array transforms like this, and I cannot assume that my ListOffsetArrays all store doubles. The transform must also work with Awkward arrays that store floats, and those should produce NumPy arrays with the corresponding dtype.
I think you need different layers of abstraction. Perhaps at the highest level of abstraction only a subset of the NumPy interface can exist, for example without dtype. But lower levels, such as ListOffsetArray, should have a dtype. I know very well from designing Boost Histogram how difficult it is to balance an abstraction that is very general and offers a uniform interface over a large range of specific implementations. But as pointed out above, abstractions can be layered. One example is the different classes of iterators in C++: there are random-access ones, bidirectional ones, unidirectional ones, and those that allow only reading or only writing. Together they form a hierarchy. As you go up in the abstraction hierarchy, fewer and fewer common operations are supported.
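A minimal sketch of the kind of transform described above, modeling a one-level-deep ListOffsetArray as two plain NumPy arrays, `offsets` and `content`. The helper name `subtract_list_mean` and the computation it performs are hypothetical and chosen only for illustration; the point is that `np.empty_like` lets the output inherit the dtype of the content, whether that is float32 or float64.

```python
import numpy as np

def subtract_list_mean(offsets, content):
    """Hypothetical transform: subtract each sublist's mean from its items."""
    # The output has the same length and the same dtype as the flat content.
    out = np.empty_like(content)
    for i in range(len(offsets) - 1):
        start, stop = offsets[i], offsets[i + 1]
        if stop > start:  # empty sublists contribute no items
            sublist = content[start:stop]
            out[start:stop] = sublist - sublist.mean()
    return out

# Stand-in for the layout of [[1, 2, 3], [], [4, 6]].
offsets = np.array([0, 3, 3, 5], dtype=np.int64)
for dt in (np.float32, np.float64):
    content = np.array([1, 2, 3, 4, 6], dtype=dt)
    result = subtract_list_mean(offsets, content)
    print(result.dtype, result)
```

The function never names a concrete dtype, so the same code serves both storage types.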
-
We might be using words in different ways: the following ListOffsetArray can't have a dtype.
>>> array = ak.Array([[{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}], [], [{"x": 3.3, "y": [1, 2, 3]}]])
>>> print(array)
[[{x: 1.1, y: [1]}, {x: 2.2, y: [1, 2]}], [], [{x: 3.3, y: [1, 2, 3]}]]
>>> array.type
3 * var * {"x": float64, "y": var * int64}
>>> array.layout
<ListOffsetArray64>
<offsets><Index64 i="[0 2 2 3]" offset="0" length="4" at="0x555b16c4f880"/></offsets>
<content><RecordArray>
<field index="0" key="x">
<NumpyArray format="d" shape="3" data="1.1 2.2 3.3" at="0x555b16c51890"/>
</field>
<field index="1" key="y">
<ListOffsetArray64>
<offsets><Index64 i="[0 1 3 6]" offset="0" length="4" at="0x555b16c538a0"/></offsets>
<content><NumpyArray format="l" shape="6" data="1 1 2 1 2 3" at="0x555b16c558b0"/></content>
</ListOffsetArray64>
</field>
</RecordArray></content>
</ListOffsetArray64>
ListOffsetArrays can contain any other node type as their I think you might mean "ListOffsetArray whose Some libraries are designed as you've described: with layered abstractions as different types, such as C++'s basic iterators, random-access ones, bidirectional ones, etc. Making all of the node types visible in Awkward 0 (as well as using plain NumPy arrays when no structure was needed) was a mistake because it exposes details that are not relevant to analyzers doing data analysis. For instance, these two arrays have different layouts, but identical meaning from a data analysis point of view:
>>> one = ak.Array([[1, 2, 3], [999], [], [123, 123], [3, 4]])[[0, 2, 4]]
>>> two = ak.Array([[1, 2, 3], [], [3, 4]])
>>> one
<Array [[1, 2, 3], [], [3, 4]] type='3 * var * int64'>
>>> two
<Array [[1, 2, 3], [], [3, 4]] type='3 * var * int64'>
>>> one.layout
<ListArray64>
<starts><Index64 i="[0 4 6]" offset="0" length="3" at="0x557adb2f1dd0"/></starts>
<stops><Index64 i="[3 4 8]" offset="0" length="3" at="0x557adb2f1e10"/></stops>
<content><NumpyArray format="l" shape="8" data="1 2 3 999 123 123 3 4" at="0x557adb2ed880"/></content>
</ListArray64>
>>> two.layout
<ListOffsetArray64>
<offsets><Index64 i="[0 3 3 5]" offset="0" length="4" at="0x557adb2f20e0"/></offsets>
<content><NumpyArray format="l" shape="5" data="1 2 3 3 4" at="0x557adb2f6110"/></content>
</ListOffsetArray64>
The only thing that happened differently is that But whereas C++ users are encouraged to think about how much generality they need for a given algorithm, Python users are not encouraged to think about that: lists, tuples, sequences, and iterables fly around among duck-typed functions that someone wrote while thinking about a very different problem than how much abstraction they need. NumPy made a good design choice in presenting only one type, |
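To make the equivalence of the two layouts above concrete, here is a small sketch in plain NumPy that rebuilds the Python lists from each set of buffers. The index and content values are copied from the layouts printed above; the helper names `lists_from_starts_stops` and `lists_from_offsets` are hypothetical and not part of Awkward Array.

```python
import numpy as np

def lists_from_starts_stops(starts, stops, content):
    # ListArray-style: each sublist is content[start:stop]; gaps are allowed.
    return [content[a:b].tolist() for a, b in zip(starts, stops)]

def lists_from_offsets(offsets, content):
    # ListOffsetArray-style: starts and stops are adjacent entries of offsets.
    return lists_from_starts_stops(offsets[:-1], offsets[1:], content)

one = lists_from_starts_stops(
    np.array([0, 4, 6]),
    np.array([3, 4, 8]),
    np.array([1, 2, 3, 999, 123, 123, 3, 4]),
)
two = lists_from_offsets(np.array([0, 3, 3, 5]), np.array([1, 2, 3, 3, 4]))

print(one)
print(two)
```

Both calls yield the same nested lists even though the first content buffer still carries the unreferenced values 999 and 123.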
-
TTree has VARLEN arrays, which map to [[1, 2, 3], [4], [5, 6]]. In my LHCb analysis, we exclusively use TTrees with this data structure. So for us it is not a corner case, it is the standard case. We build our whole analysis on the simplest hierarchical TTree in which each "event" contains values and VARLEN arrays. Processing them with uproot (not uproot4) worked very well. It is a little frightening that you consider us a corner case.
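As an illustration of this data structure, the sketch below lays out [[1, 2, 3], [4], [5, 6]] as one flat content array plus offsets. Starting from per-event counts is an assumption made for the example, not a description of how uproot reads the file.

```python
import numpy as np

# Assumed inputs: the number of items per event and the flattened values.
counts = np.array([3, 1, 2])
content = np.array([1, 2, 3, 4, 5, 6])

# Offsets are the running sum of the counts, with a leading zero.
offsets = np.zeros(len(counts) + 1, dtype=np.int64)
offsets[1:] = np.cumsum(counts)

events = [content[offsets[i]:offsets[i + 1]].tolist() for i in range(len(counts))]
print(offsets)
print(events)
```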
-
I obviously don't understand the new awkward library as well as you do, so yes, I was not aware that there are ListOffsetArrays which can contain other ListOffsetArrays as "dtype". This nesting property is of course very nice and makes this super general. I merely want to be able to get the dtype for a ListOffsetArray that contains a NumpyArray. My needs would be satisfied if there was a Edit: I would prefer if it returned |
-
I shouldn't have used the word "corner case" because that word implies an unimportant case. That's not what I meant. This is an important case, but it's not more important than many of the others that have come up in issues. It's definitely not "on the fringe," but it's also not so overwhelmingly more central that other cases have to work around it. Defining The big problem here is that NumPy started a convention of defining the types of arrays in terms of what is like a product of two descriptors, the Knowing only the depth of your arrays, the following should always be able to get the content
>>> def get_dtype(one_level_deep):
...     return np.asarray(one_level_deep[0:0]).dtype
...
>>> get_dtype(ak.Array([[1, 2, 3], [], [4, 5]]))
dtype('int64')
>>> get_dtype(ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]]))
dtype('float64')
The advantage of this is that it does not rely on any details of the layout (ListArray/ListOffsetArray/IndexedArray), only your knowledge that it's one level deep. Since an empty array is sliced, it will always be rectilinear and never raise exceptions. Come to think of it, this would also work on deeper arrays because slicing length zero forces all sub and sub-sub lists to also have length zero, and hence it's always rectilinear. It won't work for records, missing data, or other structures, though. If it were a function in the Awkward Array library, it would have to have qualifications to explain all of that. The trick is getting that to work in Numba, which relies on #509 and (transitively) Numba Discourse 338. From what I currently understand, I can see how to make an explicit This will work, though it is strictly limited to arrays that are exactly one level deep and non-empty:
>>> @nb.njit
... def get_dtype(one_level_deep):
...     for nested_list in one_level_deep:
...         for item in nested_list:
...             return np.array(item).dtype
...     raise ValueError("it's empty!")
...
>>> get_dtype(ak.Array([[1, 2, 3], [], [4, 5]]))
dtype('int64')
>>> get_dtype(ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]]))
dtype('float64')
and it can be used in other Numba functions without the entry/exit cost:
>>> @nb.njit
... def another_function(one_level_deep):
...     dt = get_dtype(one_level_deep)
...     return np.zeros(10, dt)
...
>>> another_function(ak.Array([[1, 2, 3], [], [4, 5]]))
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
>>> another_function(ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]]))
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
It is important for the Once I've implemented
>>> @nb.njit
... def get_dtype(one_level_deep):
...     return np.asarray(one_level_deep).dtype
will work (it doesn't yet), and that open question on Numba's Discourse is about getting it to work implicitly.
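The empty-slice idea from this comment can be seen in isolation with plain NumPy: a zero-length slice reads no items yet still carries the dtype, so it can be used to allocate a matching output. This is only a sketch with NumPy arrays standing in for the data; the helper names `dtype_of` and `zeros_matching` are hypothetical.

```python
import numpy as np

def dtype_of(array_like):
    # A zero-length slice touches no items but keeps the dtype.
    return np.asarray(array_like[0:0]).dtype

def zeros_matching(array_like, n):
    # Allocate n zeros with the same dtype as the input.
    return np.zeros(n, dtype_of(array_like))

ints = np.array([1, 2, 3], dtype=np.int64)
floats = np.array([1.1, 2.2, 3.3], dtype=np.float32)

print(dtype_of(ints), dtype_of(floats))
print(zeros_matching(floats, 4))
```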
-
I'm losing track of the conversation between #528, #530 (this issue), and #532, but I think the following resolves everything we've talked about: in the master branch, you can now use For example,
>>> @nb.njit
... def change(array):
...     for subarray in array:
...         np_view = np.asarray(subarray)
...         np_view *= 10
...
>>> array = ak.Array([[1, 2, 3], [], [4, 5]])
>>> change(array)
>>> array
<Array [[10, 20, 30], [], [40, 50]] type='3 * var * int64'>
>>> change(array)
>>> array
<Array [[100, 200, 300], [], [400, 500]] type='3 * var * int64'>
The cast to a NumPy array has to be explicit (I'm still working on how to do implicit casts in Numba, if it's even possible), and the new array is a writable view that does not own the data:
>>> nparray.flags
C_CONTIGUOUS : True
F_CONTIGUOUS : True
OWNDATA : False
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False
>>> nparray.base is None
True
I think it could segfault if you hold onto a reference of the NumPy array (outside of Numba; i.e. you return it) and the original Awkward Array gets garbage collected. I think that's in general true of NumPy arrays wrapping a pointer that they do not own (and don't have a convenient Python object to assign to the NumPy array's Incidentally, the same code above works outside of Numba:
>>> array = ak.Array([[1, 2, 3], [], [4, 5]])
>>> for subarray in array:
...     np_view = np.asarray(subarray)
...     np_view *= 10
...
>>> print(array)
[[10, 20, 30], [], [40, 50]]
But outside of Numba, I don't have the technical issue setting the
>>> np_view.flags
C_CONTIGUOUS : True
F_CONTIGUOUS : True
OWNDATA : False
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False
>>> np_view.base
<memory at 0x7fee9ef9cf40> That |
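The view semantics discussed in this comment can be reproduced with NumPy alone. In the sketch below, `content` and `offsets` are stand-ins for the buffers of [[1, 2, 3], [], [4, 5]], not Awkward's actual implementation: each sublist is a basic slice of the content, so multiplying the slice in place changes the shared buffer, the slice reports that it does not own its data, and its `base` attribute holds a reference to the owning array.

```python
import numpy as np

content = np.array([1, 2, 3, 4, 5], dtype=np.int64)
offsets = np.array([0, 3, 3, 5], dtype=np.int64)

for i in range(len(offsets) - 1):
    # Basic slicing returns a view, not a copy.
    np_view = content[offsets[i]:offsets[i + 1]]
    np_view *= 10  # in-place: writes through to content

print(content)
print(np_view.flags.owndata)    # the view does not own its memory
print(np_view.base is content)  # the view keeps its owner referenced
```

Because `base` refers to the owner, the underlying memory stays valid for as long as the view itself is reachable.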
-
When writing generic transforms from Awkward Arrays to NumPy arrays, it is necessary to get the dtype of the Awkward Array in order to construct a matching NumPy array. I could not find out how to get the dtype of an existing Awkward Array. I also need a way to get the dtype in Numba-compiled Python.