Dear experts, I cannot find a straightforward way to perform a particular calculation with ak array operations. I was wondering whether I am missing something, or whether you could point me to the relevant functions for such a task. The description of the problem is below. I have the following two ak arrays, for instance:

```python
ak_array_1 = ak.Array([[1, 6], [2]])
ak_array_2 = ak.Array([[0, 4, 7], [2]])
```

For each element of `ak_array_2`, I would like the distance to the closest element of `ak_array_1` in the same list: 0 is closest to 1 and the difference is 1, 4 is closest to 6 and the difference is 2, etc. I found a solution (see below), but it is quite convoluted, and I was wondering if there exists a simpler one:

```python
for idx in range(ak.max(ak.num(ak_array_2, axis=1), axis=0)):
    ak_array_2_masked = ak.mask(ak_array_2, ak.count(ak_array_2, axis=1) > idx)
    ak_array_2_one_index = ak_array_2_masked[:, idx][:, np.newaxis]
    ak_array_2_broadcasted = ak.broadcast_arrays(ak_array_2_one_index, ak_array_1)[0]
    distances = abs(ak_array_2_broadcasted - ak_array_1)
    output = ak.Array([ak.min(distances, axis=1)])
    if idx == 0:
        output_array = output
    else:
        output_array = ak.concatenate((output_array, output), axis=0)

# Inverting axis 0 and 1
output_array = ak.from_regular(ak.to_numpy(output_array).T)
# Restoring the jagged structure
filter_ = ak.fill_none((output_array != None), False)
output_array = output_array[filter_]
print(output_array)
```

(I wrote the loop over axis 1 because axis 0 has a large number of elements.) I also have a more general concern: in complicated cases, I have the impression that …
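To make the intended result concrete, here is the same computation spelled out with plain Python loops (plain lists stand in for the ak arrays; this is only a readability aid for the problem statement, not a proposed solution):

```python
# Plain-list stand-ins for ak_array_1 and ak_array_2 from the question.
ak_array_1 = [[1, 6], [2]]
ak_array_2 = [[0, 4, 7], [2]]

# For each element of ak_array_2, the distance to the closest element
# of ak_array_1 in the same (outer) list.
expected = [
    [min(abs(x - y) for x in list1) for y in list2]
    for list1, list2 in zip(ak_array_1, ak_array_2)
]
print(expected)  # [[1, 2, 1], [0]]
```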
On the larger point of what's easier, for-loops or array-oriented expressions, that's something I've been thinking about since the beginning (my original plans, before 2018, presumed that users would only want explicit loops, albeit in functional map/reduce form). Generally speaking, some things are easier to read (and construct) in an array-oriented way, others are easier with explicit loops. None of this is about what's easier for the computer; it's entirely about what fits with the human mind, and everybody has different subjective ideas about "easy" and "hard." Sometimes it's due to background (what you're more familiar with is going to look easier), but that's not 100% of it: there are some programming paradigms that I can't get used to, no matter how long I work with them. So here are two solutions to your problem, and I'll let you decide what's best for you.

## The array-oriented way

The thing you probably needed to know is that ak.cartesian has a `nested` parameter. First, the difference between the default

```python
>>> ak.cartesian([ak_array_2, ak_array_1]).tolist()
[[(0, 1), (0, 6), (4, 1), (4, 6), (7, 1), (7, 6)], [(2, 2)]]
```

and

```python
>>> ak.cartesian([ak_array_2, ak_array_1], nested=True).tolist()
[[[(0, 1), (0, 6)], [(4, 1), (4, 6)], [(7, 1), (7, 6)]], [[(2, 2)]]]
```

With `nested=True`, the pairs are grouped into sublists, one sublist per element of the first array. Now you want to find differences between the lefts and rights of those 2-tuples. That would be a "flat" operation, one that doesn't change any nesting structure, just does elementwise math. To do that, we pull the lefts and rights of the tuples into two arrays,

```python
>>> a2, a1 = ak.unzip(ak.cartesian([ak_array_2, ak_array_1], nested=True))
>>> a2
<Array [[[0, 0], [4, 4], [7, 7]], [[2]]] type='2 * var * var * int64'>
>>> a1
<Array [[[1, 6], [1, 6], [1, 6]], [[2]]] type='2 * var * var * int64'>
```

Note that `a2` and `a1` now have exactly the same nesting structure, because that's what makes it possible to put them both in a formula.

```python
>>> abs(a2 - a1)
<Array [[[1, 6], [3, 2], [6, 1]], [[0]]] type='2 * var * var * int64'>
```

Now you want a "reduce" (or "implode") operation: one that finds the minimum in each list. ak.min will do that if you give it the right `axis`:

```python
>>> ak.min(abs(a2 - a1), axis=-1)
<Array [[1, 2, 1], [0]] type='2 * var * ?int64'>
```

And there you go, that's the answer. The integer type is "optional" (question mark), meaning there could be missing values: if a list in ak_array_1 were empty, it would have no minimum.

## Recap

The above came with a lot of explanation, but the full thing, from start to finish, is just

```python
a2, a1 = ak.unzip(ak.cartesian([ak_array_2, ak_array_1], nested=True))
result = ak.min(abs(a2 - a1), axis=-1)
```

two lines. (It can't be comfortably done in one line; anything I can think of to shorten it further would only obfuscate the meaning.)

## The imperative way

As you pointed out, you can do it with for loops. To do it at scale (i.e. fast enough for large datasets), you'll want JIT-compilation: Numba. You probably also needed to know about ak.ArrayBuilder, an imperative way to make Awkward Arrays, or at least ak.unflatten, a way to add nested list structure to an originally flat array. First, let's do it outside of Numba to get the logic right.

```python
>>> builder = ak.ArrayBuilder()
>>> for list1, list2 in zip(ak_array_1, ak_array_2):
...     builder.begin_list()
...     for y in list2:
...         best = None
...         for x in list1:
...             if best is None or abs(x - y) < best:
...                 best = abs(x - y)
...         builder.append(best)
...     builder.end_list()
...
>>> result = builder.snapshot()
>>> result
<Array [[1, 2, 1], [0]] type='2 * var * int64'>
```

This code is also following some common (imperative) patterns: you compute a running extremum (min or max) by setting a "best so far" variable to some neutral value, like `None`, and replacing it whenever a new candidate beats it. Unlike the array-oriented solution, which constructed a Cartesian product, this solution iterates over a Cartesian product. Like the array-oriented solution, there's an asymmetry between the two arrays: the inner loop runs over `ak_array_1`, and one value is appended per element of `ak_array_2`.

The above works, but if you have any large datasets, you'll find that iteration over Awkward Arrays is not fast. Awkward Arrays have to do a lot of indirection to produce a given element as a Python object; it would be faster to just turn them into Python built-in objects with ak.to_list and iterate over those (which Python has optimized as much as it can) than to iterate over them as Awkward Arrays. But it would be some 100's of times faster to loop over compiled, low-level values than even Python built-ins, so that's what we'll do.

First, get Numba. That's the JIT-compiler. Awkward Array has Numba extensions so that we can iterate over our arrays in a Numba JIT-compiled function, but only iterate: array-oriented functions like ak.cartesian and ak.min can't be called inside the compiled function. We can put the entirety of the for loop in a JIT-compiled function, since it's just iterating over the array structure, and ArrayBuilder operations have also been extended in Numba (less efficiently: building NumPy outputs would be faster, if your output were rectilinear). Constructing the ArrayBuilder and taking its snapshot, however, have to happen outside the compiled function.

```python
>>> import numba as nb
>>> @nb.jit
... def compute(builder, array1, array2):
...     for list1, list2 in zip(array1, array2):
...         builder.begin_list()
...         for y in list2:
...             best = None
...             for x in list1:
...                 if best is None or abs(x - y) < best:
...                     best = abs(x - y)
...             builder.append(best)
...         builder.end_list()
...     return builder
...
>>> builder = compute(ak.ArrayBuilder(), ak_array_1, ak_array_2)
>>> result = builder.snapshot()
>>> result
<Array [[1, 2, 1], [0]] type='2 * var * int64'>
```

All the caveats I've been describing should be taken as a warning: arbitrary Python code is not possible in Numba, just a numeric subset. The supported features are documented in Numba's reference (the lists of supported Python and NumPy features). Python, as a language, wasn't designed to be statically compiled (that would be Julia), but if you're using Python and you need a way "out" to do some imperative number-crunching at scale, Numba's a great way to do it. (I once gave a tutorial on good ways of using Numba. Generally, JIT-compile small functions and only gradually add bells and whistles, so that you know which ones aren't allowed when you hit them. For instance, in the above, we'd be tempted to use some of ArrayBuilder's conveniences that aren't supported in compiled code.)

## Conclusion

For this particular case, I happen to think that the array-oriented solution is easier. However, it relies on "just knowing some stuff," like the existence of the `nested` parameter of ak.cartesian. Besides, there's no reason why a workflow couldn't include an array-oriented piece that feeds into an imperative piece that feeds into another array-oriented piece, etc. It's a big deal that the same data structure can be used for both, which wouldn't be the case with non-columnar data.