Dear experts, I cannot find a straightforward way to perform a particular calculation with ak array operations. I was wondering whether I am missing something, or whether you could point me to the relevant functions for such a task. The description of the problem is below. I have the following two ak arrays, for instance:

```python
ak_array_1 = ak.Array([[1, 6], [2]])
ak_array_2 = ak.Array([[0, 4, 7], [2]])
```

For each element of `ak_array_2`, I would like the distance to the closest element of `ak_array_1` in the same list: 0 is closest to 1 and the difference is 1, 4 is closest to 6 and the difference is 2, etc. I found a solution (see below), but it is quite convoluted, and I was wondering if there exists a simpler one:

```python
for idx in range(ak.max(ak.num(ak_array_2, axis=1), axis=0)):
    ak_array_2_masked = ak.mask(ak_array_2, ak.count(ak_array_2, axis=1) > idx)
    ak_array_2_one_index = ak_array_2_masked[:, idx][:, np.newaxis]
    ak_array_2_broadcasted = ak.broadcast_arrays(ak_array_2_one_index, ak_array_1)[0]
    distances = abs(ak_array_2_broadcasted - ak_array_1)
    output = ak.Array([ak.min(distances, axis=1)])
    if idx == 0:
        output_array = output
    else:
        output_array = ak.concatenate((output_array, output), axis=0)

# Inverting axis 0 and 1
output_array = ak.from_regular(ak.to_numpy(output_array).T)
# Restoring the jagged structure
filter_ = ak.fill_none((output_array != None), False)
output_array = output_array[filter_]
print(output_array)
```

(I wrote the loop over axis 1 because axis 0 has a large number of elements.) I also have a more general concern: in complicated cases, I have the impression that …
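To make the intended result concrete, here is the same computation spelled out with plain Python loops (plain lists stand in for the ak arrays; this is only a readability aid for the problem statement, not a proposed solution):

```python
# Plain-list stand-ins for ak_array_1 and ak_array_2 from the question.
ak_array_1 = [[1, 6], [2]]
ak_array_2 = [[0, 4, 7], [2]]

# For each element of ak_array_2, the distance to the closest element
# of ak_array_1 in the same (outer) list.
expected = [
    [min(abs(x - y) for x in list1) for y in list2]
    for list1, list2 in zip(ak_array_1, ak_array_2)
]
print(expected)  # [[1, 2, 1], [0]]
```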
On the larger point of what's easier, for-loops or array-oriented expressions, that's something I've been thinking about since the beginning (my original plans, before 2018, presumed that users would only want explicit loops, albeit in functional map/reduce form). Generally speaking, some things are easier to read (and construct) in an array-oriented way, others are easier with explicit loops. None of this is about what's easier for the computer; it's entirely about what fits with the human mind, and everybody has different subjective ideas about "easy" and "hard." Sometimes it's due to background (what you're more familiar with is going to look easier), but that's not 100% of it: there are some programming paradigms that I can't get used to, no matter how long I work with them. So here are two solutions to your problem, and I'll let you decide what's best for you.

## The array-oriented way

The thing you probably needed to know is that ak.cartesian has a `nested` parameter. First, the difference between the default

```python
>>> ak.cartesian([ak_array_2, ak_array_1]).tolist()
[[(0, 1), (0, 6), (4, 1), (4, 6), (7, 1), (7, 6)], [(2, 2)]]
```

and

```python
>>> ak.cartesian([ak_array_2, ak_array_1], nested=True).tolist()
[[[(0, 1), (0, 6)], [(4, 1), (4, 6)], [(7, 1), (7, 6)]], [[(2, 2)]]]
```

With `nested=True`, the pairs are grouped into sublists, one sublist per element of the first array. Now you want to find differences between the lefts and rights of those 2-tuples. That would be a "flat" operation, one that doesn't change any nesting structure, just does elementwise math. To do that, we pull the lefts and rights of the tuples into two arrays,

```python
>>> a2, a1 = ak.unzip(ak.cartesian([ak_array_2, ak_array_1], nested=True))
>>> a2
<Array [[[0, 0], [4, 4], [7, 7]], [[2]]] type='2 * var * var * int64'>
>>> a1
<Array [[[1, 6], [1, 6], [1, 6]], [[2]]] type='2 * var * var * int64'>
```

Note that `a2` and `a1` now have exactly the same nesting structure, because that's what makes it possible to put them both in a formula.

```python
>>> abs(a2 - a1)
<Array [[[1, 6], [3, 2], [6, 1]], [[0]]] type='2 * var * var * int64'>
```

Now you want a "reduce" (or "implode") operation: one that finds the minimum in each list. ak.min will do that if you give it the right `axis`:

```python
>>> ak.min(abs(a2 - a1), axis=-1)
<Array [[1, 2, 1], [0]] type='2 * var * ?int64'>
```

And there you go, that's the answer. The integer type is "optional" (question mark), meaning there could be missing values: if a list in ak_array_1 were empty, it would have no minimum.

## Recap

The above came with a lot of explanation, but the full thing, from start to finish, is just

```python
a2, a1 = ak.unzip(ak.cartesian([ak_array_2, ak_array_1], nested=True))
result = ak.min(abs(a2 - a1), axis=-1)
```

two lines. (It can't be comfortably done in one line; anything I can think of to shorten it further would only obfuscate the meaning.)

## The imperative way

As you pointed out, you can do it with for loops. To do it at scale (i.e. fast enough for large datasets), you'll want JIT-compilation: Numba. You probably also needed to know about ak.ArrayBuilder, an imperative way to make Awkward Arrays, or at least ak.unflatten, a way to add nested list structure to an originally flat array. First, let's do it outside of Numba to get the logic right.

```python
>>> builder = ak.ArrayBuilder()
>>> for list1, list2 in zip(ak_array_1, ak_array_2):
...     builder.begin_list()
...     for y in list2:
...         best = None
...         for x in list1:
...             if best is None or abs(x - y) < best:
...                 best = abs(x - y)
...         builder.append(best)
...     builder.end_list()
...
>>> result = builder.snapshot()
>>> result
<Array [[1, 2, 1], [0]] type='2 * var * int64'>
```

This code is also following some common (imperative) patterns: you compute a running extremum (min or max) by setting a "best so far" variable to some neutral value, like `None`, and replacing it whenever a new candidate beats it. Unlike the array-oriented solution, which constructed a Cartesian product, this solution iterates over a Cartesian product. Like the array-oriented solution, there's an asymmetry between the two arrays: the inner loop runs over `ak_array_1`, and one value is appended per element of `ak_array_2`.

The above works, but if you have any large datasets, you'll find that iteration over Awkward Arrays is not fast. Awkward Arrays have to do a lot of indirection to produce a given element as a Python object; it would be faster to just turn them into Python built-in objects with ak.to_list and iterate over those (which Python has optimized as much as it can) than to iterate over them as Awkward Arrays. But it would be some 100's of times faster to loop over compiled, low-level values than even Python built-ins, so that's what we'll do.

First, get Numba. That's the JIT-compiler. Awkward Array has Numba extensions so that we can iterate over our arrays in a Numba JIT-compiled function, but only iterate: array-oriented functions like ak.cartesian and ak.min can't be called inside the compiled function. We can put the entirety of the for loop in a JIT-compiled function, since it's just iterating over the array structure, and ArrayBuilder operations have also been extended in Numba (less efficiently: building NumPy outputs would be faster, if your output were rectilinear). Constructing the ArrayBuilder and taking its snapshot, however, have to happen outside the compiled function.

```python
>>> import numba as nb
>>> @nb.jit
... def compute(builder, array1, array2):
...     for list1, list2 in zip(array1, array2):
...         builder.begin_list()
...         for y in list2:
...             best = None
...             for x in list1:
...                 if best is None or abs(x - y) < best:
...                     best = abs(x - y)
...             builder.append(best)
...         builder.end_list()
...     return builder
...
>>> builder = compute(ak.ArrayBuilder(), ak_array_1, ak_array_2)
>>> result = builder.snapshot()
>>> result
<Array [[1, 2, 1], [0]] type='2 * var * int64'>
```

All the caveats I've been describing should be taken as a warning: arbitrary Python code is not possible in Numba, just a numeric subset. The supported features are documented in Numba's reference (the lists of supported Python and NumPy features). Python, as a language, wasn't designed to be statically compiled (that would be Julia), but if you're using Python and you need a way "out" to do some imperative number-crunching at scale, Numba's a great way to do it. (I once gave a tutorial on good ways of using Numba. Generally, JIT-compile small functions and only gradually add bells and whistles, so that you know which ones aren't allowed when you hit them. For instance, in the above, we'd be tempted to use some of ArrayBuilder's conveniences that aren't supported in compiled code.)

## Conclusion

For this particular case, I happen to think that the array-oriented solution is easier. However, it relies on "just knowing some stuff," like the existence of the `nested` parameter of ak.cartesian. Besides, there's no reason why a workflow couldn't include an array-oriented piece that feeds into an imperative piece that feeds into another array-oriented piece, etc. It's a big deal that the same data structure can be used for both, which wouldn't be the case with non-columnar data.