Prevent reducers like ak.sum on records (v1 and v2) #1375
Note that until this gets fixed, users will need some way to add up arbitrary numbers of vectors per list in an Awkward Array. This example code does that with Numba:

```python
import time

import numpy as np
import awkward as ak
import numba as nb
import vector

vector.register_awkward()

fake_o = {"pt": 1.1, "phi": 2.2, "eta": 3.3, "mass": 10}
array = ak.Array(
    [
        [fake_o, fake_o, fake_o],
        [],
        [fake_o, fake_o],
        [fake_o, fake_o, fake_o, fake_o],
    ],
    with_name="Momentum4D",
)

@nb.jit
def add_lists_of_vectors(array):
    output_px = np.empty(len(array))
    output_py = np.empty(len(array))
    output_pz = np.empty(len(array))
    output_E = np.empty(len(array))
    for index, single_list in enumerate(array):
        sum_of_vectors = vector.obj(px=0.0, py=0.0, pz=0.0, E=0.0)
        for input_vector in single_list:
            sum_of_vectors = sum_of_vectors + input_vector
        output_px[index] = sum_of_vectors.px
        output_py[index] = sum_of_vectors.py
        output_pz[index] = sum_of_vectors.pz
        output_E[index] = sum_of_vectors.E
    return output_px, output_py, output_pz, output_E

starttime = time.time()
output_px, output_py, output_pz, output_E = add_lists_of_vectors(array)
print("compilation + runtime:", time.time() - starttime)

output = ak.zip(
    {"px": output_px, "py": output_py, "pz": output_pz, "E": output_E},
    with_name="Momentum4D",
)
print(output)
```

The use of Numba (or some compilation somewhere) is necessary. In some tests, I found that Numba's compilation time is 1.5 seconds, and runtime only begins to dominate when you multiply the length of the `array` by a large factor, but the pure Python runtime per fake-o group (i.e. drop the `@nb.jit`) is much larger. The reason the output is given as NumPy arrays for each component (`output_px`, `output_py`, `output_pz`, `output_E`) is that new Awkward Arrays can't be constructed inside a Numba-compiled function, so the components are zipped into a `Momentum4D` array after the function returns.
We implement this vector sum as a mixin method in coffea vector classes:

```python
def sum(self, axis=-1):
    """Sum an array of vectors elementwise using `x` and `y` components."""
    out = awkward.zip(
        {
            "x": awkward.sum(self.x, axis=axis),
            "y": awkward.sum(self.y, axis=axis),
        },
        with_name="TwoVector",
        highlevel=False,
    )
    return awkward.Array(out, behavior=self.behavior)
```

Unless I am missing something, I don't see why it would need to be done in Numba.
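As an aside, a minimal self-contained sketch of how a method like this can be attached to a record name with `ak.mixin_class` (an illustration, not coffea's actual registration code; the `TwoVector` name and `x`/`y` fields follow the snippet above):

```python
import awkward as ak

@ak.mixin_class(ak.behavior)
class TwoVector:
    def sum(self, axis=-1):
        # reduce each coordinate field separately, then zip the results
        # back into records carrying the same behavior
        out = ak.zip(
            {"x": ak.sum(self.x, axis=axis), "y": ak.sum(self.y, axis=axis)},
            with_name="TwoVector",
            highlevel=False,
        )
        return ak.Array(out, behavior=self.behavior)

vectors = ak.Array(
    [[{"x": 1.0, "y": 2.0}, {"x": 3.0, "y": 4.0}], [{"x": 5.0, "y": 6.0}]],
    with_name="TwoVector",
)
print(vectors.sum(axis=-1).tolist())
# [{'x': 4.0, 'y': 6.0}, {'x': 5.0, 'y': 6.0}]
```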
You're not missing anything; I just didn't think of that yesterday when I was talking with @kpedro88. What you have here works just as well. This issue is to prevent `ak.sum` from silently doing the wrong thing on records that don't override it.

In your uses, is `axis=-1` the only axis that matters? With

```python
array = [
    [vector_A, vector_B, vector_C],
    [vector_D, None, vector_E],
    [vector_F],
]
```

I guess there are cases when you'd want to get

```python
[vector_A + vector_D + vector_F, vector_B, vector_C + vector_E]
```

(an `axis=0` sum). I had been thinking of having the reducer-overload exclude axes that cross list boundaries like that.
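To make the difference between the two axis choices concrete, here is the same shape with plain numbers standing in for the vectors (illustrative only; reducers skip missing values):

```python
import awkward as ak

# numbers stand in for vector_A ... vector_F
array = ak.Array([[1, 2, 3], [4, None, 5], [6]])

print(ak.sum(array, axis=-1).tolist())  # [6, 9, 6]   one sum per inner list
print(ak.sum(array, axis=0).tolist())   # [11, 2, 8]  position-wise across lists
```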
The other difference is that the Numba solution adds the vectors in their native representation, while the Coffea solution requires converting to Cartesian, correct?
That's true, though I'm pretty sure that all of Vector's `+` operations convert to Cartesian internally anyway.

Nope! I'm wrong! If two vectors both have rho-phi azimuthal coordinates, they will be added in that system and the result will be in rho-phi. So, for instance, these vectors stay in rho-phi:

```python
>>> vector.obj(rho=1.1, phi=0.3) + vector.obj(rho=2.2, phi=-0.4)
vector.obj(rho=3.122793010504687, phi=-0.17108096774473536)
```

But anything else will go to x-y:

```python
>>> vector.obj(rho=1.1, phi=0.3) + vector.obj(x=2.2, y=4.4)
vector.obj(x=3.2508701380381666, y=4.725072227327474)
```

Longitudinal and temporal coordinates always convert to Cartesian before adding. (Currently. Changing that shouldn't break anything downstream if you always get coordinates from properties, rather than field values.)
If you're summing a bunch, then doing the Cartesian conversion once is more efficient anyway.
The two issues described here have been split: the part about adding the ability to override reducers in v2 is now issue #1423.
Version of Awkward Array: HEAD
Description and code to reproduce
@chadfreer just brought this up, but I seem to remember @lgray and/or @nsmith- asking about it a long time ago.
The issue is this: if we have records whose fields represent non-Cartesian data, they should not be naively added (or otherwise "reduced").
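A minimal illustration of the naive behavior this issue describes (values made up; this is how reducers behaved on versions predating the fix — afterward, it raises an error instead):

```python
import awkward as ak

# records whose fields happen to be polar coordinates, with no behavior attached
generic = ak.Array([[{"rho": 1.1, "phi": 0.3}, {"rho": 2.2, "phi": -0.4}]])

# before this fix, reducers descended into record fields, adding rho values
# to rho values and phi values to phi values independently:
print(ak.sum(generic, axis=-1))
# -> [{rho: 3.3, phi: -0.1}], which is not a meaningful vector sum
```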
Knowing that this is a generic array, we can shrug and say, "I guess it would do that. It has no way of knowing what `rho` and `phi` mean." But if it were a VectorAwkward2D, we'd really expect it to take the meaning of `rho` and `phi` into consideration when it adds them, because that's what `+` does:
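For instance, a sketch assuming Vector's `register_awkward()` behaviors (the `Vector2D` name and the values are illustrative):

```python
import awkward as ak
import vector

vector.register_awkward()

a = ak.Array([{"rho": 1.1, "phi": 0.3}], with_name="Vector2D")
b = ak.Array([{"rho": 2.2, "phi": -0.4}], with_name="Vector2D")

# with the behavior attached, + is genuine 2D vector addition,
# interpreting rho and phi as polar coordinates rather than plain fields
print((a + b).tolist())
```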
When either @lgray or @nsmith- brought this up, I said that we couldn't let third-party libraries override reducers like `ak.sum`, because they are implemented (in Awkward 1.x) in C++, where it would be very difficult to apply overrides implemented in Python. But in Awkward v2, the relevant code is Python, and it would be possible.

But there are two issues here:
1. Preventing `+` and not `ak.sum` is dangerously misleading: attempts to `ak.sum` records should be prevented for exactly the same reason as `+` is prevented between non-overridden records. That is, apply the rationale behind "Removing ufunc-broadcasting across record fields" (#457) to reducers, not just ufuncs.
2. In v2, third-party libraries could be allowed to override reducers: a behavior registration would supply a `special_function` (see the sketch after this list), and the `special_function` would be invoked somewhere in here, for ListOffsetArray for example. (Although I thought so while talking with @chadfreer, this can't be implemented through #1126, since this is a behavior that might be discovered deep inside an array, not at the level where a Python object exists.)
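A purely hypothetical sketch of what that registration could look like (the behavior key, the `special_function` signature, and the `"VectorArray2D"` wiring are assumptions for illustration, not an existing Awkward API):

```python
import awkward as ak

def special_function(vectors, axis, keepdims, mask_identity):
    # a library like Vector could implement the real reduction here,
    # e.g. convert to Cartesian components, sum them, and zip the
    # results back into records of the same name
    raise NotImplementedError

# hypothetical lookup key: (reducer name, record/array name)
ak.behavior["sum", "VectorArray2D"] = special_function
```

The reducer implementations would then check for such an entry whenever they encounter named records, instead of unconditionally reducing field-by-field.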
There would have to be some rules on when `special_function` is applied. Overrides for `axis` values shallower than the depth of the matched name (e.g. `"VectorArray2D"`) wouldn't make a lot of sense, since they cross list/record object boundaries (like the `axis=0` case above).

While thinking about expanding capabilities in v2 is fun, item (1) in the above list is the higher-priority item: we need to avoid misleading users into thinking that `ak.sum(geometric_vectors)` does what they might think it does. An error message is better than a wrong answer.

(This issue is a rare bug-feature-policy!)