Any equivalent operation like np.argwhere? #3015

Star9daisy · 2024-02-09T04:11:55Z

Star9daisy
Feb 9, 2024

Hi, developers of awkward,

I'm wondering if there is a function like np.argwhere to find the index of an element?

array = ak.from_iter([[1, 2, 3], [], [4, 5], [6, 7, 8, 9]])

# If there is that function
# ak.argwhere(array == 1)
# Expected returned value:
# [(0, 0)]

agoose77 · 2024-02-09T10:52:41Z

agoose77
Feb 9, 2024
Maintainer

The best answer to this is probably "what will you do with that index?".

Awkward Array has a powerful ragged indexing system that supports structure-preserving integer/boolean indices. We touch on using these indexing arrays in our user guide. The semantics of this "ragged indexing" are also mentioned in the API reference (for now).

For example, if you need to find the indices of even numbers:

import awkward as ak
array = ak.Array(
    [
        [1, 2, 3],
        [],
        [4, 5],
        [6, 7, 8, 9],
    ]
)
ix = ak.local_index(array)
ix_even = ix[array % 2 == 0]

This ix_even array can then be used to index into array (or any array with the same structure). Note that, if you were just interested in filtering array, you could skip the step that computes the index, i.e.

array_even = array[array % 2 == 0]

0 replies

jpivarski · 2024-02-09T14:42:59Z

jpivarski
Feb 9, 2024
Maintainer

@agoose77 is right; you might be trying to find a long way to do what could be done with a slice. But assuming that you need positions of matching indexes as tuples, it can be done in your two-dimensional example like this:

import numpy as np
import awkward as ak
array = ak.Array(
    [
        [1, 2, 3],
        [],
        [4, 5],
        [6, 7, 8, 9],
    ]
)

second = ak.local_index(array)[array % 2 == 0]    # [[1], [], [0], [0, 2]], what @agoose77 suggested
first = np.arange(len(second))                    # [ 0 ,  1,  2 ,     3 ]
first, _ = ak.broadcast_arrays(first, second)     # [[0], [], [2], [3, 3]]
result = ak.flatten(ak.zip((first, second)))      # [(0, 1), (2, 0), (3, 0), (3, 2)]

This technique would require a different number of steps for each number of dimensions. Also, if you actually wanted lists (contiguous data) instead of tuples (not contiguous), you can do it by concatenating instead of zipping:

ak.flatten(ak.concatenate((first[..., np.newaxis], second[..., np.newaxis]), axis=-1))
                                                  # [[0, 1], [2, 0], [3, 0], [3, 2]]

(Which one you want depends on how you're going to use it...)
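
Since the array-based recipe needs a different number of steps for each number of dimensions, a slow plain-Python reference can be handy for cross-checking small examples at any depth. This is only a sketch: argwhere_nested is a hypothetical helper, not part of Awkward Array, and it works on nested Python lists (for instance the output of array.to_list()).

```python
def argwhere_nested(data, predicate, prefix=()):
    """Return index tuples of all leaves in nested lists that satisfy predicate."""
    found = []
    for i, item in enumerate(data):
        if isinstance(item, list):
            # Descend one level, remembering the path taken so far
            found.extend(argwhere_nested(item, predicate, prefix + (i,)))
        elif predicate(item):
            found.append(prefix + (i,))
    return found


data = [[1, 2, 3], [], [4, 5], [6, 7, 8, 9]]
result = argwhere_nested(data, lambda x: x % 2 == 0)
print(result)  # [(0, 1), (2, 0), (3, 0), (3, 2)]
```

It loops in Python, so it is only suitable for testing, not for real event data.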

0 replies

Star9daisy · 2024-02-10T10:27:21Z

Star9daisy
Feb 10, 2024
Author

Thank you so much for replying so quickly! Yeah, it's all about what I'm gonna do with these indexes. Let me explain more:

Given the output ROOT file from Delphes in MadGraph5, I want to rebuild fatjets from their constituents, which is basically the same thing that was asked on Stack Overflow in 2019.

@jpivarski already answered it back then, but it's still kind of hard to follow step by step:

1. awkward is very different now from what it was at that time;
2. The "Tower.Particles", a TRefArray, mentioned in the answer refers to the unique indexes of the branch "Particles". However, "Constituents" refers to two different branches, "Tower" and "Track", so it's not that straightforward to use the TRefArray directly.

The questioner John Karkas gave some hints in the reply:

Delphes stores the IDs per event for objects...

It then hit me that there is a subbranch named "fUniqueID". After a few tries, I found that the "refs" of "FatJet.Constituents" store the unique ids, each of which corresponds to an EFlowTrack, an EFlowPhoton, or an EFlowNeutralHadron. So here's the reason behind my question: I need to find the corresponding indexes in one of these three branches according to the ids, so that I can build "real" constituents rather than "refs".

I need to search EFlowTrack, EFlowPhoton, and EFlowNeutralHadron to find the locations of the constituents' ids.

If it were a "one element's location in one array" problem, I would use np.argwhere to retrieve the indexes. But it's now more like "one array's locations in three other arrays". To avoid loops as much as possible, I did some searching and found the np.isin function. It is exactly what I want. Here's my final solution:

import numpy as np
import uproot
import awkward as ak
import vector

vector.register_awkward()

filepath = "../data/pp2wz/Events/run_01_decayed_1/tag_1_delphes_events.root"
events = uproot.open(f"{filepath}:Delphes")

all_ref_ids = events["FatJet.Constituents"].array()["refs"]
all_tracks = events["EFlowTrack.fUniqueID"].array()
all_photons = events["EFlowPhoton.fUniqueID"].array()
all_neutral_hadrons = events["EFlowNeutralHadron.fUniqueID"].array()

all_tracks = ak.zip(
    {
        "pt": events["EFlowTrack.PT"].array(),
        "eta": events["EFlowTrack.Eta"].array(),
        "phi": events["EFlowTrack.Phi"].array(),
        "mass": events["EFlowTrack.Mass"].array(),
        "id": all_tracks,
    },
    with_name="Momentum4D",
)
all_photons = ak.zip(
    {
        "pt": events["EFlowPhoton.ET"].array(),
        "eta": events["EFlowPhoton.Eta"].array(),
        "phi": events["EFlowPhoton.Phi"].array(),
        "mass": ak.zeros_like(events["EFlowPhoton.ET"].array()),
        "id": all_photons,
    },
    with_name="Momentum4D",
)
all_neutral_hadrons = ak.zip(
    {
        "pt": events["EFlowNeutralHadron.ET"].array(),
        "eta": events["EFlowNeutralHadron.Eta"].array(),
        "phi": events["EFlowNeutralHadron.Phi"].array(),
        "mass": ak.zeros_like(events["EFlowNeutralHadron.ET"].array()),
        "id": all_neutral_hadrons,
    },
    with_name="Momentum4D",
)

all_constituents = []

for tracks, photons, neutral_hadrons, ref_ids in zip(
    all_tracks, all_photons, all_neutral_hadrons, all_ref_ids
):
    constituents = []
    for ref_id in ref_ids:
        matched_tracks = tracks[np.isin(tracks.id, ref_id)]
        matched_photons = photons[np.isin(photons.id, ref_id)]
        matched_neutral_hadrons = neutral_hadrons[np.isin(neutral_hadrons.id, ref_id)]

        assert len(ref_id) == (
            len(matched_tracks) + len(matched_photons) + len(matched_neutral_hadrons)
        )

        constituents.append(
            ak.concatenate([matched_tracks, matched_photons, matched_neutral_hadrons])
        )

    all_constituents.append(constituents)

all_constituents = ak.from_iter(all_constituents)

The all_constituents now is:

[[],
 [[{pt: 0.48, eta: -0.126, phi: 0.389, mass: 0.14, id: 2857}, ..., {...}]],
 [[{pt: 9.77, eta: 0.418, phi: -2.74, mass: 0.14, id: 1313}, ..., {...}]],
 [[{pt: 0.321, eta: -0.489, phi: -2.66, mass: 0.14, id: 2303}, ..., {...}]],
 [[{pt: 0.914, eta: 1.62, phi: 2.06, mass: 0.14, id: 1957}, {...}, ..., {...}]],
 [[{pt: 0.688, eta: 1.88, phi: 1.55, mass: 0.14, id: 2178}, {...}, ..., {...}]],
 [[{pt: 1.23, eta: -0.111, phi: -1.55, mass: 0.494, id: 3379}, ...], ...],
 [[{pt: 2.18, eta: -0.152, phi: 1.75, mass: 0.14, id: 621}, {...}, ..., {...}]],
 [[{pt: 0.365, eta: -0.572, phi: -2.08, mass: 0.14, id: 2336}, ..., {...}]],
 [[{pt: 4.69, eta: -0.356, phi: -2.85, mass: 0.14, id: 1147}, ..., {...}]],
 ...,
 [[{pt: 8.91, eta: -0.63, phi: -1.28, mass: 0.14, id: 2677}, ..., {...}], ...],
 [[{pt: 0.83, eta: -1.64, phi: 2.98, mass: 0.14, id: 2047}, {...}, ..., {...}]],
 [[{pt: 0.23, eta: 1.85, phi: 2.04, mass: 0.14, id: 1388}, {...}, ..., {...}]],
 [[{pt: 0.386, eta: -1.13, phi: 2.44, mass: 0.494, id: 2611}, ..., {...}]],
 [[{pt: 1.35, eta: -0.959, phi: 2.32, mass: 0.14, id: 3270}, ..., {...}]],
 [[{pt: 0.422, eta: -2.32, phi: 2.9, mass: 0.938, id: 3551}, ..., {...}]],
 [[{pt: 4.21, eta: 0.422, phi: -1.63, mass: 0.14, id: 1950}, ..., {...}]],
 [[{pt: 1.52, eta: -0.3, phi: -1.96, mass: 0.494, id: 1257}, ..., {...}]],
 [[{pt: 20, eta: 0.404, phi: 2.64, mass: 0.14, id: 1640}, ..., {...}], ...]]
--------------------------------------------------------------------------------
type: 100 * var * var * {
    pt: float64,
    eta: float64,
    phi: float64,
    mass: float64,
    id: int64
}

The type can be read as: 100 events, var jets, var constituents. The last var is the constituents of each fatjet. Since each group of constituents is reclustered to rebuild its fatjet, the number of jets clustered from each group is always 1:

import fastjet as fj

particles = ak.zip(
    {
        "pt": all_constituents.pt,
        "eta": all_constituents.eta,
        "phi": all_constituents.phi,
        "mass": all_constituents.mass,
    },
    with_name="Momentum4D",
)

jet_def = fj.JetDefinition(fj.antikt_algorithm, 100.0)
cluster = fj.ClusterSequence(particles, jet_def)
jets = cluster.inclusive_jets()

# To index, for example, the second fatjet in the seventh event
# jets[6, 1, 0]
# the last index is always 0

I know it's still not "perfect": I have to loop twice, once over events and once over fatjets, since this is a reclustering of each fatjet rather than a clustering of the whole event (I'll post a clustering version that removes one loop). I haven't been using awkward for long, so if you have any better ideas, please let me know. Thank you!
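
For a single event, the np.isin matching used in the loop above can be tried out on toy ids. This is a minimal sketch with made-up values. It also shows one way to recover the position of each referenced id, under the assumption that ids are unique and every ref is present.

```python
import numpy as np

# Toy stand-ins for one event (values are made up for illustration)
ids = np.array([11, 12, 13, 14])   # e.g. fUniqueID of the candidates
refs = np.array([14, 12])          # e.g. ids referenced by one jet

# Which candidates are referenced? The order follows `ids`, not `refs`.
mask = np.isin(ids, refs)
print(mask.tolist())                  # [False, True, False, True]
print(np.flatnonzero(mask).tolist())  # [1, 3]

# Where does each ref sit in `ids`? The order follows `refs`.
# Assumes every ref is present in `ids` and that ids are unique.
order = np.argsort(ids)
positions = order[np.searchsorted(ids, refs, sorter=order)]
print(positions.tolist())             # [3, 1]
```

The mask is enough for selecting constituents, as in the loop above. The positions are only needed if the order of the refs matters.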

0 replies

Star9daisy · 2024-02-10T10:36:05Z

Star9daisy
Feb 10, 2024
Author

This is another version of the event loop that collects all constituents of one event and clusters jets:

all_constituents = []

for tracks, photons, neutral_hadrons, ref_ids in zip(
    all_tracks, all_photons, all_neutral_hadrons, all_ref_ids
):
    matched_tracks = tracks[np.isin(tracks.id, ak.flatten(ref_ids))]
    matched_photons = photons[np.isin(photons.id, ak.flatten(ref_ids))]
    matched_neutral_hadrons = neutral_hadrons[
        np.isin(neutral_hadrons.id, ak.flatten(ref_ids))
    ]

    constituents = ak.concatenate(
        [matched_tracks, matched_photons, matched_neutral_hadrons]
    )
    all_constituents.append(constituents)

This version doesn't differentiate between the constituents of different jets, so the type is "100 * var ...". Cluster the jets as in the previous version, then index, for example, the second fatjet in the seventh event:

jets[6, 1]

4 replies

agoose77 Feb 12, 2024
Maintainer

Right now we don't have an overload for np.isin. You could implement a 2D variant of the kernel in Numba. If you need to scale to larger dimensionality with axis=-1, then one can just use ak.transform for this.

Unfortunately it doesn't look like Numba implement an overload for isin, which is a shame because there are different performance characteristics according to the method used.

Here's a trivial kernel

import numba as nb


@nb.njit
def isin_kernel_2d(needle, haystack, result):
    for i in range(len(needle)):
        for y in haystack[i]:
            if needle[i] == y:
                result[i] = True
                break
        else:
            result[i] = False

all_track_is_in = isin_kernel_2d(all_tracks.id, ak.flatten(all_ref_ids, axis=-1))

but you'd really want to do something smarter e.g. at least binary search over a sorted haystack. NumPy implement a table method which can be used if you know the nature of the values in haystack (e.g. how dense they are, are they sorted, etc).

I looked into which functions pyarrow.compute provides, in case we can leverage an existing solution, but it appears that they only implement a 1D variant.
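
Note that the kernel above writes into a preallocated result array rather than returning one, so a call needs all three arguments. Here is a minimal sketch of a variant that allocates and returns the mask itself, run on toy data. The name isin_rows and the data are illustrative only, and the fallback to plain Python when Numba is not installed is just a convenience for trying it out.

```python
import numpy as np

try:
    import numba as nb
    njit = nb.njit
except ImportError:  # run uncompiled if Numba is unavailable
    def njit(func):
        return func


@njit
def isin_rows(needle, haystack):
    # needle: 1D, one value per row; haystack: 2D, one row per needle
    result = np.zeros(len(needle), dtype=np.bool_)
    for i in range(len(needle)):
        for y in haystack[i]:
            if needle[i] == y:
                result[i] = True
                break
    return result


needle = np.array([2, 9, 7])
haystack = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
found = isin_rows(needle, haystack)
print(found.tolist())  # [True, False, True]
```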

jpivarski Feb 12, 2024
Maintainer

If each haystack[i] is relatively small, the difference between your $\mathcal{O}(n)$ implementation¹ (in Numba, above) and a smarter $\mathcal{O}(\log n)$ implementation might not even matter. In fact, the $\mathcal{O}(\log n)$ implementation might be worse than the $\mathcal{O}(n)$ implementation.

What's the average size of haystack[i]?

Bisection search is $\mathcal{O}(\log n)$, but it only works for sorted lists and it's $\mathcal{O}(n \log n)$ (i.e. worse) if you have to sort the list. Similarly if you want to do a binary tree search and have to build the binary trees first.

And even if each haystack[i] is already sorted and you can immediately do a bisection search over each one of them, the branching code to do that search might thwart compiler optimizations. There's a turn-over point: above some average haystack[i] size, better time complexity always wins, but I'd expect that to be at least 50 items or so...²

¹ where $n$ is the length of each haystack[i].
² just a guess. I'm tempted to do an experiment, but I have a lot of emails to get through still...
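
To make the two strategies concrete, here is a small plain-Python sketch of a linear scan next to a bisection lookup, using the standard-library bisect module. The function names are illustrative only. It shows the logic, not the performance characteristics discussed above.

```python
from bisect import bisect_left


def linear_contains(haystack, needle):
    # O(n): look at every element until a match is found
    for x in haystack:
        if x == needle:
            return True
    return False


def bisect_contains(sorted_haystack, needle):
    # O(log n): only valid if the haystack is already sorted
    i = bisect_left(sorted_haystack, needle)
    return i < len(sorted_haystack) and sorted_haystack[i] == needle


haystack = [3, 8, 15, 16, 23, 42]  # must already be sorted for bisection
for needle in (15, 17):
    print(needle, linear_contains(haystack, needle), bisect_contains(haystack, needle))
# 15 True True
# 17 False False
```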

jpivarski Feb 12, 2024
Maintainer

So, the nerd-sniping thing again. I wanted to know how much it would matter because we tend to have a lot of problems in which the classical solutions optimize for one big X (one big matrix multiplication or inversion, one big minimization, etc.) and we have many small Xes. This is another case like that.

Here is code to search for the point at which bisection search (the classic solution for a big X = needle-in-a-haystack search) wins over linear search, which has the advantage of compiling down to simpler instructions and memory strides.

import time

import awkward as ak
import numpy as np
import numba as nb


def testy(haystack_size, inv_density=1.5, sample_size=10000000):
    sorted_numbers = np.random.poisson(inv_density, sample_size)
    np.cumsum(sorted_numbers, out=sorted_numbers)

    offsets = np.arange(0, len(sorted_numbers), haystack_size)
    haystacks = ak.Array(
        ak.contents.ListOffsetArray(
            ak.index.Index64(offsets),
            ak.contents.NumpyArray(sorted_numbers),
        )
    )

    needles = np.random.randint(0, haystack_size, len(offsets) - 1)
    needles = sorted_numbers[offsets[:-1] + needles]
    needles += np.random.randint(-1, 2, len(needles))  # don't make them all return True

    @nb.njit
    def ragged_isin(needles, haystacks):
        out = np.zeros(len(needles), dtype=np.bool_)
        for i in range(len(needles)):
            needle = needles[i]
            haystack = haystacks[i]
            for x in haystack:
                if needle == x:
                    out[i] = True
                    break
        return out

    @nb.njit
    def ragged_bisection_isin(needles, haystacks):
        out = np.zeros(len(needles), dtype=np.bool_)
        for i in range(len(needles)):
            needle = needles[i]
            haystack = haystacks[i]
            left = 0
            right = len(haystack) - 1
            while left <= right:
                mid = (left + right) >> 1
                if haystack[mid] == needle:
                    out[i] = True
                    break
                elif haystack[mid] < needle:
                    left = mid + 1
                else:
                    right = mid - 1
        return out

    a = ragged_isin(needles, haystacks)
    b = ragged_bisection_isin(needles, haystacks)

    assert (a == b).all()

    measurements = []
    for _ in range(10):
        before = time.perf_counter_ns()
        tmp = ragged_isin(needles, haystacks)
        after = time.perf_counter_ns()
        measurements.append(after - before)

    first_mean = np.mean(measurements) / len(needles)
    first_std = np.std(measurements) / len(needles)

    measurements = []
    for _ in range(10):
        before = time.perf_counter_ns()
        tmp = ragged_bisection_isin(needles, haystacks)
        after = time.perf_counter_ns()
        measurements.append(after - before)

    second_mean = np.mean(measurements) / len(needles)
    second_std = np.std(measurements) / len(needles)

    print(
        f"haystack_size: {haystack_size} linear: {first_mean:.3f} +- {first_std:.3f} ns, bisection: {second_mean:.3f} +- {second_std:.3f} ns"
    )


testy(1)
testy(2)
testy(5)
testy(10)
testy(20)
testy(50)
testy(100)
testy(200)
testy(500)
testy(1000)
testy(2000)
testy(5000)
testy(10000)
testy(20000)
testy(50000)
testy(100000)

And here are the results:

haystack_size: 1 linear: 3.582 +- 0.177 ns, bisection: 6.964 +- 0.024 ns
haystack_size: 2 linear: 4.399 +- 0.143 ns, bisection: 8.410 +- 0.032 ns
haystack_size: 5 linear: 5.615 +- 0.019 ns, bisection: 11.821 +- 0.064 ns
haystack_size: 10 linear: 6.889 +- 0.136 ns, bisection: 14.524 +- 0.030 ns
haystack_size: 20 linear: 9.450 +- 0.093 ns, bisection: 18.199 +- 0.072 ns
haystack_size: 50 linear: 18.108 +- 0.567 ns, bisection: 23.995 +- 0.394 ns
haystack_size: 100 linear: 37.471 +- 1.062 ns, bisection: 52.758 +- 0.983 ns
haystack_size: 200 linear: 93.914 +- 1.529 ns, bisection: 106.149 +- 2.265 ns
haystack_size: 500 linear: 215.370 +- 11.525 ns, bisection: 146.369 +- 15.453 ns
haystack_size: 1000 linear: 356.485 +- 6.946 ns, bisection: 125.843 +- 35.119 ns
haystack_size: 2000 linear: 624.713 +- 36.474 ns, bisection: 174.018 +- 30.987 ns
haystack_size: 5000 linear: 1304.450 +- 28.181 ns, bisection: 166.940 +- 55.463 ns
haystack_size: 10000 linear: 2553.567 +- 84.002 ns, bisection: 121.813 +- 83.126 ns
haystack_size: 20000 linear: 5088.757 +- 398.548 ns, bisection: 163.010 +- 169.516 ns
haystack_size: 50000 linear: 12526.477 +- 511.572 ns, bisection: 148.117 +- 138.095 ns
haystack_size: 100000 linear: 23110.413 +- 1033.834 ns, bisection: 209.931 +- 193.116 ns

It turns over somewhere between 50 and 100. My guess was a good one!

Actually, no, it was a lucky first trial of the other parameters. With inv_density=5.0 (probability of finding a match is much lower), the turn-over point is between 200 and 500:

haystack_size: 1 linear: 3.668 +- 0.148 ns, bisection: 6.976 +- 0.023 ns
haystack_size: 2 linear: 3.704 +- 0.134 ns, bisection: 7.670 +- 0.071 ns
haystack_size: 5 linear: 4.431 +- 0.030 ns, bisection: 11.712 +- 0.062 ns
haystack_size: 10 linear: 5.698 +- 0.088 ns, bisection: 14.712 +- 0.018 ns
haystack_size: 20 linear: 9.343 +- 0.770 ns, bisection: 18.905 +- 0.151 ns
haystack_size: 50 linear: 19.527 +- 1.575 ns, bisection: 25.500 +- 0.370 ns
haystack_size: 100 linear: 37.332 +- 1.221 ns, bisection: 53.944 +- 0.820 ns
haystack_size: 200 linear: 87.906 +- 3.063 ns, bisection: 110.730 +- 2.791 ns
haystack_size: 500 linear: 216.226 +- 15.276 ns, bisection: 154.511 +- 11.501 ns
haystack_size: 1000 linear: 363.567 +- 14.152 ns, bisection: 169.031 +- 24.139 ns
haystack_size: 2000 linear: 679.980 +- 26.190 ns, bisection: 183.090 +- 41.826 ns
haystack_size: 5000 linear: 1586.080 +- 73.245 ns, bisection: 179.635 +- 61.417 ns
haystack_size: 10000 linear: 3041.679 +- 192.377 ns, bisection: 162.697 +- 88.346 ns
haystack_size: 20000 linear: 5882.436 +- 186.369 ns, bisection: 136.003 +- 125.089 ns
haystack_size: 50000 linear: 15706.292 +- 627.261 ns, bisection: 178.308 +- 184.746 ns
haystack_size: 100000 linear: 30758.378 +- 2171.481 ns, bisection: 212.002 +- 181.188 ns

Removing the "don't make them all return True" line (so that the probability of finding a match is 100%), the turn-over point is pretty close to 200:

haystack_size: 1 linear: 1.127 +- 0.177 ns, bisection: 1.505 +- 0.006 ns
haystack_size: 2 linear: 4.244 +- 0.120 ns, bisection: 5.321 +- 0.038 ns
haystack_size: 5 linear: 7.433 +- 0.067 ns, bisection: 9.627 +- 0.067 ns
haystack_size: 10 linear: 8.991 +- 0.300 ns, bisection: 12.454 +- 0.054 ns
haystack_size: 20 linear: 11.678 +- 0.395 ns, bisection: 15.544 +- 0.123 ns
haystack_size: 50 linear: 19.318 +- 0.552 ns, bisection: 22.298 +- 0.413 ns
haystack_size: 100 linear: 44.263 +- 1.149 ns, bisection: 52.754 +- 1.195 ns
haystack_size: 200 linear: 108.139 +- 2.209 ns, bisection: 111.008 +- 1.986 ns
haystack_size: 500 linear: 230.535 +- 7.110 ns, bisection: 160.800 +- 15.092 ns
haystack_size: 1000 linear: 371.500 +- 28.397 ns, bisection: 169.795 +- 31.917 ns
haystack_size: 2000 linear: 632.418 +- 90.036 ns, bisection: 100.565 +- 58.754 ns
haystack_size: 5000 linear: 1212.563 +- 124.871 ns, bisection: 117.561 +- 63.530 ns
haystack_size: 10000 linear: 2282.108 +- 249.024 ns, bisection: 152.931 +- 126.055 ns
haystack_size: 20000 linear: 3637.732 +- 239.075 ns, bisection: 118.751 +- 107.246 ns
haystack_size: 50000 linear: 8607.739 +- 747.016 ns, bisection: 150.027 +- 143.799 ns
haystack_size: 100000 linear: 17485.448 +- 1070.246 ns, bisection: 200.325 +- 182.548 ns

So it depends on the probability of finding a match (naturally), but over that whole probability range, from nearly 0% to exactly 100%, the turn over is somewhere in the vicinity of 50 to 500. That order of magnitude.

If the average len(haystack[i]) is under 50, go for the linear search!

agoose77 Feb 13, 2024
Maintainer

@jpivarski thank you so much for doing this investigation! Whilst we have to be careful about extrapolating to other hardware etc, it's really nice to be able to point to this as a clear reference for "if you know your data, you should use X solution". I'll keep this bookmarked in my mind for future.

Star9daisy · 2024-02-15T04:19:34Z

Star9daisy
Feb 15, 2024
Author

Thank you so much @agoose77 @jpivarski. I've learned a lot from your code and discussions!

Here's a summary of the kinds of "in" operation, which may help me figure out which one I need:

Function	Explanation	Results
`in`	`<scalar> in <array>`	`<scalar>`
`np.isin`	`<1d array> isin <1d array>`	`<1d array>`
`ragged_isin` or `isin_kernel_2d`	`<1d array> isin <2d array>`	`<1d array>`

The newly proposed isin function applies to the clustering case: <1d array> is one of EFlowTrack, EFlowPhoton, EFlowNeutralHadron, and <2d array> is FatJet.Constituents, both for one event. Used this way, it looks like I still have to loop over all events, which does not seem to comply with the principle that if I use arrays, it's better to drop the loop. The output shape would also be better as 2d, since the input is 2d.

The code snippets helped me a lot in learning how to take advantage of numba (which I should have learned long ago...), so here's the my_isin that I've tried:

@nb.njit
def my_isin(array, test_array):
    results = []

    for record, test_record in zip(array, test_array):
        mask = np.zeros(len(record), dtype=np.bool_)

        for i in range(len(record)):
            for j in range(len(test_record)):
                if record[i] == test_record[j]:
                    mask[i] = True
                    break

        results.append(mask)

    return results

It assumes that inputs are stacked by records. Now the summary table is:

Function	Explanation	Results
`in`	`<scalar> in <array>`	`<scalar>`
`np.isin`	`<1d array> isin <1d array>`	`<1d array>`
`ragged_isin` or `isin_kernel_2d`	`<1d array> isin <2d array>`	`<1d array>`
`my_isin`	`<2d array> isin <2d array>`	`<2d array>`
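
For cross-checking my_isin on small inputs, the same result can be produced with one np.isin call per record. This is a sketch only: rowwise_isin is a hypothetical helper that takes nested Python lists, and it is exactly the per-record loop that the compiled version is meant to replace.

```python
import numpy as np


def rowwise_isin(array, test_array):
    # One np.isin call per record
    return [
        np.isin(np.asarray(record), np.asarray(test_record)).tolist()
        for record, test_record in zip(array, test_array)
    ]


array = [[1, 2, 3], [], [4, 5], [6, 7, 8, 9]]
test_array = [[0, 2], [], [4], [6, 8]]
masks = rowwise_isin(array, test_array)
print(masks)
# [[False, True, False], [], [True, False], [True, False, True, False]]
```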

Since the first dimension is the total number of records, it is essentially np.isin with a loop. Combined with the background of jets and their constituents, it can be explained as:

I don't care about which jet the constituents belong to. I want to cluster jets from all of them just like the Delphes does.

I also used the magic command %%timeit to measure the performance gain without the loop (#3015 (comment)):

Case	Time
no loop	55.1 ms ± 775 µs
with loop	456 ms ± 5.98 ms

Below is the complete code:

import awkward as ak
import fastjet as fj
import numba as nb
import numpy as np
import uproot
import vector

vector.register_awkward()

filepath = "../data/pp2wz/Events/run_01_decayed_1/tag_1_delphes_events.root"
events = uproot.open(f"{filepath}:Delphes")

all_ref_ids = events["FatJet.Constituents"].array()["refs"]
# New: flatten the last dimension to remove the jet grouping,
# 100 * var * var * int32 -> 100 * var * int32
all_ref_ids = ak.flatten(all_ref_ids, axis=-1)

all_tracks = events["EFlowTrack.fUniqueID"].array()
all_photons = events["EFlowPhoton.fUniqueID"].array()
all_neutral_hadrons = events["EFlowNeutralHadron.fUniqueID"].array()

all_tracks = ak.zip(
    {
        "pt": events["EFlowTrack.PT"].array(),
        "eta": events["EFlowTrack.Eta"].array(),
        "phi": events["EFlowTrack.Phi"].array(),
        "mass": events["EFlowTrack.Mass"].array(),
        "id": all_tracks,
    },
    with_name="Momentum4D",
)
all_photons = ak.zip(
    {
        "pt": events["EFlowPhoton.ET"].array(),
        "eta": events["EFlowPhoton.Eta"].array(),
        "phi": events["EFlowPhoton.Phi"].array(),
        "mass": ak.zeros_like(events["EFlowPhoton.ET"].array()),
        "id": all_photons,
    },
    with_name="Momentum4D",
)
all_neutral_hadrons = ak.zip(
    {
        "pt": events["EFlowNeutralHadron.ET"].array(),
        "eta": events["EFlowNeutralHadron.Eta"].array(),
        "phi": events["EFlowNeutralHadron.Phi"].array(),
        "mass": ak.zeros_like(events["EFlowNeutralHadron.ET"].array()),
        "id": all_neutral_hadrons,
    },
    with_name="Momentum4D",
)

@nb.njit
def my_isin(array, test_array):
    results = []

    for record, test_record in zip(array, test_array):
        mask = np.zeros(len(record), dtype=np.bool_)

        for i in range(len(record)):
            for j in range(len(test_record)):
                if record[i] == test_record[j]:
                    mask[i] = True
                    break

        results.append(mask)

    return results

matched_tracks = all_tracks[my_isin(all_tracks.id, all_ref_ids)]
matched_photons = all_photons[my_isin(all_photons.id, all_ref_ids)]
matched_neutral_hadrons = all_neutral_hadrons[
    my_isin(all_neutral_hadrons.id, all_ref_ids)
]

all_constituents = ak.concatenate(
    [matched_tracks, matched_photons, matched_neutral_hadrons], axis=1
)

particles = ak.zip(
    {
        "pt": all_constituents.pt,
        "eta": all_constituents.eta,
        "phi": all_constituents.phi,
        "mass": all_constituents.mass,
    },
    with_name="Momentum4D",
)

jet_def = fj.JetDefinition(fj.antikt_algorithm, 1.0)
cluster = fj.ClusterSequence(particles, jet_def)
jets = cluster.inclusive_jets()

print(f"pt: {jets[6, 0].pt}")
print(f"eta: {jets[6, 0].eta}")
print(f"phi: {jets[6, 0].phi}")
print(f"mass: {jets[6, 0].m}")

# pt: 563.0124337033961
# eta: 1.0792125821032512
# phi: 2.0480975146837075
# mass: 178.84373299160538

0 replies

Star9daisy · 2024-02-17T11:00:50Z

Star9daisy
Feb 17, 2024
Author

Since isin can only determine whether or not the elements of "array" are in "test_array" in an elementwise way, inspired by your answers, I improved the function my_isin as follows:

@nb.njit
def find_1d_in_1d(a, b):
    index_array = []

    for record, test_record in zip(a, b):
        indices = []
        for i in range(len(record)):
            for j in range(len(test_record)):
                if record[i] == test_record[j]:
                    indices.append(i)
                    break

        index_array.append(indices)

    return index_array

In the question example, it can be used like:

import awkward as ak

array = ak.Array(
    [
        [1, 2, 3],
        [],
        [4, 5],
        [6, 7, 8, 9],
    ]
)

test_array = ak.Array(
    [
        [0, 2],
        [],
        [4],
        [6, 8],
    ]
)

print(find_1d_in_1d(array, test_array))
# [[1], [], [0], [0, 2]]

# Index the corresponding elements in array
print(array[find_1d_in_1d(array, test_array)])
# [[2], [], [4], [6, 8]]

And this is the 2d case:

@nb.njit
def find_1d_in_2d(a, b):
    index_array = []

    for record_a, record_b in zip(a, b):
        indices_per_a = []

        for i in range(len(record_b)):
            indices_per_b = []
            for j in range(len(record_b[i])):
                for k in range(len(record_a)):
                    if record_b[i][j] == record_a[k]:
                        indices_per_b.append(k)

            indices_per_a.append(indices_per_b)
        index_array.append(indices_per_a)

    return index_array

import awkward as ak

array = ak.Array(
    [
        [1, 2, 3],
        [],
        [4, 5],
        [6, 7, 8, 9],
    ]
)

test_array = ak.Array(
    [
        [[0, 2], [1, 2, 3]],
        [[]],
        [[4]],
        [[6, 8]],
    ]
)

print(find_1d_in_2d(array, test_array))
# [[[1], [0, 1, 2]], [[]], [[0]], [[0, 2]]]
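
The nested indices above are one level deeper than array itself. To see which elements they select, here is a plain-Python sketch that applies them group by group. It works on nested lists and does not use any Awkward Array feature.

```python
array = [[1, 2, 3], [], [4, 5], [6, 7, 8, 9]]
index = [[[1], [0, 1, 2]], [[]], [[0]], [[0, 2]]]  # output shown above

# For each record, look up every group of indices in that record's row
selected = [
    [[row[k] for k in group] for group in groups]
    for row, groups in zip(array, index)
]
print(selected)
# [[[2], [1, 2, 3]], [[]], [[4]], [[6, 8]]]
```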

However, after checking link, it seems there's no "fancy indexing" in awkward like there is in numpy.

If awkward could support this feature, or if I could do something about it, that would be better. But I'm not sure this feature is needed in wider cases. In my situation, reclustering the constituents of jets also needs fastjet support, so this may be too much to finish though...

1 reply

agoose77 Feb 19, 2024
Maintainer

Awkward Array does implement a kind of fancy indexing -- if you build a ragged array with the same structure (ignoring regular vs ragged; you need at least one ragged dimension in the index array), you can pull out particular items from a given dimension. You've linked to the tutorial that demonstrates this.

I'm aware that you have more questions, but I'm also not entirely sure what you're asking / trying to do. So, let me give a bit more context to hopefully provide some guidance!

Awkward Array tries to provide tools for you to perform useful analyses over ragged arrays. We don't have all operations, e.g. np.isin, and in those cases you'll need to implement them yourself. There are two "classes" of ragged operations:

1. Operations that look at the exterior axis=-1
2. Operations that look at interior axis!=-1

Implementing custom variants of (1) is trivial if you can write a 2D kernel:

1. Implement the operation for the 2D case (where axis=-1 is axis=1).
2. Flatten / recurse until you have a 2D array.
3. Apply the operation.
4. Unflatten until you have your original structure.

You can perform manual flattening/unflattening operations yourself, or you can use ak.transform (which implements the recursion automatically).

Alternatively, you can just write a Numba function that handles an array of that dimensionality, i.e. a 2D function, a 3D function, etc. It's generally better to just write a 2D kernel and use recursion, though, because of all of the other features that Awkward supports (like option types) - these are free if you use recursion in most cases.

If you want to operate along interior axes, then it is more complicated. These are not just "flatten-transform-unflatten" operations, these are "swapaxes-flatten-transform-unflatten-unswapaxes", which we don't provide any constructs to implement yourself. However, in my experience, nearly all operations that need to be implemented with a custom kernel do not usually involve operations on interior axes.

So, if your question is: "how do I find the indices of elements from one array in another, and use those indices to restructure the array", then we definitely have constructs for this. You'll need to write the "is_in" kernel yourself, but the rest is just ak.transform.
