Definition of negative 'axis' in 'ak.local_index' and strings (and asking for better error messages). #716
-
I suspect that the error is correct for your first issue, though I'm open to improvements in the phrasing of the message. Is it because
>>> array = ak.Array([{"x": [1, 2, 3], "y": [[1, 2, 3], [], [4, 5]]}])
>>> print(ak.num(array, axis=-1))
[{x: 3, y: [3, 0, 2]}]
>>> print(ak.local_index(array, axis=-1))
[{x: [0, 1, 2], y: [[0, 1, 2], [], [0, 1]]}]
Whereas in NumPy, negative
>>> print(ak.num(array.x, axis=1), ak.num(array.y, axis=2))
[3] [[3, 0, 2]]
>>> print(ak.local_index(array.x, axis=1), ak.local_index(array.y, axis=2))
[[0, 1, 2]] [[[0, 1, 2], [], [0, 1]]]
In your case, it might be that
>>> jets[["pt", "eta", "phi", "m"]]
which would keep kinematics while dropping the subjet structure or associated leptons or whatever it is that has deeper jaggedness and is preventing the

As for the second issue, the extra dimension on the strings is the list that is the string itself. Strings are not special objects, they're a special interpretation of lists, but only some functions do special things with them. (Try looking at their

But maybe this is the wrong behavior? I could convert this issue into a Discussion if these two topics are things you want others to chime in on.
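The record example above shows `axis=-1` meaning `axis=1` for `x` and `axis=2` for `y`. As a rough illustration only, here is a plain-Python sketch of that per-field resolution. It uses nested lists and dicts as stand-ins for an Awkward Array; `depth` and `resolve_negative_axis` are hypothetical helper names, not part of the Awkward API, and this is not how the library implements it.

```python
def depth(node):
    """Count nested list levels; a non-list leaf has depth 0."""
    if isinstance(node, list):
        return 1 + max((depth(item) for item in node), default=0)
    return 0


def resolve_negative_axis(records, axis):
    """Map a negative axis to the positive axis it means for each field.

    `records` is a list of dicts standing in for an array of records.
    The outer list adds one dimension on top of each field's own depth.
    """
    resolved = {}
    for field, value in records[0].items():
        total_depth = 1 + depth(value)  # outer list + the field's own nesting
        resolved[field] = total_depth + axis
    return resolved


array = [{"x": [1, 2, 3], "y": [[1, 2, 3], [], [4, 5]]}]
positive_axes = resolve_negative_axis(array, -1)
print(positive_axes)  # {'x': 1, 'y': 2}
```

Because each field resolves the negative axis against its own depth, a single negative value can point at different positive axes in different branches of the same record.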
-
Thank you for the explanation. I agree these things aren't really bugs, but I feel it would be good to have more error messages and documentation to make the user aware of them. I also agree this issue should be converted to a discussion so other people can suggest how they would expect it to behave.

For the first issue, having looked at the jets again, I see there is a field (

For the second issue, I hadn't realised that Awkward treats strings as lists, as this is not the case in NumPy. I feel the current implementation would probably surprise a lot of people, but some more experienced users might want to be able to access the strings in this way. Maybe one could add a flag for whether to consider strings as lists, or just objects in the array, either in the

Also, maybe axis=-1 shouldn't be the default for
-
I'll make this a discussion. There are things to do here, but it would be good to get more input.

NumPy has two ways of representing strings: as fixed-width bytes (unencoded or UTF-32) and as Python objects. The wasted space in fixed-width strings is severe enough that Pandas defaults to Python objects. However, Python objects can only be used at the Python level (no calculations in C++ without making the C++ layer depend on Python headers).

Both of these dtypes can be passed through an Awkward Array, but there are many places where I need to do something with the dtype of an array and only support a reasonable set (booleans, numbers, now including complex, and hopefully soon date-times). Fixed-width bytes and Python object pointers are not included in that set, so as a user, it might work at first, but soon you'd run into something unsupported.

But considering that both fixed-width bytestrings and Python objects are highly wasteful, I think we should keep using variable-length strings, as Arrow and Parquet do, but make more methods aware of them. Since negative index handling happens in one place, we could make a string's internal dimension never contribute to the "number of dimensions" used when calculating a negative
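To make the storage trade-off concrete, here is a simplified plain-Python model of variable-length strings stored as one contiguous byte buffer plus offsets, which is the general idea behind Arrow's string layout. The helper names and the 64-bit offset width are assumptions for illustration; this is not Awkward's actual internals.

```python
def to_offsets_and_content(strings):
    """Pack strings into one UTF-8 byte buffer plus start/stop offsets."""
    content = b""
    offsets = [0]
    for s in strings:
        content += s.encode("utf-8")
        offsets.append(len(content))
    return offsets, content


def get_string(offsets, content, i):
    """Recover string i by slicing the buffer between adjacent offsets."""
    return content[offsets[i]:offsets[i + 1]].decode("utf-8")


strings = ["a", "", "awkward"]
offsets, content = to_offsets_and_content(strings)
print(offsets)   # [0, 1, 1, 8]
print(content)   # b'aawkward'

# Fixed-width UTF-32: every slot is as wide as the longest string.
fixed_width_bytes = len(strings) * max(len(s) for s in strings) * 4
# Variable-length: the bytes actually used, plus (assumed) 8-byte offsets.
variable_bytes = len(content) + 8 * len(offsets)
print(fixed_width_bytes, variable_bytes)  # 84 40
```

In this model each string is just a slice of a list of bytes, which is why a string contributes its own list-like dimension unless a function is told to treat it specially.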
-
PR #737 fixes your original issue:
>>> arr = ak.from_iter([["a", "b", "c"], [], ["d", "e"]])
>>> ak.local_index(arr, axis=-1)
<Array [[0, 1, 2], [], [0, 1]] type='3 * var * int64'>
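The difference between counting and not counting a string's internal dimension can be sketched in plain Python. This is only a model using nested lists; `local_index_last_axis` and its `strings_are_lists` flag are hypothetical names for illustration, not Awkward functions or parameters.

```python
def local_index_last_axis(node, strings_are_lists):
    """Index positions along the innermost axis of nested Python lists.

    If strings_are_lists is True, each string counts as a list of
    characters and so becomes the innermost axis.
    """
    if isinstance(node, str):
        # Only reached when strings are treated as a dimension.
        return list(range(len(node)))
    nested_types = (list, str) if strings_are_lists else (list,)
    if any(isinstance(item, nested_types) for item in node):
        return [local_index_last_axis(item, strings_are_lists) for item in node]
    return list(range(len(node)))


arr = [["a", "b", "c"], [], ["d", "e"]]

# Strings as opaque items: the innermost axis is the list of strings.
print(local_index_last_axis(arr, strings_are_lists=False))
# [[0, 1, 2], [], [0, 1]]

# Strings as lists of characters: one extra level of nesting appears.
print(local_index_last_axis(arr, strings_are_lists=True))
# [[[0], [0], [0]], [], [[0], [0]]]
```

The first call reproduces the shape of the output shown above; the second shows the kind of extra axis described in the original report, under the assumptions of this simplified model.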
-
I've encountered two (possibly related) bugs when using negative axis indices in local_index (with awkward version 1.1.0rc2).

The first is that ak.local_index(jets, axis=-1), where "jets" comes from the coffea Nanoevents schema, gives an error:
*** ValueError: axis == -1 exceeds the min depth == 1 of this array
However, ak.local_index(jets, axis=1) and ak.local_index(jets.pt, axis=-1) both work and give the same output as I would expect from ak.local_index(jets, axis=-1). I wasn't able to directly reproduce this problem outside of a coffea processor, as it didn't seem to occur for simpler classes like Lorentz vectors. If you think this is actually a problem with coffea's Nanoevents rather than an Awkward problem, I can make an issue there.

The second issue, which I found while trying to reproduce the first, is that for an array containing only strings, local_index seems to find another axis:
If the array is made of integers, the output is what I would expect:
And if I mix ints and strings, I get the same error as for the jets:
Thanks,
Dominic