ak.zip seems to work recursively, but it doesn't really #2363

Open
jpivarski opened this issue Apr 5, 2023 · 2 comments
Labels
policy Choice of behavior

Comments

@jpivarski
Member

In dask-contrib/dask-awkward/issues/213, @masonproffitt asked for

>>> a = ak.Array([1])
>>> ak.zip({'a': a, 'b': {'c': a}})
<Array [{a: 1, b: {...}}] type='1 * {a: int64, b: {c: var * int64}}'>

to work in dask-awkward as it does in Awkward. But ak.zip doesn't really do a nested zip; it just calls ak.to_layout on each of the values of the dict.

    if isinstance(arrays, dict):
        behavior = behavior_of(*arrays.values(), behavior=behavior)
        recordlookup = []
        layouts = []
        num_scalars = 0
        for n, x in arrays.items():
            recordlookup.append(n)
            try:
                layout = ak.operations.to_layout(
                    x, allow_record=False, allow_other=False
                )
            except TypeError:
                num_scalars += 1
                layout = ak.operations.to_layout(
                    [x], allow_record=False, allow_other=False
                )
            layouts.append(layout)

For a nested dict (a general container that is neither an Awkward Array nor an ndarray), that means it falls back to ak.from_iter, which (1) is slow, (2) ignores NumPy dtypes, and (3) doesn't zip: the type you get back is a struct of arrays rather than an array of structs.

>>> array = ak.zip({
...     "a": {"b": np.arange(10, dtype=np.int8), "c": np.arange(10, dtype=np.int16)},
...     "d": {"e": np.arange(10, dtype=np.int32), "f": np.arange(10, dtype=np.float32)},
... })
>>> array.show(type=True)
type: {
    a: {
        b: var * int64,
        c: var * int64
    },
    d: {
        e: var * int64,
        f: var * float64
    }
}
{a: {b: [0, 1, 2, 3, 4, ..., 6, 7, 8, 9], c: [0, 1, ..., 9]},
 d: {e: [0, 1, 2, 3, 4, ..., 6, 7, 8, 9], f: [0, 1, ..., 9]}}

whereas

>>> array2 = ak.zip({"b": np.arange(10, dtype=np.int8), "c": np.arange(10, dtype=np.int16)})
>>> array2.show(type=True)
type: 10 * {
    b: int8,
    c: int16
}
[{b: 0, c: 0},
 {b: 1, c: 1},
 {b: 2, c: 2},
 {b: 3, c: 3},
 {b: 4, c: 4},
 {b: 5, c: 5},
 {b: 6, c: 6},
 {b: 7, c: 7},
 {b: 8, c: 8},
 {b: 9, c: 9}]

You see all of the integer types turn into int64 and float32 into float64 because ak.from_iter treats them as Python int and float, which loses dtype. You also see a different structure for the nested object.

It's not obvious to me what the correct behavior is. Treating any expected array-like uniformly with ak.to_layout is good for consistency, but @masonproffitt's interpretation is natural, too.

Originally posted by @jpivarski in dask-contrib/dask-awkward#213 (comment)

@jpivarski jpivarski added the policy Choice of behavior label Apr 5, 2023
@agoose77
Collaborator

agoose77 commented May 9, 2023

I agree that this is a policy question.

I'd vote in favour of not recursively zipping, because ak.zip accepts useful parameters that might not apply identically to each nested call, e.g. the user may well want different depth_limit values at different levels. For simplicity and consistency, I'd prefer to require the user to call ak.zip multiple times.

@masonproffitt

> I'd vote in favour of not recursively zipping, because ak.zip accepts useful parameters that might not apply to each call to ak.zip identically, i.e. the user may well want different depth_limit values. For simplicity and consistency, I'd prefer to require the user to call ak.zip multiple times.

I have no problem with this being the default behavior, but I'd love to see automatic recursive zipping as a feature (maybe as an optional argument to ak.zip?). This is a pretty common use case in handling func-adl-uproot queries.
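
Until such an option exists, a user-side helper could approximate it. The sketch below is hypothetical and not part of Awkward Array: zip_nested is an invented name, and the zip function is passed in as an argument (for example ak.zip) so the recursion itself stays library-independent. It forwards the same keyword arguments to every level, which is exactly the limitation raised in the previous comment.

```python
from typing import Any, Callable, Mapping


def zip_nested(
    arrays: Mapping[str, Any],
    zip_fn: Callable[..., Any],
    **kwargs: Any,
) -> Any:
    """Hypothetical helper: zip inner dicts first, then zip this level.

    zip_fn is expected to accept a dict of arrays, e.g. ak.zip.
    The same kwargs are passed to every call, so per-level settings
    such as depth_limit cannot differ between levels.
    """
    zipped = {
        name: zip_nested(value, zip_fn, **kwargs)
        if isinstance(value, Mapping)
        else value
        for name, value in arrays.items()
    }
    return zip_fn(zipped, **kwargs)


# Intended usage, assuming awkward is imported as ak:
#   zip_nested({"a": a, "b": {"c": a}}, ak.zip)
```

Inner dicts are replaced by their zipped results before the outer call, so the zip function never sees a plain nested dict.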

Projects
Status: P3