dak.zip support for nested dictionaries #213

Open
masonproffitt opened this issue Apr 5, 2023 · 2 comments
Comments

@masonproffitt
Contributor

ak.zip() accepts dicts that contain other dicts, which is convenient for building nested structures on the fly. For example:

>>> import awkward as ak, dask_awkward as dak
>>> a = ak.Array([1])
>>> ak.zip({'a': a, 'b': {'c': a}})
<Array [{a: 1, b: {...}}] type='1 * {a: int64, b: {c: var * int64}}'>

This doesn't work for dak.zip(), however:

>>> da = dak.from_awkward(a, 1)
>>> dak.zip({'a': da, 'b': {'c': da}})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/iris-hep/src/dask-awkward/src/dask_awkward/lib/structure.py", line 1086, in zip
    metadict[k] = coll._meta
AttributeError: 'dict' object has no attribute '_meta'

I can get around this by nesting dak.zip() calls:

>>> dak.zip({'a': da, 'b': dak.zip({'c': da})}).compute()
<Array [{a: 1, b: {...}}] type='1 * {a: int64, b: {c: int64}}'>

but it would certainly be more convenient if the first case worked.
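
Until dak.zip() handles this case itself, the nesting workaround can be automated. The following is only a minimal sketch: `zip_nested` is a hypothetical helper (it is not part of awkward or dask-awkward) that zips the innermost dicts first and takes the zip function as a parameter, so the same code could be used with either ak.zip or dak.zip.

```python
from collections.abc import Mapping
from typing import Any, Callable


def zip_nested(fields: Mapping[str, Any], zip_func: Callable[[dict], Any]) -> Any:
    """Zip a possibly nested mapping, innermost mappings first.

    `zip_nested` is a hypothetical helper, not a library function.
    `zip_func` is whatever should combine a flat dict of arrays,
    for example ak.zip or dak.zip.
    """
    flat = {}
    for name, value in fields.items():
        if isinstance(value, Mapping):
            # Replace the nested dict with the result of zipping it,
            # mirroring the manual dak.zip({'c': da}) call.
            value = zip_nested(value, zip_func)
        flat[name] = value
    # At this point no value is a dict, so zip_func only sees a flat dict.
    return zip_func(flat)
```

Under those assumptions, `zip_nested({'a': da, 'b': {'c': da}}, dak.zip)` should build the same result as the nested dak.zip() call above.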

@douglasdavis
Collaborator

douglasdavis commented Apr 5, 2023

Ah good find, another path to handle for zip :) Thanks for sharing the workaround!

@jpivarski
Collaborator

You know, that worked accidentally. We should explore this more in Awkward itself to make sure that it isn't doing misleading things.

Like, ak.zip's dict case just calls ak.to_layout on each of the values of the dict.

https://github.com/scikit-hep/awkward/blob/6a24ed0d436bcd158f634d9bd9f6d664fff6bd2b/src/awkward/operations/ak_zip.py#L174-L190

If it sees another dict, that might mean that it switches over to ak.from_iter, which wouldn't be what you want if the arrays are large. It also wouldn't be the right type: struct of arrays versus array of structs.

Yeah, as I thought:

>>> import numpy as np
>>> array = ak.zip({
...     "a": {"b": np.arange(10, dtype=np.int8), "c": np.arange(10, dtype=np.int16)},
...     "d": {"e": np.arange(10, dtype=np.int32), "f": np.arange(10, dtype=np.float32)},
... })
>>> array.show(type=True)
type: {
    a: {
        b: var * int64,
        c: var * int64
    },
    d: {
        e: var * int64,
        f: var * float64
    }
}
{a: {b: [0, 1, 2, 3, 4, ..., 6, 7, 8, 9], c: [0, 1, ..., 9]},
 d: {e: [0, 1, 2, 3, 4, ..., 6, 7, 8, 9], f: [0, 1, ..., 9]}}

whereas

>>> array2 = ak.zip({"b": np.arange(10, dtype=np.int8), "c": np.arange(10, dtype=np.int16)})
>>> array2.show(type=True)
type: 10 * {
    b: int8,
    c: int16
}
[{b: 0, c: 0},
 {b: 1, c: 1},
 {b: 2, c: 2},
 {b: 3, c: 3},
 {b: 4, c: 4},
 {b: 5, c: 5},
 {b: 6, c: 6},
 {b: 7, c: 7},
 {b: 8, c: 8},
 {b: 9, c: 9}]

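You see all of the integer types turn into int64, and float32 into float64, because ak.from_iter treats the values as Python int and float, which loses the dtype. You also see a different structure for the nested object: a single record whose fields are variable-length lists, rather than an array of 10 records.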

I'll raise this as an issue in Awkward: ak.zip appears to work recursively, but it doesn't, and its current behavior is misleading. We can continue the discussion there about what the right behavior is; it's not obvious to me. Treating every expected array-like uniformly with ak.to_layout is good for consistency, but your interpretation is natural, too.


3 participants