Scalable variable-length arrays #1396

jpivarski · 2022-04-04T20:06:17Z

jpivarski
Apr 4, 2022
Maintainer

@LunarLanding, your pydata/xarray#4118 (comment) looks pretty similar to Awkward Array, as @tacaswell has pointed out. Instead of the arbitrary graph of n-dimensional arrays, we have a tree structure of (mostly) 1-dimensional arrays, though the nodes in that tree are specialized to emulate generic data structures, like variable-length lists, records/structs, missing data, and heterogeneous unions.

It differs from the xarray Datatree proposal, which was the subject of that thread, in that Awkward Array scales to a very large number of datasets with different lengths. For instance, this array contains 134 million lists of different lengths and occupies about 4.5 GiB (4.8 GB) of RAM:

>>> # version 2.0 is hidden as an experimental submodule within version 1.8.0
>>> import awkward._v2 as ak
>>>
>>> # caution: 3.8 GB download
>>> array = ak.from_parquet("s3://pivarski-princeton/chep-2021-jagged-jagged-jagged/zlib9-jagged1.parquet")
>>> array
<Array [[-0.423, 2.34, ..., -0.298], ...] type='134217728 * var * float32'>
>>> array.show()
[[-0.423, 2.34, -0.757, 0.732, -2.63, ..., -0.129, -0.297, 0.597, -0.298],
 [-1.94, 0.835, -0.14, -0.742, -0.369, 1.8],
 [-0.467, 0.315, 0.472, 0.592, -1.14, ..., 0.421, -0.689, 0.875, -0.631, 0.505],
 [0.136, 0.375, 1.41],
 [-0.132, 0.113, -2.2, 0.943, -0.466, -1.16, -0.351, -0.866, 0.494, 0.159],
 [-0.563, 0.43, 0.843, -1.23, -0.305, 0.528, -0.0884, -0.77, 1.31, -0.653],
 [0.279, 2.25, 0.599, 0.857, 1.58, 0.557, -0.8, -0.459],
 [-0.0717, -0.776, -1.22, 1.07],
 [-2.27, 0.365, 0.977, 1.17, -0.141, 0.731, 0.171, -0.565, 1.94, 2.48],
 [-2.12, -0.0187, 1.19, -1.56, 0.165, -1.32, -0.19],
 ...,
 [0.637, 0.974, 0.338, -0.313, -0.239, 1.57, -0.0724],
 [0.632, 0.838, 0.542, -0.342, 0.43, -1.13, -1.53, 1, 0.398, -0.438],
 [1.17, -0.288, 0.0477, -0.656, -0.61, ..., 0.441, 0.142, -0.0544, -0.697],
 [0.661, 1.21, -0.111, -0.645],
 [0.455, 0.0988, 0.826, 0.196, 1.51, ..., 0.561, -0.456, -1.58, 0.608, 0.537],
 [-0.993, 0.708, 1.76, 0.186, -0.413, -0.538, 0.13, 0.0459],
 [0.417, 1.01, -1.29, -0.397],
 [-1.2, -0.913, 2.6, 1.47, -0.855, ..., 0.186, 1.14, -0.131, 1.09, -1.11],
 [-1.17, 1.41, 0.72]]
>>> len(array)
134217728
>>> ak.num(array)
<Array [11, 6, 13, 3, 10, 10, ..., 4, 12, 8, 4, 11, 3] type='134217728 * int64'>
>>> ak.mean(ak.num(array))
8.0
>>> ak.std(ak.num(array))
3.2404736136792627
>>> array.layout
<ListOffsetArray len='134217728'>
    <offsets><Index dtype='int32' len='134217729'>
        [         0         11         17 ... 1073741810 1073741821 1073741824]
    </Index></offsets>
    <content><NumpyArray dtype='float32' len='1073741824'>
        [-0.422626    2.344078   -0.7568038  ... -1.1683336   1.4103566
          0.71982837]
    </NumpyArray></content>
</ListOffsetArray>
>>> array.layout.offsets.data.nbytes
536870916
>>> array.layout.content.data.nbytes
4294967296
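
To make the layout above concrete, here is a minimal pure-Python sketch of the offsets-plus-content idea behind `ListOffsetArray`. It is not Awkward Array's implementation: the helper names (`to_offsets_and_content`, `get_list`, `counts`, `nbytes`) are hypothetical, and plain Python lists stand in for the contiguous `int32` and `float32` buffers shown in the session.

```python
from itertools import accumulate


def to_offsets_and_content(lists):
    """Flatten variable-length lists into (offsets, content).

    offsets has len(lists) + 1 entries; list i occupies
    content[offsets[i]:offsets[i + 1]].
    """
    offsets = [0, *accumulate(len(x) for x in lists)]
    content = [value for x in lists for value in x]
    return offsets, content


def get_list(offsets, content, i):
    """Recover list i by slicing the flat content buffer."""
    return content[offsets[i]:offsets[i + 1]]


def counts(offsets):
    """Length of every list: differences of adjacent offsets."""
    return [stop - start for start, stop in zip(offsets, offsets[1:])]


def nbytes(num_lists, num_values, offset_itemsize=4, value_itemsize=4):
    """Buffer size in bytes; the defaults assume int32 offsets and float32 values."""
    return (num_lists + 1) * offset_itemsize + num_values * value_itemsize


lists = [[1.1, 2.2, 3.3], [], [4.4, 5.5]]
offsets, content = to_offsets_and_content(lists)
print(offsets)                        # [0, 3, 3, 5]
print(get_list(offsets, content, 2))  # [4.4, 5.5]
print(counts(offsets))                # [3, 0, 2]

# Same arithmetic for the array in the session above:
# 536870916 bytes of offsets plus 4294967296 bytes of content.
total = nbytes(134217728, 1073741824)
print(total)                          # 4831838212
```

The cost per list is one extra offset, no matter how long the list is, which is why the number of lists can grow to hundreds of millions without per-list object overhead.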

It's also different from a sparse matrix in that a sparse matrix is a few different-length buffers that are used to represent a logically rectilinear array, whereas an Awkward Array is a set of different-length buffers that are used to represent a logically irregular array. But if you had a sparse matrix implementation on hand, you could probably use it to implement an Awkward Array.
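
As a rough illustration of that last point, the sketch below assumes a CSR-style (compressed sparse row) layout, which is only one of several sparse formats and is not named in the comment above. The function names are hypothetical and plain Python lists stand in for real buffers. The row-pointer buffer (`indptr`) plays the same role as a list-offsets buffer: read together with the column indices, the buffers describe a rectangular matrix; read without them, the same `data` and `indptr` describe variable-length lists.

```python
def dense_to_csr(matrix):
    """Convert a dense matrix (list of equal-length rows) to CSR-style buffers."""
    data, indices, indptr = [], [], [0]
    for row in matrix:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)   # nonzero values, row by row
                indices.append(col)  # column of each nonzero value
        indptr.append(len(data))     # where the next row starts in data
    return data, indices, indptr


def csr_to_dense(data, indices, indptr, num_cols):
    """Rectilinear reading: column indices place each value in a fixed-width row."""
    dense = []
    for start, stop in zip(indptr, indptr[1:]):
        row = [0] * num_cols
        for col, value in zip(indices[start:stop], data[start:stop]):
            row[col] = value
        dense.append(row)
    return dense


def rows_as_lists(data, indptr):
    """Irregular reading: ignore the column indices and treat indptr as offsets."""
    return [data[start:stop] for start, stop in zip(indptr, indptr[1:])]


matrix = [[0, 7, 0, 0],
          [0, 0, 0, 0],
          [5, 0, 0, 9]]
data, indices, indptr = dense_to_csr(matrix)
print(indptr)                                    # [0, 1, 1, 3]
print(rows_as_lists(data, indptr))               # [[7], [], [5, 9]]
print(csr_to_dense(data, indices, indptr, 4) == matrix)  # True
```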

I'd be interested to know if your use-case fits into this model, or if you're thinking of something still more general or just different from what we have here.
