Using Awkward Arrays in or with Xarray #27

NumesSanguis · 2019-12-03T08:25:01Z

NumesSanguis
Dec 3, 2019

Not sure if here or the Google doc is better (but no support for Markdown), so feel free to move it.

When searching for more user-friendly approaches to arrays, I came across xarray, which allows labelled selection on N-dimensional arrays: http://xarray.pydata.org/en/stable/why-xarray.html . While this library extends the Pandas model to N-dimensional arrays, it is still limited to rectangular arrays.
From their documentation:

DataArray is our implementation of a labeled, N-dimensional array. It is an N-D generalization of a pandas.Series. The name DataArray itself is borrowed from Fernando Perez’s datarray project, which prototyped a similar data structure.

Dataset is a multi-dimensional, in-memory array database. It is a dict-like container of DataArray objects aligned along any number of shared dimensions, and serves a similar purpose in xarray to the pandas.DataFrame.

Why non-named Tensors (deep learning) are harmful: http://nlp.seas.harvard.edu/NamedTensor
A blog post about xarray in the scientific community: https://medium.com/pangeo/thoughts-on-the-state-of-xarray-within-the-broader-scientific-python-ecosystem-5cee3c59cd2b

After using awkward-array for a bit, I feel there are some similarities with Table:

import awkward as aw  # awkward-array 0.x
import numpy as np
import xarray as xr

aw_arr = aw.fromiter([{'foo': np.array([1, 2, 3]), 'bar': ('x', [1, 2]), 'baz': np.pi}])
aw_arr.tolist()
# [{'bar': ['x', [1, 2]], 'baz': 3.141592653589793, 'foo': [1, 2, 3]}]

ds = xr.Dataset({'foo': np.array([1, 2, 3]), 'bar': ('x', [1, 2]), 'baz': np.pi})
ds
# <xarray.Dataset>
# Dimensions:  (foo: 3, x: 2)
# Coordinates:
#   * foo      (foo) int64 1 2 3
# Dimensions without coordinates: x
# Data variables:
#     bar      (x) int64 1 2
#     baz      float64 3.142

Some possible advantages of xarray:

It is getting popular with 1.5k stars: https://github.com/pydata/xarray
xarray has uptake in the Geosciences
Closely integrated with Pandas: instead of returning a DataFrame with a MultiIndex for multi-dimensional data, you could return an xarray.Dataset?
xarray has integration with Dask, which is what awkward also has on the roadmap
Supports netCDF (netCDF-4 files are HDF5 files that carry self-describing metadata)
There is discussion of using PyTorch as a backend, which would allow GPU computation and therefore help with Awkward's goal of CPU and GPU computation.

I'm not sure exactly how Awkward Array and xarray would integrate, but I think it deserves some consideration. Maybe full integration isn't even possible, but some elements could still be useful?
What are your thoughts on it?

jpivarski · 2019-12-03T12:43:28Z

jpivarski
Dec 3, 2019
Maintainer

This is a good place to discuss (the Google Doc is getting stale as real development surpasses the planning). And congrats on the first issue!

Although I haven't explicitly used xarray, I know about it and its place in the ecosystem. Whereas Pandas represents two-dimensional, indexed tables (or more dimensions, sparsely, through MultiIndex), xarray represents N-dimensional, indexed arrays (or "tensors," as we've been calling N-dimensional arrays these days). Adding index keys of any number of dimensions is incredibly useful for analysis, as it introduces a whole suite of join-like operations that are known to be useful from the SQL world.

Awkward array doesn't conflict in scope or purpose with Pandas or xarray—it does something they don't: arbitrary data structures. (In a "first class" sense; Pandas can put Python data in cells but can't do vector operations on them the way that Awkward can.) So, as a Venn diagram, Awkward is not a subset of Pandas/xarray.

But knowing that indexed keys are so useful, they've been integrated into the plan. In Awkward, the index column is called awkward::Identity. It's a much simpler structure than Pandas's indexes (the Venn diagram doesn't fully overlap either way), but it has enough information to ensure that row identity isn't lost in a conversion from Pandas/xarray to Awkward. (That is, it's isomorphic to a Pandas Index; we'll be able to preserve the Index information through an Awkward calculation.)

Specifically, an Awkward Identity for one element of an array is a tuple of integers and strings: integers indicate the array position at each level of nesting, and strings represent any table inclusion. When you select an element from an awkward array as

array[12, "outer_field1", 56, 0, "inner_field4", 99]

the Identity for that element is (12, "outer_field1", 56, 0, "inner_field4", 99). If you do any filtering or rearrangement of the array, the path to select the element may be different in the filtered/rearranged array, but the Identity is maintained. In physics, an Identity might look like (12345, "muons", 2, "associatedJets", 1), and it would have the same value even if you cut events or combine particles to look for decays. Keeping this information lets us join datasets derived from the same source: if you compute a quantity on filtered/manipulated data, you can still match that quantity with the original data or data that have been filtered/manipulated some other way.
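
To make the idea concrete, here is a minimal pure-Python sketch of a path-based identity surviving a filter. It is only an illustration under stated assumptions: the helper `assign_identities`, the dictionary layout, and the sample muon data are hypothetical and are not how awkward::Identity is implemented.

```python
# Hypothetical illustration in plain Python; not the awkward::Identity implementation.

def assign_identities(events):
    """Tag every event and muon with the path that selects it in the original data."""
    tagged = []
    for i, event in enumerate(events):
        muons = [
            {"identity": (i, "muons", j), "pt": pt}
            for j, pt in enumerate(event["muons"])
        ]
        tagged.append({"identity": (i,), "muons": muons})
    return tagged


events = [
    {"muons": [10.0, 35.5]},
    {"muons": []},
    {"muons": [50.2]},
]
tagged = assign_identities(events)

# Filter: drop events without muons and keep only muons with pt > 30.
filtered = [
    {"identity": ev["identity"], "muons": [m for m in ev["muons"] if m["pt"] > 30]}
    for ev in tagged
    if ev["muons"]
]

# The selection path changed (original event 2 now sits at index 1),
# but the stored identity still points back into the original data.
moved = filtered[1]["muons"][0]
event_index, field, muon_index = moved["identity"]
original = events[event_index][field][muon_index]
```

Because `moved["identity"]` still names the original path, a quantity computed on `filtered` can be joined back to `events` or to any other selection derived from the same source.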

NumPy without indexing has its place, and it is usually faster than Pandas because an index is one more thing to carry through all operations. Therefore, indexes are optional in Awkward. It is one of the few mutable attributes of an Awkward array: with a dataset you'd like to use as your "starting" form, you call array.setid() to recursively assign Identities based on paths from the top-level array.

In the planning documents, I mention Pandas a lot, but xarray would also be a good target for integration. I haven't looked into how integration with xarray works or what it requires. (Pandas requires a lot: my arrays need to be subclasses of a Pandas extension class.) Since we can now preserve index keys and pass them through an Awkward calculation, we should have an easier time converting to and from Pandas and xarray.

NumesSanguis · 2019-12-04T01:18:37Z

NumesSanguis
Dec 4, 2019
Author

Thanks for the detailed breakdown. I learned something today :)

(the Google Doc is getting stale as real development surpasses the planning)

Is there any place where the real progress is documented? You might consider using GitHub's kanban-style project boards: https://github.com/scikit-hep/awkward-1.0/projects

That sounds promising. If Pandas is (almost) fully supported, I assume xarray shouldn't be much of a technical challenge. xarray has the functions xr.Dataset.to_dataframe() and xr.Dataset.from_dataframe(), and Pandas has the function pandas.DataFrame.to_xarray().
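
For reference, the three conversion functions mentioned above can be exercised on a small table. This is a sketch that assumes pandas and xarray are installed; the column name `pt` and the index name `event` are made up for illustration.

```python
import pandas as pd
import xarray as xr

# A small indexed table; the names "pt" and "event" are illustrative only.
df = pd.DataFrame(
    {"pt": [10.0, 35.5, 50.2]},
    index=pd.Index([0, 1, 2], name="event"),
)

# pandas -> xarray: the named index becomes a coordinate/dimension.
ds = df.to_xarray()

# The same conversion, called from the xarray side.
same_ds = xr.Dataset.from_dataframe(df)

# xarray -> pandas: the dimension becomes the index again.
df_back = ds.to_dataframe()
```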

You might want to rename the functions awkward.topandas() and awkward.frompandas() to awkward.to_pandas() and awkward.from_pandas() to match the naming convention in related libraries (scikit-hep/awkward-0.x#215).

You can keep this issue open as a reminder, or in case other people want to pitch in their ideas, but I don't mind if you close it either.

jpivarski · 2019-12-04T04:24:47Z

jpivarski
Dec 4, 2019
Maintainer

This is a good place for up-to-date discussions. Implementation is happening as fast as possible to get to a minimum viable product, and documenting during that progress would hamper it. So just ask here and I'll try to answer your questions.

I started developing the high-level awkward.Array today and it might not need any from_* functions at all. The awkward.Array constructor can take all types and transform them appropriately. I'm actually writing the from_* functions in a submodule, which the constructor uses in its implementation, but the user interface can be just that constructor. That way, users don't have to go looking for the appropriate function, at least when ingesting into Awkward. (The other way is another issue.)
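
The design described above, a single constructor that delegates to from_* helpers, might look roughly like the following toy sketch. Everything here is hypothetical: the class `Array`, the helpers `from_iter` and `from_json`, and the type checks stand in for the real implementation, which is not shown in this thread.

```python
import json


# Hypothetical stand-ins for from_* functions that would live in a submodule.
def from_iter(iterable):
    return list(iterable)


def from_json(text):
    return json.loads(text)


class Array:
    """Toy high-level array whose constructor picks the right from_* helper."""

    def __init__(self, data):
        if isinstance(data, Array):
            # Already wrapped: share the underlying layout.
            self.layout = data.layout
        elif isinstance(data, str):
            # In this sketch, strings are treated as JSON documents.
            self.layout = from_json(data)
        else:
            # Fall back to generic iteration for lists, tuples, generators.
            self.layout = from_iter(data)

    def tolist(self):
        return self.layout


a = Array([[1, 2, 3], [], [4, 5]])
b = Array("[[1, 2, 3], [], [4, 5]]")
```

Users only need to remember `Array(...)`, while the conversion functions stay individually testable behind it.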

jpivarski · 2020-01-04T20:15:55Z

jpivarski
Jan 4, 2020
Maintainer

I think I can close this now. Cheers!

benbovy · 2020-07-29T08:43:22Z

benbovy
Jul 29, 2020

@jpivarski, I just found this issue after watching your very nice presentation for SciPy 2020. Here are some thoughts (note: I'm an xarray contributor):

Xarray supports __array_function__ (NEP 18), and there have recently been (still ongoing) efforts to integrate xarray with sparse, cupy and pint. As far as I understand, Awkward also supports NEP 18, so it shouldn't be too hard to wrap Awkward arrays as variables or coordinates in xarray Datasets or DataArrays. One benefit would arise when we have structured labels but still variable-sized data: we could create a data variable wrapping an Awkward array (where, e.g., each element of the array represents the vertices of a geometrical object) and then create one or more coordinates to store some labels for those objects. This way we could reuse those labels for other data variables as well, and also reuse all libraries (visualization, etc.) that already leverage xarray's labelling features. Integration between xarray and numpy-like array libraries will be made easier in the future (it is part of xarray's development roadmap).
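
For readers unfamiliar with NEP 18, the protocol lets any object intercept NumPy functions by defining `__array_function__`. The following is a minimal sketch with a toy class, `LoggedArray`, which is hypothetical and is neither Awkward nor xarray code; it assumes NumPy 1.17 or later, where the protocol is enabled by default.

```python
import numpy as np


class LoggedArray:
    """Toy duck array (hypothetical) that intercepts NumPy functions via NEP 18."""

    def __init__(self, values):
        self.values = np.asarray(values)
        self.calls = []

    def __array_function__(self, func, types, args, kwargs):
        # Record which NumPy function was requested, unwrap our own
        # instances among the positional arguments, and delegate to NumPy.
        self.calls.append(func.__name__)
        unwrapped = [a.values if isinstance(a, LoggedArray) else a for a in args]
        return func(*unwrapped, **kwargs)


arr = LoggedArray([1, 2, 3])
total = np.sum(arr)  # dispatched to LoggedArray.__array_function__
```

A wrapper library can forward NumPy calls to whatever array it holds in the same way, which is the mechanism the integration idea above relies on.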

Another item on xarray's development roadmap is support for flexible indexes. While I don't have any specific use case in mind, I can imagine how useful it would be to use Awkward arrays as xarray coordinates and wrap the awkward::Identity logic into an xarray-compatible index, e.g., for indexing along dimensions where labels have complex, arbitrary structures.

I may be missing other cases where integration between Awkward and xarray would be useful.

benbovy · 2020-07-29T08:49:56Z

benbovy
Jul 29, 2020

NumPy without indexing has its place, and it is usually faster than Pandas because an index is one more thing to carry through all operations. Therefore, indexes are optional in Awkward.

Note: indexes are also optional in xarray.

jpivarski · 2020-07-29T13:11:02Z

jpivarski
Jul 29, 2020
Maintainer

That would be super-cool, if Awkward Arrays could just be passed into xarray to gain labeled axes: it would be good separation of concerns on our side (we don't have to invent labeled axes) and on xarray's side (you get ragged arrays and other data structures for free).

I think I don't know enough about xarray: in my mind, it's "n-dimensional Pandas." We managed to pass Awkward Arrays into Pandas Series and DataFrames, but this introduces sticky inheritance problems (we get a lot of methods we don't want and have to preemptively import Pandas) and doesn't seem to provide much benefit because Pandas's functions see the Awkward data structures as black boxes. In toy analyses invented to demonstrate this feature, we find ourselves taking the arrays out of Pandas again to actually use them. Issue #350 is a call for use-cases before deciding whether to remove this feature.

It would already help if the way xarray accesses arrays is protocol-based so that we don't have to inherit from it (the import problem) and our structure-aware functions would still be usable (the black box problem). If this works smoothly for xarray in a way that it didn't for Pandas, that's more than just being "n-dimensional Pandas," it's being better integrated into the NumPy ecosystem.

If xarray-Awkward integration works, there would be more reason to deprecate the Pandas-Awkward integration, and I would recommend that physicist users consider xarray instead of Pandas. That would be an uphill trek: I've counted a lot of questions from the last few years about Pandas and only one or two about xarray, but the ability to use their data is a strong selling point. Is there a Pandas → xarray cheat sheet that could help users who are considering migrating?

Also, my assumption is that xarray would wrap around Awkward Arrays, rather than the other way around. We'd have to be sure that the __array_function__ methods chain appropriately.

benbovy · 2020-07-29T14:26:22Z

benbovy
Jul 29, 2020

in my mind, it's "n-dimensional Pandas."

Yes, that's right, but it can also be viewed as "labelled numpy-like arrays", or "in-memory netcdf data model implementation". This all may be a bit confusing, I admit.

Also, my assumption is that xarray would wrap around Awkward Arrays, rather than the other way around.

Agreed, that's what I suggested in my comment above (sorry if I was misleading). I think that efforts towards integration between xarray and awkward array would most likely happen on the xarray side, so I opened pydata/xarray#4285.

We'd have to be sure that the __array_function__ methods chain appropriately.

I guess it would work (see pydata/xarray#3117).

our structure-aware functions would still be usable

For any xarray variable object, the underlying array object can be accessed via the .data property.

I've counted a lot of questions from the last few years about Pandas and only one or two about xarray, but the ability to use their data is a strong selling point. Is there a Pandas → xarray cheat sheet that could help users who are considering migrating?

I'm not aware of a Pandas → xarray cheat sheet, but I know @rabernat is going to work soon on improving the documentation, so that could be an idea.

Regarding the data (formats), one possible issue is the difference between the file formats used by xarray users (e.g., netcdf) and the formats used by pandas users (e.g., columnar storage, parquet). BTW, I saw in the Awkward user guide a (still empty) section about Zarr. Just curious, is it a format that Awkward will eventually support? (There is already good support for it in xarray.)

jpivarski · 2020-07-29T15:37:40Z

jpivarski
Jul 29, 2020
Maintainer

Regarding the data (formats), one possible issue is the difference between the file formats used by xarray users (e.g., netcdf) and the formats used by pandas users (e.g., columnar storage, parquet). BTW, I saw in the Awkward user guide a (still empty) section about Zarr. Just curious, is it a format that Awkward will eventually support? (There is already good support for it in xarray.)

Here's what we're thinking: zarr-developers/zarr-specs#62. Awkward arrays can be decomposed into a set of flat arrays (ak.to_arrayset and ak.from_arrayset), and an interface that is aware that it's supposed to put them together and can pass the JSON-formatted "Form" that Awkward needs to do this reassembly would be able to transparently pass Awkward arrays to and from storage formats designed for rectilinear arrays. (Actually, all we really need is a key-value binary blob store.)
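
As a rough picture of that decomposition, here is a sketch that splits a ragged list into flat buffers, stores them as binary blobs under string keys, and reassembles them from a JSON description. It is an assumption-laden toy: `to_blobs`, `from_blobs`, and the JSON layout are hypothetical and do not reproduce the output of ak.to_arrayset or Awkward's actual Form schema.

```python
import json

import numpy as np


def to_blobs(ragged):
    """Decompose a list of lists of floats into flat buffers plus a JSON description."""
    offsets = np.cumsum([0] + [len(x) for x in ragged]).astype(np.int64)
    content = np.array([v for x in ragged for v in x], dtype=np.float64)
    # Hypothetical description format; the real Awkward "Form" is richer.
    form = json.dumps(
        {"kind": "list-of-offsets", "offsets": "offsets", "content": "content"}
    )
    store = {"offsets": offsets.tobytes(), "content": content.tobytes()}
    return form, store


def from_blobs(form, store):
    """Reassemble the ragged list from the key-value blob store."""
    keys = json.loads(form)
    offsets = np.frombuffer(store[keys["offsets"]], dtype=np.int64)
    content = np.frombuffer(store[keys["content"]], dtype=np.float64)
    return [content[a:b].tolist() for a, b in zip(offsets[:-1], offsets[1:])]


form, store = to_blobs([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
roundtrip = from_blobs(form, store)
```

The store is nothing more than a mapping from keys to bytes, which is why any key-value binary blob store would suffice as long as the description travels with it.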

For the users that I'm directly supporting, particle physicists, ROOT files are the format of choice. The Uproot library reads ROOT files as Awkward arrays. For non-physicists, I've added ak.to_parquet and ak.from_parquet, which go through pyarrow, as well as Arrow itself (ak.to_arrow and ak.from_arrow). All of these formats—ROOT, Parquet, Arrow—preserve the nested data structures without needing to break the Awkward array down into an array-set (and run the risk of someone getting a file full of array pieces and not knowing what to do with it).

The previous version of Awkward had an interface to HDF5, but the recipient of the HDF5 file would need to know that they should open it through Awkward. If there was metadata in HDF5 or NetCDF to say, "Pass this group through such-and-such an external function, raising an ImportError if it can't be loaded, and return the result as one array object," then that would solve the problem and we could re-introduce this I/O option. That's essentially what we're talking about doing with Zarr v3: adding an extension that says, "Try to interpret this array-set in library X."

jpivarski · 2023-12-30T18:47:32Z

jpivarski
Dec 30, 2023
Maintainer

Follow up on this topic at scikit-hep/ragged#6
