Using Awkward Arrays in or with Xarray #27
Replies: 10 comments
-
This is a good place to discuss (the Google Doc is getting stale as real development surpasses the planning). And congrats on the first issue!

Although I haven't explicitly used xarray, I know about it and its place in the ecosystem. Whereas Pandas represents two-dimensional, indexed tables (or more dimensions, sparsely, through Awkward array doesn't conflict in scope or purpose with Pandas or xarray; it does something they don't: arbitrary data structures. (In a "first class" sense; Pandas can put Python data in cells but can't do vector operations on them the way that Awkward can.) So, as a Venn diagram, Awkward is not a subset of Pandas/xarray.

But since indexed keys are so useful, they've been integrated into the plan. In Awkward, the index column is called Specifically, an Awkward Identity for one element of an array is a tuple of integers and strings: integers indicate the array position at each level of nesting, and strings represent any table inclusion. When you select an element from an awkward array[12, "outer_field1", 56, 0, "inner_field4", 99] the Identity for that element is

NumPy without indexing has its place, and it is usually faster than Pandas because an index is one more thing to carry through all operations. Therefore, indexes are optional in Awkward. It is one of the few mutable attributes of an Awkward array: with a dataset you'd like to use as your "starting" form, you call

In the planning documents, I mention Pandas a lot, but xarray would also be a good target for integration. I haven't looked into how integration with xarray works or what it requires. (Pandas requires a lot: my arrays need to be subclasses of a Pandas extension class.) Since we can now preserve index keys and pass them through an Awkward calculation, we should have an easier time converting to and from Pandas and xarray.
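To make the "tuple of integers and strings" idea concrete, here is a minimal pure-Python sketch. It does not use Awkward's API: the helper name select_by_path and the sample data are hypothetical, and nested lists and dicts stand in for nested arrays and tables. Integers are positions at each level of nesting and strings are field names.

```python
from typing import Any, Tuple, Union

PathStep = Union[int, str]


def select_by_path(data: Any, path: Tuple[PathStep, ...]) -> Any:
    """Follow a path of positions (int) and field names (str) into nested data.

    Hypothetical helper for illustration only; not part of Awkward.
    """
    current = data
    for step in path:
        if isinstance(step, int) and not isinstance(current, (list, tuple)):
            raise TypeError(f"position {step!r} needs a list, got {type(current).__name__}")
        if isinstance(step, str) and not isinstance(current, dict):
            raise TypeError(f"field {step!r} needs a record, got {type(current).__name__}")
        current = current[step]
    return current


# Made-up sample data: a list of records, each holding a variable-length list.
events = [
    {"muons": [{"pt": 10.5}, {"pt": 3.2}]},
    {"muons": []},
    {"muons": [{"pt": 7.1}]},
]

# The path that locates one value doubles as a label for that value.
identity = (0, "muons", 1, "pt")
value = select_by_path(events, identity)
print(identity, "->", value)
```

The point of the sketch is only that the same tuple both selects an element and records where it came from, so it can be carried along with the value through later operations.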
-
Thanks for the detailed breakdown. I learned something today :)
Is there any place where the real progress is documented? You might consider using GitHub's kanban boards: https://github.com/scikit-hep/awkward-1.0/projects

It sounds promising. If Pandas is (almost) fully supported, I assume xarray shouldn't be much of a technical challenge. xarray has the functions: You might want to rename the functions

You can keep this issue open as a reminder, or in case other people want to pitch in their ideas, but I don't mind if you close it either.
-
This is a good place for up-to-date discussions. Implementation is happening as fast as possible to get to a minimum viable product, and documenting along the way would slow it down. So just ask here and I'll try to answer your questions. I started developing the high-level
-
I think I can close this now. Cheers!
-
@jpivarski, I just found this issue after watching your very nice presentation for SciPy 2020. Here are some thoughts (note: I'm an xarray contributor):

Xarray supports

Another item on xarray's development roadmap is adding support for flexible indexes. While I don't have any specific use case in mind, I can imagine how useful it would be to use Awkward arrays as xarray coordinates and wrap the

I may be missing other cases where integration between Awkward and xarray would be useful.
-
Note: indexes are also optional in xarray.
-
That would be super-cool, if Awkward Arrays could just be passed into xarray to gain labeled axes: it would be good separation of concerns on our side (we don't have to invent labeled axes) and on xarray's side (you get ragged arrays and other data structures for free).

I think I don't know enough about xarray; in my mind, it's "n-dimensional Pandas." We managed to pass Awkward Arrays into Pandas Series and DataFrames, but this introduces sticky inheritance problems (we get a lot of methods we don't want and have to preemptively import Pandas) and doesn't seem to provide much benefit because Pandas's functions see the Awkward data structures as black boxes. In toy analyses invented to demonstrate this feature, we find ourselves taking the arrays out of Pandas again to actually use them. Issue #350 is a call for use-cases before deciding whether to remove this feature.

It would already help if the way xarray accesses arrays is protocol-based so that we don't have to inherit from it (the import problem) and our structure-aware functions would still be usable (the black box problem). If this works smoothly for xarray in a way that it didn't for Pandas, that's more than just being "n-dimensional Pandas," it's being better integrated into the NumPy ecosystem.

If xarray-Awkward integration works, there would be more reason to deprecate the Pandas-Awkward integration and I would recommend that physicist users consider xarray instead of Pandas. That would be an uphill trek (I've counted a lot of questions from the last few years about Pandas and only one or two about xarray), but the ability to use their data is a strong selling point. Is there a Pandas → xarray cheat sheet that could help users who are considering migrating?

Also, my assumption is that xarray would wrap around Awkward Arrays, rather than the other way around. We'd have to be sure that the
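As a rough illustration of the difference between protocol-based access and inheritance discussed above, here is a pure-Python sketch. Nothing in it is xarray's or Awkward's actual API: ArrayLike, RaggedArray, Labeled, and the sel method are hypothetical names chosen for the example. The wrapper only requires that the wrapped object supports len() and indexing, so the wrapped class inherits from nothing and its own structure-aware methods stay reachable.

```python
from typing import Any, List, Protocol, Sequence, runtime_checkable


@runtime_checkable
class ArrayLike(Protocol):
    """Hypothetical minimal protocol: anything with len() and indexing."""

    def __len__(self) -> int: ...

    def __getitem__(self, key: Any) -> Any: ...


class RaggedArray:
    """Stand-in for a structure-aware array; it inherits from nothing."""

    def __init__(self, rows: Sequence[Sequence[float]]) -> None:
        self._rows = [list(row) for row in rows]

    def __len__(self) -> int:
        return len(self._rows)

    def __getitem__(self, key: Any) -> Any:
        return self._rows[key]

    def counts(self) -> List[int]:
        # A structure-aware operation the wrapper knows nothing about.
        return [len(row) for row in self._rows]


class Labeled:
    """Hypothetical wrapper that attaches one named, labeled axis."""

    def __init__(self, data: ArrayLike, dim: str, coords: Sequence[Any]) -> None:
        if not isinstance(data, ArrayLike):
            raise TypeError("data must support len() and indexing")
        if len(coords) != len(data):
            raise ValueError("need exactly one coordinate label per entry")
        self.data = data
        self.dim = dim
        self.coords = list(coords)

    def sel(self, label: Any) -> Any:
        # Label-based selection, translated into a positional lookup.
        return self.data[self.coords.index(label)]


labeled = Labeled(RaggedArray([[1.0, 2.0, 3.0], [], [4.0, 5.0]]), "event", ["a", "b", "c"])
print(labeled.sel("c"))
print(labeled.data.counts())
```

Because the wrapper holds the original object rather than converting it, the "black box" problem does not arise in this toy version: labeled.data is still a RaggedArray with all of its methods.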
-
Yes, that's right, but it can also be viewed as "labelled numpy-like arrays", or "in-memory netcdf data model implementation". This all may be a bit confusing, I admit.
Agreed, that's what I suggested in my comment above (sorry if I was misleading). I think that efforts towards integration between xarray and awkward array would most likely happen on the xarray side, so I opened pydata/xarray#4285.
I guess it would work (see pydata/xarray#3117).
For any xarray variable object, the underlying array object can be accessed via the
I'm not aware of a Pandas → xarray cheat sheet, but I know @rabernat is going to work soon on improving the documentation, so that could be an idea. Regarding the data (formats), one possible issue here is the difference between the file formats used by xarray users (e.g., netcdf) and the formats used by pandas users (e.g., columnar storage, parquet).

BTW, I saw in the Awkward user guide a (still empty) section about Zarr. Just curious, is it a format that Awkward will eventually support? (There is already good support for it in xarray.)
-
Here's what we're thinking: zarr-developers/zarr-specs#62. Awkward arrays can be decomposed into a set of flat arrays (ak.to_arrayset and ak.from_arrayset). An interface that knows it is supposed to put them back together, and that can pass along the JSON-formatted "Form" that Awkward needs for this reassembly, would be able to transparently pass Awkward arrays to and from storage formats designed for rectilinear arrays. (Actually, all we really need is a key-value binary blob store.)

For the users that I'm directly supporting, particle physicists, ROOT files are the format of choice. The Uproot library reads ROOT files as Awkward arrays. For non-physicists, I've added ak.to_parquet and ak.from_parquet, which go through pyarrow, as well as Arrow itself (ak.to_arrow and ak.from_arrow). All of these formats (ROOT, Parquet, Arrow) preserve the nested data structures without needing to break the Awkward array down into an array-set (and run the risk of someone getting a file full of array pieces and not knowing what to do with it).

The previous version of Awkward had an interface to HDF5, but the recipient of the HDF5 file would need to know that they should open it through Awkward. If there were metadata in HDF5 or NetCDF to say, "Pass this group through such-and-such an external function, raising an ImportError if it can't be loaded, and return the result as one array object," then that would solve the problem and we could re-introduce this I/O option. That's essentially what we're talking about doing with Zarr v3: adding an extension that says, "Try to interpret this array-set in library X."
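The decomposition idea can be sketched without any library at all. The following is a hand-rolled illustration, not a call to ak.to_arrayset or ak.from_arrayset, and the JSON layout is a made-up stand-in rather than Awkward's real Form schema. The function names to_blobs and from_blobs and the key names are hypothetical. A ragged list is split into two flat arrays (offsets and content) and stored, together with a small JSON description, in a plain dict acting as the key-value binary blob store.

```python
import json
from array import array
from typing import Dict, List


def to_blobs(ragged: List[List[float]], store: Dict[str, bytes]) -> None:
    """Split a ragged list into flat buffers plus a JSON description."""
    offsets = array("q", [0])  # 64-bit integers: row i spans offsets[i]:offsets[i + 1]
    content = array("d")       # 64-bit floats: all rows concatenated
    for row in ragged:
        content.extend(row)
        offsets.append(len(content))
    # Made-up description format, standing in for the "Form".
    form = {"kind": "ragged-list", "offsets": "q", "content": "d", "length": len(ragged)}
    store["form.json"] = json.dumps(form).encode("utf-8")
    store["offsets"] = offsets.tobytes()
    store["content"] = content.tobytes()


def from_blobs(store: Dict[str, bytes]) -> List[List[float]]:
    """Reassemble the ragged list from the flat buffers."""
    form = json.loads(store["form.json"].decode("utf-8"))
    offsets = array(form["offsets"])
    offsets.frombytes(store["offsets"])
    content = array(form["content"])
    content.frombytes(store["content"])
    return [list(content[offsets[i]:offsets[i + 1]]) for i in range(form["length"])]


store: Dict[str, bytes] = {}
original = [[1.1, 2.2, 3.3], [], [4.4, 5.5]]
to_blobs(original, store)
restored = from_blobs(store)
print(sorted(store))
print(restored == original)
```

Without the "form.json" entry, a reader would see only two unrelated flat arrays, which is the "file full of array pieces" risk mentioned above.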
-
Follow up on this topic at scikit-hep/ragged#6
-
Not sure if here or the Google doc is better (but no support for Markdown), so feel free to move it.
When searching for more user-friendly approaches to arrays, I came across xarray, which allows for labelled selection of ND arrays: http://xarray.pydata.org/en/stable/why-xarray.html. While this library extends Pandas with N-dimensional arrays, it is still limited to rectangular arrays. From their documentation:

After using awkward-array for a bit, I feel there are some similarities with Table:

Some possible advantages of xarray:
xarray.Dataset?
netCDF, which are HDF5 files + self-describing data

Not sure how exactly AwkwardArray and xarray would integrate, but I think it would be good to give it some consideration. Maybe it's not even possible, but some elements could be useful?
What are your thoughts on it?