Feature Request: Hierarchical storage and processing in xarray #4118
Comments
Thanks @jhamman for sharing the link. Here are my thoughts on the same: For use-cases similar to the one I have mentioned, I think it would be more meaningful to allow the tree structure (calling it Besides, xarray only allows attribute access for getting (and not setting) values, but a separate data structure can allow attribute access for setting values as well. For example, the data structure that I have implemented would allow something like I am currently using attribute-based access for accessing child nodes/data arrays in the Instead of using netCDF4 groups for encoding the
Therefore, within the netCDF file, it would exist just as a Dataset. A specially implemented |
Thanks for writing this up @emilbiju . These are very interesting ideas
|
I would be open to exploring adding a hierarchical data structure into xarray (on an experimental basis, to start), but it would need someone with serious interest and time to make it happen. Certainly there are plenty of use cases across various fields. |
The data model you sketch out here looks very similar to what we discussed in #1092. I agree that the semantics are well defined. The main question in my mind is whether it would make more sense to make an entirely new data structure (e.g., Probably a new data structure would be easier at this point, because would keep |
@joshmoore - based on pangeo-forge/pangeo-forge-recipes#27 (comment), you may be interested in this issue. One way to do multiscale datasets in Xarray would be to use hierarchical groups (one group per scale). |
Just a note that this link has moved to: https://arviz-devs.github.io/arviz/getting_started/XarrayforArviZ.html |
Thanks for the link, @jhamman. The most immediate issue I ran into when trying to use xarray with OME-Zarr data does seem similar. A rough representation of one multiscale image is:
but of course the x, y and z dimensions are of different sizes in each volume. |
@jhamman @joshmoore a prototype to bring together XArray and OME-Zarr/NGFF with multiple groups: |
On today's Xarray dev call, we discussed pursuing another CZI grant to support this feature in Xarray. The image pyramid use case would provide a strong link to the bioimaging community. @alexamici and the B-open folks seem enthusiastic. I had to leave the meeting early, so I didn't hear the end of the conversation. But did we decide who might serve as PI for such a proposal? |
No. @emilbiju are you interested in open-sourcing your work? |
FWIW, a while ago I wrote a mock-up (and probably outdated) https://gist.github.com/benbovy/92e7c76220af1aaa4b3a0b65374e233a (nbviewer link) |
This is related to some very recent work we have been doing at NSLS-II, primarily led by @danielballan . |
Not really sure if there is anything we can do from ArviZ to help with that, if there is let us know and we'll do our best cc @percygautam |
@alexamici and I can write the technical part of the proposal. |
Happy to provide assistance on the image pyramid (i.e. "multiscale") use case. |
So we have:
We are just missing a PI, someone who is willing to put their name on top of the proposal and click submit. I have gone on record as committed to not leading any new proposals this year. And in any case, this is a good opportunity for someone else from the @pydata/xarray core dev team to try on a leadership role. |
I volunteer to contribute writing to this from the condensed matter / synchrotron user facility perspective. |
I can shoulder part of the load and help is definitely needed. LOI is due on Tuesday. I'll take a stab this evening and post a link. |
Here are some biomedical papers that are using ArviZ and therefore xarray, even if most don't cite xarray and some don't cite ArviZ either. Topics are quite diverse: covid, psychology, biomolecules, oncology... Some recent ArviZ biomedical citations
|
I'm excited to see this coming together! I would be happy to advise as well... Side note: at some point, this would probably be worth adding to Xarray's official roadmap. |
We could also provide a use-case in remote sensing: it would be really useful in the interferometric processing for managing Sentinel-1 IW and EW SLC data, which has multiple tiles (bursts) partially overlapping in one direction (azimuth). |
This sounds like an interesting project - I'm also about to be able to work on xarray much more directly (thanks @rabernat ). Should I add this as another xarray project board alongside explicit indexes and so on? I wonder if this could find another domain use case in plasmapy as part of the overall |
Whoa, that sounds awesome! Thanks for the heads up :) Definitely could be quite handy, looking forward to seeing how this develops. @rocco8773 this should be interesting for you as well :) |
@TomNicholas (cc @mraspaud)
The two main classes of on-disk formats that I know of which cannot always be represented in the "group is a Dataset" approach are:
I don't have an example at hand, but my impression is that satellite products that use HDF5 file format also place arrays with inconsistent dimensions / coordinates in the same group. |
One thing that came up in our discussion about this in the developer meeting today is that we could also pretty easily expose a "low level" API for IO using dictionaries of xarray.Variable objects. This intermediate representation could be useful for cleaning up data into a form suitable for conversion into Dataset objects.
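As a rough sketch of how such an intermediate representation might be used: the dictionary below mixes variables whose dimensions share names but differ in size, and a helper packs them into separate, internally consistent Datasets. The helper `split_into_datasets` and the variable names are hypothetical; no such function exists in xarray, and the greedy grouping is just one possible cleanup strategy.

```python
import numpy as np
import xarray as xr


def split_into_datasets(variables):
    """Greedily pack Variables into Datasets whose dimension sizes agree.

    Hypothetical helper for illustration; not part of xarray.
    """
    groups = []  # each entry: (known dimension sizes, {name: Variable})
    for name, var in variables.items():
        var_sizes = dict(zip(var.dims, var.shape))
        for sizes, members in groups:
            # A variable fits if none of its dimensions conflicts in size.
            if all(sizes.get(d, n) == n for d, n in var_sizes.items()):
                sizes.update(var_sizes)
                members[name] = var
                break
        else:
            groups.append((dict(var_sizes), {name: var}))
    return [xr.Dataset(members) for _, members in groups]


# A flat "group" whose arrays reuse dimension names with different sizes.
raw = {
    "full": xr.Variable(("y", "x"), np.zeros((4, 4))),
    "half": xr.Variable(("y", "x"), np.zeros((2, 2))),
    "mask": xr.Variable(("y", "x"), np.ones((4, 4))),
}

datasets = split_into_datasets(raw)
for ds in datasets:
    print(sorted(ds.data_vars), dict(ds.sizes))
```

Putting all three variables into a single `xr.Dataset` would fail on the conflicting sizes, which is why a step like this before conversion could be useful.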
On Wed, Feb 16, 2022 at 11:39 PM Alessandro Amici wrote:
@TomNicholas (cc @mraspaud)
Do you have use cases which one of these designs could handle but the other couldn't?
The two main classes of on-disk formats that, I know of, which cannot be always represented in the "group is a Dataset" approach are:
- in netCDF following the CF conventions for groups (https://cfconventions.org/Data/cf-conventions/cf-conventions-1.9/cf-conventions.html#groups), it is legal for an array to refer to a dimension or a coordinate in a different group and so arrays in the same group may have dimensions with the same name, but different size / coordinate values,
- the current spec for the Next-generation file formats (NGFF) (https://ngff.openmicroscopy.org) for bio-imaging has all scales of the same 5D data in the same group.
I don't have an example at hand, but my impression is that satellite products that use HDF5 file format also place arrays with inconsistent dimensions / coordinates in the same group.
|
@TomNicholas I also have a few comments on the comparison:
This is only true for flat netCDF files. Once you introduce groups in a netCDF file AND accept CF conventions, the DataGroup approach can map 100% of the files, while the DataTree approach fails on an (admittedly small) class of them.
Both points are only true for the DataArray in a single group. Once you broadcast any operation to subgroups, the two implementations would share the same limitations (dimensions in subgroups can be inconsistent in both cases). In my opinion the advantage for the DataTree is minimal.
The two approaches are identical in this respect: group attributes are mapped in the same way to DataTree and DataGroup. I share your views on all other points. |
I'm having difficulty understanding your point above with respect to the scoping rules from the CF document. Shouldn't it be impossible to create two arrays (in the same group) having dimensions with exactly the same name from different groups? Looking at the example here https://github.com/alexamici/xarray-datagroup there are coordinates with name "/lat" vs "lat". Aren't those two different names? Maybe I'm missing something essential here. |
@kmuehlbauer in the representation I use the fully qualified name for the dimension / coordinate, but the corresponding |
Thanks for clarifying. I'm wondering if that can be a source of misunderstanding. How should the user differentiate that? I mean finally those dimensions which have the same name |
@alexamici can you expand on the role of the CF conventions in this statement? Are you talking about CF conventions allowing one variable in one group to refer to dimension present in another group, or something else? |
I am not sure I completely understand option 2, but option 1 seems a better fit to what we are doing at ArviZ (so far we are managing quite well with the InferenceData mentioned above which is a collection of independent xarray datasets). In our case, well defined selection for multiple variables at the same time (i.e. at the dataset level) is very useful. I was also wondering what changes (if any) would each option imply when using |
Often I run a function over a dataset, with each call outputting a hierarchical data structure, containing fixed dimensions in the best cases and variable length in the worst. Datagroup and Datatree are subcases of this general structure, which could be enforced via flags/checks. I'm sure I'm missing some big issue with the mental model I have, for instance I haven't thought of transformations at all and about coordinates. But for clarity I tried to write it down below. The most general structure for a dataset I can think of is a directed graph. To get a hierarchical structure, we:
We can resolve D's target by (A) checking for a sibling in T with the same name and, if there is none, going up one level and repeating (A). Multi-indexes (multi-dimensional (sparse) labels) generalize this model, but require tuple labels in T's edges i.e. : |
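The resolution rule in the comment above (look for the name at the current level, otherwise go up one level and try again) can be sketched in plain Python. The `Node` class, its `dims` mapping, and the example names are hypothetical and exist only to illustrate the lookup; they are not an xarray API, and the sketch ignores coordinates and multi-index labels.

```python
class Node:
    """A tree node that may define dimensions visible to its descendants."""

    def __init__(self, name, parent=None, dims=None):
        self.name = name
        self.parent = parent
        self.dims = dict(dims or {})  # dimension name -> length
        self.children = {}
        if parent is not None:
            parent.children[name] = self

    def resolve_dim(self, dim):
        """Return (defining node, length) for the nearest definition of `dim`."""
        node = self
        while node is not None:
            if dim in node.dims:  # step (A): is the name defined at this level?
                return node, node.dims[dim]
            node = node.parent  # not found: go up one level and repeat (A)
        raise KeyError(f"dimension {dim!r} is not defined on the path to the root")


root = Node("root", dims={"time": 10})
weather = Node("weather", parent=root, dims={"station": 3})
hourly = Node("hourly", parent=weather, dims={"time": 240})  # shadows root's "time"

owner, length = hourly.resolve_dim("time")
print(owner.name, length)

owner, length = weather.resolve_dim("time")
print(owner.name, length)
```

A nearer definition shadows one higher up, which is the same scoping behaviour a name lookup in nested groups would have.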
Hi @LunarLanding , thanks for your ideas!
It sounds a bit like what you are suggesting is essentially a model in which dimensions are explicit objects, which can be referred to from other groups, like in netCDF. (NetCDF has "dimension IDs".) This would be a bit of a departure from the model that
By "variable" length, do you mean that the length of dimensions differs between variables in the same group, or just that you don't know the length of the dimension in advance? Is there a specific use case which you think would require explicit dimensions to solve? |
Also thanks @OriolAbril , it's useful to have an ArviZ perspective.
I see In either case I imagine all we might need to do is slightly extend |
I mean that I might have, for instance, a map from 2 variables to data, ie (x,y)->c, that I can write as a DataArray XY with two dimensions x and y and the values being c.
The use-case is iteratively adding values to a dataset by mapping functions over multiple variables / dimensions in arbitrary compositions. PS: in netCDF-4, dimensions are seen by children, which matches what I previously posted; in HDF5, nodes are hard links to the actual data, which might be exactly the xarray-datagroup posted above. Example of ideal data structure
The data structure that is more useful for this kind of analysis is the one that is an arbitrary graph of n-dimensional arrays; forcing the graph to have a hierarchical access allows optional organization; the graph itself can exist as python objects for nodes and references for edges. Example: Notation
Start with a 2-dimensional DataArray:
Map a function
Map a function
Notice that both Suppose I now want to run analysis on f's and g's output, with a function that takes two a's and outputs a float
Compared to what I posted before, I dropped resolving the dimension of an array by its position in the hierarchy, since it would be inapplicable when a variable refers to dimensions in a different branch of the tree. |
@LunarLanding You may also be interested in awkward array. |
Wanted to note issue ( carbonplan/ndpyramid#10 ) here, which may be of interest to people here. Also we are thinking about a Dask blogpost in this space if people have thoughts on what we should include and/or are interested in being involved. Details in issue ( dask/dask-blog#141 ). |
Over 4 years later, closed by v2024.10.0 - see the announcement discussion. Thanks everyone - and especially to @emilbiju for your very generative original ideas here. |
Huge thanks to everyone involved in making this happen! 🚀 |
I am using xarray for processing geospatial data and have encountered two major challenges with existing data structures in xarray:
Data arrays stored in an xarray Dataset cannot be grouped into hierarchical levels/logical subsets to reflect the internal organisation of the data. This makes it difficult to identify and process a subset of the data variables that pertain to a specific problem.
When two data arrays having a shared dimension but different coordinate values along the dimension are merged into a Dataset, the union of coordinate values from the 2 data arrays becomes the new coordinate set corresponding to that dimension. Consequently, when the value of a variable in the dataset corresponding to a coordinate value is unknown, nan is used as a substitute, which results in memory wastage.
I would like to suggest a tree-based data structure for xarray in which the leaves store individual data arrays and the other nodes store the hierarchical information. Since data arrays are stored independently, each dimension only needs to be associated with coordinate values that are valid for that data array.
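The NaN padding described in the second challenge can be reproduced with a small example. The array names and values are made up for illustration, and `join="outer"` is passed explicitly so the union behaviour does not depend on the library default.

```python
import numpy as np
import xarray as xr

# Two arrays that share the dimension "time" but have disjoint coordinate values.
a = xr.DataArray(
    np.array([1.0, 2.0, 3.0]), dims="time", coords={"time": [0, 1, 2]}, name="a"
)
b = xr.DataArray(
    np.array([10.0, 20.0]), dims="time", coords={"time": [5, 6]}, name="b"
)

# An outer join takes the union of the "time" coordinates for both variables.
merged = xr.merge([a, b], join="outer")

n_real = a.size + b.size
n_stored = merged["a"].size + merged["b"].size
n_nan = int(merged["a"].isnull().sum()) + int(merged["b"].isnull().sum())
print(merged.sizes["time"], n_real, n_stored, n_nan)
```

Here five real values end up occupying ten slots, half of which are NaN. Keeping each array in its own node avoids that padding.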
To meet these requirements, I have implemented a data structure that also supports the below capabilities:
dt) with child nodes: weather, satellite image and population. Each of these nodes has data arrays/subtrees under it.
The mean over time of all data variables associated with weather can be obtained using dt.weather.mean('time'), which applies the function to sea_surface_temperature, dew_point_temperature, wind_speed and pressure.
.I would like to know of the possibility of introducing such a data structure in xarray and the challenges involved in the same.