Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Interoperability with xarray/dask #479

Open
bekozi opened this issue Mar 13, 2018 · 14 comments
Open

Interoperability with xarray/dask #479

bekozi opened this issue Mar 13, 2018 · 14 comments
Milestone

Comments

@bekozi
Copy link
Contributor

bekozi commented Mar 13, 2018

Add to_xarray(...) on ocgis fields and potentially a to_ocgis(...) somewhere on the other end.

Ping @jhamman

@jhamman
Copy link

jhamman commented Mar 13, 2018

Thanks for opening this @bekozi.

Is a field a single variable (e.g. temperature)? Or is it a collection of variables?

For a single variable, we'd be looking to map the ocgis structure to an xarray.DataArray. The constructor for that is:

xarray.DataArray(data, coords=None, dims=None, name=None, attrs=None)
  • dims: Tuple of dimension names associated with this array.
  • values: The array’s data as a numpy/dask array.
  • coords: Dictionary-like container of coordinate arrays (these can be numpy arrays or other DataArrays).
  • name: The name of this array.
  • attrs: Dictionary storing arbitrary metadata with this array.

The xarray.Dataset is just a dict-like container of DataArrays. So getting the mapping for a DataArray done would be the first step.

@bekozi
Copy link
Contributor Author

bekozi commented Mar 13, 2018

Is a field a single variable (e.g. temperature)? Or is it a collection of variables?

It's a bit of both. It was easier to have fields also be collections so that is what they are - essentially collections associated with coordinate variables via a dimension map. They are also dict-like containers similar to the xarray.Dataset.

For a single variable, we'd be looking to map the ocgis structure to an xarray.DataArray. The constructor for that is:

Cool, thanks. It seems pretty straightforward, but I do have some questions (turned into more than I expected...):

  1. ocgis has a GeometryVariable storing an n-dimensional object array (it can have a mask too) of shapely geometries. Does xarray work with geopandas? What is the most appropriate way to move data back and forth here?
  2. When working with a decomposition, ocgis dimensions store local and global bounds. Can this information be maintained (elegantly) so converted objects running in parallel don't get lost? This is kind of a detail, but worth keeping in mind... Actually, most of these questions are details.
  3. Does xarray do anything special with coordinate systems (e.g. WGS84)?
  4. Are coordinate bounds variables just passed to coords?
  5. Is the best method for passing masked data to values a masked array? Are fill values maintained?
  6. Does xarray support ragged arrays (i.e. netcdf variable length types)? This is important for UGRID data.
  7. It looks like groups are supported in xarray so we can move those over as well. Anything to be aware of there?
  8. ocgis maintains singleton dimensions. Will these be squeezed by default? Undimensioned variables okay in xarray?

Sorry for the barrage here. Take your time. 😁

@jhamman
Copy link

jhamman commented Mar 13, 2018

ocgis has a GeometryVariable storing an n-dimensional object array (it can have a mask too) of shapely geometries. Does xarray work with geopandas? What is the most appropriate way to move data back and forth here?

There hasn't been much (any?) direct integration with geopandas. We can store arrays of objects in xarray but there won't be the geo-hooks that geopandas has. I think that is okay for a first cut.

When working with a decomposition, ocgis dimensions store local and global bounds. Can this information be maintained (elegantly) so converted objects running in parallel don't get lost? This is kind of a detail, but worth keeping in mind... Actually, most of these questions are details.
Does xarray do anything special with coordinate systems (e.g. WGS84)?

If these are scalars (or tuples), they can either be stored as coordinates or as attributes. If they are attributes, they won't be used by xarray's operations but they will stay with the data.

Are coordinate bounds variables just passed to coords?

Yes, usually. If they have different dimensions/shapes, they need a bit of special treatment but for now, let's treat them as coords.

Is the best method for passing masked data to values a masked array? Are fill values maintained?

If you pass in a masked array, xarray will convert it to a standard numpy array and use nan's as fill values. This is mostly for performance. I'm not 100% sure but I don't think we store the fill value. We could probably come up with something though.

Does xarray support ragged arrays (i.e. netcdf variable length types)? This is important for UGRID data.

No. There are conversations in the works for supporting sparse arrays but this is not a feature yet.

It looks like groups are supported in xarray so we can move those over as well. Anything to be aware of there?

Do you mean groups in the hierarchical sense (like HDF)? This is not supported by xarray.

ocgis maintains singleton dimensions. Will these be squeezed by default?

I think this should work (as long as the array shape has the same number of dimensions as the specified dimensions).

Undimensioned variables okay in xarray?

Yes. Xarray will add dummy dimension names (e.g. dim_0) but this is supported.

@bekozi
Copy link
Contributor Author

bekozi commented Mar 13, 2018

Thanks for the quick response @jhamm. All sounds pretty good!

There hasn't been much (any?) direct integration with geopandas. We can store arrays of objects in xarray but there won't be the geo-hooks that geopandas has. I think that is okay for a first cut.

Got it. I too think that'll work okay at first.

I'm not 100% sure but I don't think we store the fill value. We could probably come up with something though.

Not that big of deal, more curious how masking is treated. It only matters in very specific cases.

No. There are conversations in the works for supporting sparse arrays but this is not a feature yet.

I think the object arrays will suffice now that I think about it.

Do you mean groups in the hierarchical sense (like HDF)? This is not supported by xarray.

I do. And oops, I thought I saw this as a feature. This will be a problem with spatial collections where a subset geometry stores it's associated subsetted field. It should be possible to get around this by "melting" but will take a bit of leg work. Not as big a deal with unioning and spatial averaging.

@bekozi
Copy link
Contributor Author

bekozi commented Mar 20, 2018

@jhamman I've made some progress on this: https://github.com/NCPP/ocgis/tree/i479-xarray. I need to add coords which may take a little bit. I did encounter one thing. Is it a known feature that data types in xarray or pandas are upcast to float if an integer array has a mask? See this code as an example. The test currently fails. What do you think?

@jhamman
Copy link

jhamman commented Mar 21, 2018

Cool! Glad to see this is moving forward.

Is it a known feature that data types in xarray or pandas are upcast to float if an integer array has a mask?

Yes. I think this issue is the closest open one to what you're running into: pydata/xarray#1194

I suspect we'll get there eventually but not particularly soon.

@bekozi
Copy link
Contributor Author

bekozi commented Mar 21, 2018

Yes. I think this issue is the closest open one to what you're running into: pydata/xarray#1194

Thanks! Next time I'll do the legwork on the issue search...

@bekozi
Copy link
Contributor Author

bekozi commented Mar 22, 2018

@jhamman I hit an issue in decode_cf. I worked a bit on constructing a Dataset using the ocgis interpretation of metadata, but I realized that using xarray's decoding facility would probably be better for xarray internals.

Anyway, I think the error is related to time bounds (it doesn't happen when I strip the time bounds variable from the dataset prior to decoding).

Error traceback:

======================================================================
ERROR: test_to_xarray (ocgis.test.test_ocgis.test_collection.test_field.TestField)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/benkoziol/Dropbox/NESII/project/ocg/git/ocgis/src/ocgis/test/test_ocgis/test_collection/test_field.py", line 455, in test_to_xarray
    xr = field.to_xarray()
  File "/home/benkoziol/Dropbox/NESII/project/ocg/git/ocgis/src/ocgis/collection/field.py", line 982, in to_xarray
    ret = decode_cf(ret)
  File "/home/benkoziol/anaconda2/envs/ocgis/lib/python2.7/site-packages/xarray/conventions.py", line 603, in decode_cf
    decode_coords, drop_variables=drop_variables)
  File "/home/benkoziol/anaconda2/envs/ocgis/lib/python2.7/site-packages/xarray/conventions.py", line 536, in decode_cf_variables
    stack_char_dim=stack_char_dim)
  File "/home/benkoziol/anaconda2/envs/ocgis/lib/python2.7/site-packages/xarray/conventions.py", line 471, in decode_cf_variable
    var = coder.decode(var, name=name)
  File "/home/benkoziol/anaconda2/envs/ocgis/lib/python2.7/site-packages/xarray/coding/times.py", line 348, in decode
    if 'units' in attrs and 'since' in attrs['units']:
TypeError: argument of type 'NoneType' is not iterable

This is the dataset to decode:

<xarray.Dataset>
Dimensions:      (bounds: 2, time: 3, x: 360, y: 180)
Coordinates:
  * x            (x) float64 -179.5 -178.5 -177.5 -176.5 -175.5 -174.5 ...
  * y            (y) float64 -89.5 -88.5 -87.5 -86.5 -85.5 -84.5 -83.5 -82.5 ...
  * time         (time) float32 1.0 2.0 3.0
Dimensions without coordinates: bounds
Data variables:
    xbounds      (x, bounds) float64 -180.0 -179.0 -179.0 -178.0 -178.0 ...
    ybounds      (y, bounds) float64 -90.0 -89.0 -89.0 -88.0 -88.0 -87.0 ...
    foo          (time, y, x) float32 12.999924 12.998706 12.996271 ...
    time_bounds  (time, bounds) float32 0.5 1.5 1.5 2.5 2.5 3.5

This time variable decodes okay:

<xarray.DataArray 'time' (time: 3)>
array([1., 2., 3.], dtype=float32)
Coordinates:
  * time     (time) float32 1.0 2.0 3.0
Attributes:
    axis:      T
    calendar:  standard
    units:     days since 0001-01-01 00:00:00
    bounds:    time_bounds

Here is the bounds variable causing an issue:

<xarray.DataArray 'time_bounds' (time: 3, bounds: 2)>
array([[0.5, 1.5],
       [1.5, 2.5],
       [2.5, 3.5]], dtype=float32)
Coordinates:
  * time     (time) float32 1.0 2.0 3.0
Dimensions without coordinates: bounds
Attributes:
    calendar:  standard
    units:     days since 0001-01-01 00:00:00

@jhamman
Copy link

jhamman commented Mar 23, 2018

I'm not able to reproduce this working off xarray@master. What happens if you try to decode the time_bounds variable by itself?

xr.decode_cf(ds['time_bounds'])

or if you try to set the time_bounds variable as a coordinate:

ds = ds.set_coords('time_bounds')
xr.decode_cf(ds)

For some reason, the attrs variable in decode() is None.

@bekozi
Copy link
Contributor Author

bekozi commented Mar 29, 2018

..sorry for the slow response...caught a human bug...

Thanks for trying to reproduce the issue. Looking at this again, this is something ocgis is introducing in the spatial coordinate bounds variable attributes during extrapolation. Here is the fix on this end. After changing this, the decoding works. Basically, it was setting the units on the bounds even if the units are none. So, now the conversion seems to be working!

I know we talked briefly about bounds variables and coordinates in xarray. The bounds are just extra variables in the dataset:

<xarray.Dataset>
Dimensions:             (bounds: 2, dim_0: 0, time: 3, x: 360, y: 180)
Coordinates:
  * x                   (x) float64 -179.5 -178.5 -177.5 -176.5 -175.5 ...
  * y                   (y) float64 -89.5 -88.5 -87.5 -86.5 -85.5 -84.5 ...
  * time                (time) object    1-01-02 00:00:00 ...
Dimensions without coordinates: bounds, dim_0
Data variables:
    xbounds             (x, bounds) float64 ...
    ybounds             (y, bounds) float64 ...
    foo                 (time, y, x) float32 ...
    ocgis_point         (y, x) object ...
    latitude_longitude  (dim_0) float64 ...
    time_bounds         (time, bounds) object ...

The coordinate system variable is also just transferred over with no special attributes. It would be possible to insert a PROJ or WKT description into its attributes. The ocgis_point variable is a 2D array of shapely point geometries.

I am thinking about cleaning this up a bit and pushing. Can you think of anything else that should be added in this first cut?

@jhamman
Copy link

jhamman commented Mar 30, 2018

I think this is a good start. I'm quite busy over the next few weeks and don't know if I'll have any real time to put it into action. I'm fine with your plan of providing a first cut as an experimental feature.

bekozi added a commit that referenced this issue Apr 3, 2018
Added first-cut "to_xarray" methods on dimensions, variables, collections,
and fields. Metadata handling is done by xarray using its "decode_cf"
capability. Limitations are listed in the documentation for the
"to_xarray" method on fields.
@bekozi
Copy link
Contributor Author

bekozi commented Apr 3, 2018

Initial capability is in master. I'm going to close this and as issues/features come up we can create new tix.

@bekozi bekozi closed this as completed Apr 3, 2018
@bekozi bekozi added this to the v2.2.0 milestone Apr 3, 2018
@bekozi
Copy link
Contributor Author

bekozi commented Sep 18, 2018

Reopening to capture internal changes related to xarray compatibility for xESMF spatial chunking and concurrent weight generation (JiaweiZhuang/xESMF#26, JiaweiZhuang/xESMF#29).

Major remaining issues are related to coordinate systems and spatial masking.

Branch: https://github.com/NCPP/ocgis/tree/i479-xarray-interop

@bekozi bekozi reopened this Sep 18, 2018
@bekozi
Copy link
Contributor Author

bekozi commented Jan 18, 2019

@huard, a good example to work from for xclim is this masking test for the xarray driver:

You can also use operations to subset the xarray field.

I expect a number of operations will fail as this is not exhaustively tested. There are numerous specialization points in the code that should be on the driver eventually. It's just a matter of collecting all the edge cases...

Time bounds calculations are performed when a grouping is computed to create a new time axis. The new_bounds variable contains the lower and upper bounds for each new climatology centroid as a (2, nt) numpy array.

Let me know if you have any questions!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants