This repository has been archived by the owner on Oct 24, 2024. It is now read-only.

Having a custom engine for open_mfdatatree #55

Closed
mraspaud opened this issue Dec 21, 2021 · 8 comments
Labels
IO Representation of particular file formats as trees

Comments

@mraspaud

Hi @TomNicholas !

I am one of the core devs of satpy (https://github.com/pytroll/satpy), which makes use of xarray/dask to handle satellite data for earth-observing satellites.
In this context, we often have satellite data with different resolutions within the same dataset. xarray's Dataset can't really be used for such data, because the coords of the different variables don't match, so DataTree makes a lot of sense for us.

The satellite data is, more often than not, in some binary format. We read it and convert it to xarray.DataArrays, and I have now started experimenting with placing them in a DataTree by hand.
So it would be really nice if there was an interface for adding custom engines to read that data (multiple files). Did you already consider that? Do you maybe already have an idea on how this would work?

We have been wanting to stick closer to the data model of xarray in our library, and datatree looks like something we could really use :) Let's hope we can contribute here in the future, at least with ideas.

@TomNicholas
Member

TomNicholas commented Dec 21, 2021

Hi @mraspaud , thanks so much for your interest!

> So it would be really nice if there was an interface for adding custom engines to read that data (multiple files).

Some initial thoughts:

  1. Can you already open one DataArray/Dataset by hand with open_dataset/open_datatree from your data format? Then you could pretty easily write your own open function to stack all of those into a tree. Given the prototype status of datatree, this might be the best option for now. (If not then for an example of opening custom binary formats into xarray you might be interested in xmitgcm.)

  2. Can you already plug a custom engine for your data into open_dataset? Perhaps that interface can be extended to handle multiple files...

  3. The next big step for me with DataTree is to write a detailed design doc, and then get input from potential users like you, before rewriting datatree and eventually integrating into xarray upstream. This would be a great point to really hash out the details of an interface to read data from multiple files.

Tagging @jhamman for his backends expertise too!

EDIT: Related to #51
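
A minimal sketch of what option 1 could look like, under stated assumptions: `collect_tree_nodes`, `open_one` and `node_for` are hypothetical names, not part of xarray or datatree. The reader is passed in as a callable so the sketch does not depend on any particular file format; in practice it might wrap `xr.open_dataset` or a format-specific reader.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, List


def collect_tree_nodes(
    paths: Iterable[str],
    open_one: Callable[[str], Any],
    node_for: Callable[[Any], str],
) -> Dict[str, List[Any]]:
    """Open every file and group the opened objects by target tree node.

    paths:    files that together make up one logical dataset
    open_one: hypothetical reader returning one Dataset-like object per file
    node_for: maps an opened object to a node path such as "/3000"
    """
    nodes: Dict[str, List[Any]] = defaultdict(list)
    for path in paths:
        obj = open_one(path)
        nodes[node_for(obj)].append(obj)
    # Return a plain dict so a missing key raises instead of silently
    # creating an empty node.
    return dict(nodes)
```

Each list could then be combined into a single Dataset (for example with `xr.merge`), and the resulting mapping of node paths to Datasets handed to `DataTree.from_dict` to build the tree.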

@mraspaud
Author

  1. Yes, I have done that and it works fine.

E.g.:

 DataTree('root')
 ├── DataTree('3000')
 │   Dimensions:  (y: 3712, x: 3712)
 │   Coordinates:
 │       crs      object PROJCRS["unknown",BASEGEOGCRS["unknown",DATUM["unknown",E...
 │     * y        (y) float64 -5.566e+06 -5.563e+06 -5.56e+06 ... 5.566e+06 5.569e+06
 │     * x        (x) float64 5.566e+06 5.563e+06 5.56e+06 ... -5.566e+06 -5.569e+06
 │   Data variables:
 │       VIS006   (y, x) uint16 dask.array<chunksize=(464, 3712), meta=np.ndarray>
 │       VIS008   (y, x) uint16 dask.array<chunksize=(464, 3712), meta=np.ndarray>
 │       IR_016   (y, x) uint16 dask.array<chunksize=(464, 3712), meta=np.ndarray>
 │       IR_039   (y, x) uint16 dask.array<chunksize=(464, 3712), meta=np.ndarray>
 │       WV_062   (y, x) uint16 dask.array<chunksize=(464, 3712), meta=np.ndarray>
 │       WV_073   (y, x) uint16 dask.array<chunksize=(464, 3712), meta=np.ndarray>
 │       IR_087   (y, x) uint16 dask.array<chunksize=(464, 3712), meta=np.ndarray>
 │       IR_097   (y, x) uint16 dask.array<chunksize=(464, 3712), meta=np.ndarray>
 │       IR_108   (y, x) uint16 dask.array<chunksize=(464, 3712), meta=np.ndarray>
 │       IR_120   (y, x) uint16 dask.array<chunksize=(464, 3712), meta=np.ndarray>
 │       IR_134   (y, x) uint16 dask.array<chunksize=(464, 3712), meta=np.ndarray>
 │   Attributes:
 │       SatelliteStatus:              {'SatelliteDefinition': {'SatelliteId': 324...
 │       ImageAcquisition:             {'PlannedAcquisitionTime': {'TrueRepeatCycl...
 │       CelestialEvents:              {'CelestialBodiesPosition': {'PeriodTimeSta...
 │       ImageDescription:             {'ProjectionDescription': {'TypeOfProjectio...
 │       RadiometricProcessing:        {'RPSummary': {'RadianceLinearization': arr...
 │       GeometricProcessing:          {'OptAxisDistances': {'E-WFocalPlane': arra...
 │       15TrailerVersion:             0
 │       ImageProductionStats:         {'SatelliteId': 324, 'ActualScanningSummary...
 │       NavigationExtractionResults:  {'ExtractedHorizons': {'HorizonId': array([...
 │       RadiometricQuality:           {'L10RadQuality': {'FullImageMinimumCount':...
 │       GeometricQuality:             {'AbsoluteAccuracy': {'QualityInfoValidity'...
 │       TimelinessAndCompleteness:    {'Timeliness': {'MaxDelay': 20.589, 'MinDel...
 └── DataTree('1000')
     Dimensions:  (y: 11136, x: 11136)
     Coordinates:
         crs      object PROJCRS["unknown",BASEGEOGCRS["unknown",DATUM["unknown",E...
       * y        (y) float64 -5.566e+06 -5.565e+06 -5.564e+06 ... 5.57e+06 5.571e+06
       * x        (x) float64 5.566e+06 5.565e+06 5.564e+06 ... -5.57e+06 -5.571e+06
     Data variables:
         HRV      (y, x) uint16 dask.array<chunksize=(464, 1804), meta=np.ndarray>
     Attributes:
         SatelliteStatus:              {'SatelliteDefinition': {'SatelliteId': 324...
         ImageAcquisition:             {'PlannedAcquisitionTime': {'TrueRepeatCycl...
         CelestialEvents:              {'CelestialBodiesPosition': {'PeriodTimeSta...
         ImageDescription:             {'ProjectionDescription': {'TypeOfProjectio...
         RadiometricProcessing:        {'RPSummary': {'RadianceLinearization': arr...
         GeometricProcessing:          {'OptAxisDistances': {'E-WFocalPlane': arra...
         15TrailerVersion:             0
         ImageProductionStats:         {'SatelliteId': 324, 'ActualScanningSummary...
         NavigationExtractionResults:  {'ExtractedHorizons': {'HorizonId': array([...
         RadiometricQuality:           {'L10RadQuality': {'FullImageMinimumCount':...
         GeometricQuality:             {'AbsoluteAccuracy': {'QualityInfoValidity'...
         TimelinessAndCompleteness:    {'Timeliness': {'MaxDelay': 20.589, 'MinDel...
 Load time: 0:00:03.257307
  2. No, I haven't tested that yet. Most formats consist of multiple interdependent files, so I haven't investigated the single-file option.

  3. Sounds good, we'll be happy to provide feedback!

@jhamman

jhamman commented Jan 3, 2022

> 1. Can you already open one DataArray/Dataset by hand with open_dataset/open_datatree from your data format? Then you could pretty easily write your own open function to stack all of those into a tree. Given the prototype status of datatree, this might be the best option for now. (If not then for an example of opening custom binary formats into xarray you might be interested in xmitgcm.)

+1 on this being the current recommendation. Hierarchical datasets conform to a number of semantic linking conventions and, at least at this point, I would recommend writing custom openers for each dataset/convention. I think we'll learn a lot from the implementation of these custom openers, and as @alexamici mentions in pydata/xarray#1982, there are some emerging standards that we may be able to leverage in some generic openers.

@TomNicholas added the IO Representation of particular file formats as trees label on May 18, 2022
@mgrover1
Contributor

mgrover1 commented Sep 22, 2022

Hey y'all (@TomNicholas), we have some custom engines for radar data in our xradar package, with which we can read data as follows:

import xarray as xr
import xradar

ds = xr.open_dataset("radar_file.nc", group='sweep_0', engine='cfradial1')

However, we cannot use this engine with datatree directly yet, since it is not one of the registered engines:

import datatree as dt

dt.open_datatree("radar_file.nc", engine='cfradial1')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [6], line 1
----> 1 dt.open_datatree(filename, engine='cfradial1')

File ~/miniforge3/envs/xradar-dev/lib/python3.10/site-packages/datatree/io.py:60, in open_datatree(filename_or_obj, engine, **kwargs)
     58     return _open_datatree_netcdf(filename_or_obj, engine=engine, **kwargs)
     59 else:
---> 60     raise ValueError("Unsupported engine")

ValueError: Unsupported engine

What is the best way of adding our new engines so we can load these datasets into a datatree?

Here is a full example with our working functionality and API

@TomNicholas
Member

Hi @mgrover1!

Quick Q: If the file is .nc then what is your custom engine doing?

> What is the best way of adding our new engines so we can load these datasets into a datatree?

The most general way would be to extend xarray's backend entrypoint system to support open_datatree, but we can't do this until datatree is integrated in xarray upstream.

In the meantime I guess we could add another special case to datatree/io.py? Unless you have another suggestion?
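
To make the trade-off concrete, here is a rough sketch of a name-to-opener registry as an alternative to adding one special case per engine. This is not datatree's actual `io.py`: `register_datatree_engine` and `_DATATREE_ENGINES` are hypothetical names, and only the `open_datatree(filename_or_obj, engine, **kwargs)` signature and the "Unsupported engine" error are taken from the traceback above.

```python
from typing import Any, Callable, Dict

# Hypothetical registry mapping an engine name to a function that
# builds a tree from a file.
_DATATREE_ENGINES: Dict[str, Callable[..., Any]] = {}


def register_datatree_engine(name: str, opener: Callable[..., Any]) -> None:
    """Make `opener` available under `engine=name`."""
    _DATATREE_ENGINES[name] = opener


def open_datatree(filename_or_obj: Any, engine: str, **kwargs: Any) -> Any:
    """Dispatch to the registered opener, or fail like the original check."""
    try:
        opener = _DATATREE_ENGINES[engine]
    except KeyError:
        raise ValueError(f"Unsupported engine: {engine!r}") from None
    return opener(filename_or_obj, **kwargs)
```

A third-party package could then call `register_datatree_engine("cfradial1", its_own_opener)` at import time. The longer-term route is still the one described above, extending xarray's backend entrypoint system, which would make a registry like this unnecessary.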

@mgrover1
Contributor

@TomNicholas - though these files are netcdf, they are a specific type of netcdf (cfradial), which has additional hierarchical metadata that we then use to parse the file into groups and such. Also, this is just one of the file types supported by the package. Other readers include cfradial2 and odim_h5, and we plan on adding several more.
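
Since the engine already works with `open_dataset` on a per-group basis (as in the `group='sweep_0'` call earlier in the thread), one interim approach might be to open each group separately and assemble the tree from the results. The sketch below is only an illustration: `open_groups_as_nodes` is a hypothetical helper, and any group names beyond `sweep_0` are assumptions about the file layout.

```python
from typing import Any, Callable, Dict, Iterable


def open_groups_as_nodes(
    filename: str,
    groups: Iterable[str],
    engine: str,
    open_dataset: Callable[..., Any],
) -> Dict[str, Any]:
    """Open each named group and key the result by its node path.

    `open_dataset` is injected so the sketch stays independent of any
    backend; with xarray installed it would be `xr.open_dataset`.
    """
    return {
        f"/{group}": open_dataset(filename, group=group, engine=engine)
        for group in groups
    }
```

With xarray and xradar installed, the call would look like `open_groups_as_nodes("radar_file.nc", ["sweep_0"], "cfradial1", xr.open_dataset)`, and the returned mapping could be passed to `DataTree.from_dict`.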

@aladinor

This issue is related to adding a new backend to open cfradial files (weather radar files). I think there is an implementation here

https://github.com/openradar/xradar/blob/bd774ba7aec767db78c8c9035518010e351de1d7/xradar/io/backends/cfradial1.py#L303

@mgrover1 or @kmuehlbauer, do you think we can close this?

@eni-awowale

This seems to have been resolved so I will go ahead and close this issue.


6 participants