Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Store axes as _ARRAY_DIMENSIONS #166

Closed
sbesson opened this issue Feb 4, 2022 · 11 comments
Closed

Store axes as _ARRAY_DIMENSIONS #166

sbesson opened this issue Feb 4, 2022 · 11 comments
Labels
invalid This doesn't seem right

Comments

@sbesson
Copy link
Member

sbesson commented Feb 4, 2022

Noticed while reviewing glencoesoftware/bioformats2raw#121, the ome_zarr.writer API currently does not store the axes metadata as _ARRAY_DIMENSIONS under the .zattrs of each resolution array.

Since this is a requirement of the spec,introduced in 0.3 - see https://ngff.openmicroscopy.org/0.3/#multiscale-md, the relevant writer APIs should be updated before releasing 0.3.0.

@joshmoore
Copy link
Member

👍 but if I remember correctly the status of my testing showed that attempting to access the zgroup with xarray led to an exception. The workaround suggested by @aurghs and @alexamici was to use differently named axes per resolution level > 0.

see: zarr-developers/zarr-specs#125
cc: @constantinpape

@sbesson
Copy link
Member Author

sbesson commented Feb 4, 2022

Cheers, I wasn't aware of the limitation and that's definitely something that should be tested with the writer addition.

If some the axes names needs to be resolution-aware in the _ARRAY_DIMENSIONS (I could imagine an implementation as simple as suffixing them with the resolution index i.e.x_1, x_2), I suspect this is something we should formalize in the specification sooner rather than later given that this metadata is a requirement.

@constantinpape
Copy link
Contributor

+1 but if I remember correctly the status of my testing showed that attempting to access the zgroup with xarray led to an exception. The workaround suggested by @aurghs and @alexamici was to use differently named axes per resolution level > 0.

Is there a concrete issue for this? I couldn't find any mention of this problem in the one you linked (zarr-developers/zarr-specs#125).

@will-moore
Copy link
Member

I'm still not entirely clear from the NGFF spec and the xarray page whether _ARRAY_DIMENSIONS goes on the top .zattrs alongside "multiscales" (and does xarray know how to find the arrays?) or does it go in the .zarray? It's not in any of the sample JSON on the spec page.
Ah - re-reading... does it go in a .zattrs (which we don't create currently) alongside the .zarray?

@sbesson
Copy link
Member Author

sbesson commented Feb 4, 2022

Ah - re-reading... does it go in a .zattrs (which we don't create currently) alongside the .zarray?

Yes exactly, this metadata is expected to be defined in the array attributes

@joshmoore
Copy link
Member

Is there a concrete issue for this?

Apologies, @constantinpape, no. This happened to date on the Zarr side with setting up the funding and getting B-Open up to speed.

@sbesson
Copy link
Member Author

sbesson commented Feb 7, 2022

I was able to reproduce the error with the following local changes

(base) sbesson@ls30630:ome-zarr-py (xarray_investigation) $ git diff  --cached
diff --git a/.isort.cfg b/.isort.cfg
index b3d36cb..367ece7 100644
--- a/.isort.cfg
+++ b/.isort.cfg
@@ -1,5 +1,5 @@
 [settings]
-known_third_party = dask,numpy,pytest,scipy,setuptools,skimage,zarr
+known_third_party = dask,numpy,pytest,scipy,setuptools,skimage,xarray,zarr
 multi_line_output = 3
 include_trailing_comma = True
 force_grid_wrap = 0
diff --git a/ome_zarr/writer.py b/ome_zarr/writer.py
index 83bbe6d..497ed32 100644
--- a/ome_zarr/writer.py
+++ b/ome_zarr/writer.py
@@ -206,8 +206,9 @@ def write_multiscale(
     datasets: List[dict] = []
     for path, data in enumerate(pyramid):
         # TODO: chunks here could be different per layer
-        group.create_dataset(str(path), data=data, chunks=chunks)
+        dataset = group.create_dataset(str(path), data=data, chunks=chunks)
         datasets.append({"path": str(path)})
+        dataset.attrs["_ARRAY_DIMENSIONS"] = [x["name"] for x in axes]
 
     if coordinate_transformations is None:
         shapes = [data.shape for data in pyramid]
diff --git a/requirements/requirements-test.txt b/requirements/requirements-test.txt
index 9cf8484..cc5774c 100644
--- a/requirements/requirements-test.txt
+++ b/requirements/requirements-test.txt
@@ -1,3 +1,4 @@
 pytest
 pytest-cov
 codecov
+xarray
diff --git a/tests/test_xarray.py b/tests/test_xarray.py
new file mode 100644
index 0000000..187681f
--- /dev/null
+++ b/tests/test_xarray.py
@@ -0,0 +1,14 @@
+import pytest
+import xarray
+
+from ome_zarr.data import create_zarr
+
+
+class TestXarray:
+    @pytest.fixture(autouse=True)
+    def initdir(self, tmpdir):
+        self.path = tmpdir.mkdir("data")
+        create_zarr(str(self.path))
+
+    def test_open_zarr(self):
+        xarray.open_zarr(str(self.path))

Running tox -e py38 tests/test_xarray.py fails with ValueError: conflicting sizes for dimension 'y': length 812 on '1' and length 1624 on {'y': '0', 'x': '0'}

The workaround mentioned in #166 (comment) i.e. suffixing the axes name using the resolution level is sufficient to let this simple test pass:

diff --git a/ome_zarr/writer.py b/ome_zarr/writer.py
index 497ed32..c5438a3 100644
--- a/ome_zarr/writer.py
+++ b/ome_zarr/writer.py
@@ -208,7 +208,7 @@ def write_multiscale(
         # TODO: chunks here could be different per layer
         dataset = group.create_dataset(str(path), data=data, chunks=chunks)
         datasets.append({"path": str(path)})
-        dataset.attrs["_ARRAY_DIMENSIONS"] = [x["name"] for x in axes]
+        dataset.attrs["_ARRAY_DIMENSIONS"] = ["%s_%s" %(x["name"], path) for x in axes]
 
     if coordinate_transformations is None:
         shapes = [data.shape for data in pyramid]

Discussing briefly with @joshmoore, I feel like we need to make a decision before we release OME-NGFF 0.4 and implement the feature across libraries. I can see three viable options:

  1. keep the specification as it currently stand i.e. https://github.com/ome/ngff/blob/12acbb69dba91da09f162eb806c59161601dc1c0/latest/index.bs#L267-L269, implement this feature across the board and file this as an xarray bug which needs to be upstream to support this metadata-style
  2. update the specify to mandate an alternate form like the one proposed above for storing xarray-compatible metadata in multiscales OME-NGFF datasets and update the implementations accordingly
  3. keep investigating options and solutions but reduce the the xarray-compatibility requirement from MUST to either SHOULD or even MAY level.

/cc @aurghs @alexamici @constantinpape happy to discuss on the image.sc zulip or get on a call if that would help resolving this outstanding issue.

@constantinpape
Copy link
Contributor

@joshmoore @sbesson I am still missing some context here about what the actual problem is. Could you please give me a few line TLDR?

@constantinpape
Copy link
Contributor

After some discussions on zulip: let's remove _ARRAY_DIMENSIONS for now and follow up with xarray to see how we can achieve compatability.

@sbesson
Copy link
Member Author

sbesson commented Feb 8, 2022

Closing for now, the requirement has been removed from the 0.4 specification. Various layout currently allow to use xarray together with OME Zarr datasets. Hopefully the discussion in ome/ngff#48 will converge towards a set of recommendations that can be reintroduced later allowing compatibillity.

@sbesson sbesson closed this as completed Feb 8, 2022
@alexamici
Copy link

Sorry for joining the party late. @aurghs and I are still getting up to speed with the microscopy data formats, we come from geospatial / satellite (GeoTIFF, HDF5, etc) and climate / scientific (netCDF / CF Conventions, GRIB, etc).

As identified in our first call with @joshmoore the main issues with in the interaction between the NGFF spec and Xarray is that Xarray Zarr backend assume a few structural features for the Zarr store that make it similar to a netCDF, and the NGFF appears to break some of those assumptions.

As soon as we feel comfortable with the microscopy data format we will engage the community more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
invalid This doesn't seem right
Projects
None yet
Development

No branches or pull requests

5 participants