Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ZEP 4: Metadata Conventions #28

Merged
merged 2 commits into from
Jun 29, 2023
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
267 changes: 267 additions & 0 deletions draft/ZEP0004.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,267 @@
---
layout: default
title: ZEP0004
description: Metadata Conventions
parent: draft ZEPs
nav_order: 4
---

# ZEP 4 — Metadata Conventions

Authors:
* Ryan Abernathey ([@rabernat](https://github.com/rabernat)), Earthmover PBC

Status: Draft

Type: Process

Created: 2023-01-03

## Abstract

This ZEP describes how communities can standardize conventions around metadata and layout of Zarr data
using user-defined attributes in order to meet domain-specific application needs without changes to the
core data model and specification, and without specification extensions.

## Motivation and Scope

Zarr is a highly general array storage format. It is used widely in diverse domains, from genomics and microscopy to oceanography and climate science.
While all of these domains use multi-dimensional arrays, each has distinct practices around how data and metadata are stored, queried, and processed.
Supporting these domain-specific needs is out of scope for Zarr implementations; instead they should be met by higher-level libraries which may depend on Zarr for their I/O.

However, these domains may still benefit from standardization of the way their data and metadata are stored in Zarr.
For example, they may expect that
- arrays have specific names
- specific attributes be present in arrays and groups and conform to certain constraints
- groups are structured with a specific hierarchy, with defined relationships between different arrays in a group

Such standardization helps interoperability across different languages and packages.

This ZEP aims to formalize the concept of a Zarr **convention**, a mechanism for a domain community to
document a standard way of storing their specific data model in Zarr.
As of the time of this draft, many such _ad-hoc_ conventions are already in use across the Zarr landscape.
A goal of this ZEP is to formalize this existing practice and enhance coordination around conventions.

Conventions must fit completely within the Zarr data / metadata model of groups, arrays, and attributes thereof, requiring
no changes or extension to the specification.
A Zarr implementation itself should not even be aware of the existence of the convention.
The line between a convention and an extension may be blurry in some cases.
The key distinction lies in the implementation: the responsibility for interpreting a _convention_ relies completely with downstream,
domain-specific software, while an _extension_ must be handled by the Zarr implementation itself.
A good rule of thumb is that a user should be able to safely ignore the convention and still be able to interact with the data via the core Zarr library,
even if some domain-specific context or functionality is missing.
If the data are completely meaningless or unintelligible without the convention, then it should be an extension instead.

It seems likely that some features that are first supported as conventions may be promoted to extensions if they are deemed sufficiently general.

Conventions can also help users switch between different storage libraries more flexibly.
Since Zarr and HDF5 implement nearly identical data models, a single convention can be applied to both formats.
This allows downstream software to maintain better separation of concerns between storage and domain-specific logic.

Conventions are modular and composable. A single group or array can conform to multiple conventions.


## Usage and Impact

We demonstrate the usage and impact with a simple example: unit encoding.

A convention around unit encoding might look like this

> **Convention Name**: "units-v1"
>
> Information about the units of a Zarr array should be stored in the `units` attribute and encoded as strings using
> the [MIXF-08](http://people.csail.mit.edu/jaffer/MIXF/MIXF-08) standard.

To create data with this convention using zarr-python, one would write

```python
import zarr
z = zarr.open(
'example_with_units.zarr', mode='w', shape=(10000, 10000), chunks=(1000, 1000), dtype='f4'
)
z.attrs['units'] = 'm^2'
z.attrs['zarr_conventions'] = ["units-v1"]
```

Reading back the data with zarr-python would work just fine. However, reading the data with a unit-aware package would enable automatic decoding.
This is possible today with [Pint](https://pint.readthedocs.io/), [Xarray](https://xarray.pydata.org/), and [Pint-Xarray](https://pint-xarray.readthedocs.io/)

```python
import xarray as xr
import pint_xarray
ds = xr.open_dataset('example_with_units.zarr', engine='zarr').pint.quantify()
```

Now the variables have been decoded to unit-aware Pint arrays, based on the unit information encoded via the conventions.
Likewise, we can serialize the data using the convention back to Zarr automatically.

```python
ds.dequantify().to_zarr("array_with_units.zarr")
```

None of this requires any awareness of units on part of Zarr itself.

## Backward Compatibility

Conventions by definition are fully backwards and forwards compatible with all versions of Zarr, including both V2 and V3.
They require no changes to the spec or extensions and can be ignored by Zarr implementations.

Existing Zarr data conforming to _ad-hoc_ conventions that predate this ZEP can be supported via the **legacy conventions** mechanism described below.
New Zarr data written after this ZEP has been accepted should use the **explicit conventions** approach.

## Detailed description

This ZEP itself describes the structure and process by which conventions may be proposed, discussed, accepted, and published by the Zarr community.
This process is intended to be much more lightweight and informal than a spec change, which requires a ZEP.

Conventions will be described by a _convention document_.
Conventions documents will be added to the <https://zarr-specs.readthedocs.io/> website as a new top-level heading
and a corresponding folder will be created in the `zarr-developers/zarr-specs` repo.
A convention document template, which accompanies this ZEP, will be added to that folder.
The template is intentionally simple.

### Identifying a Convention

In its convention document, a must should declare itself as either an **explicit convention** or a **legacy convention**.

#### Explicit Convention

The preferred way of identifying the presence of a convention in a Zarr group or array is via the attribute `zarr_conventions`.
This attribute must be an array of strings; each string is an identifier for the convention.
Multiple conventions may be present.

For example, a group metadata JSON document with conventions present might look like this

```
{
"zarr_format": 3,
"node_type": "group",
"attributes": {
"zarr_conventions": ["units-v1", "foo],
rabernat marked this conversation as resolved.
Show resolved Hide resolved
}
}
```

where `units-v1` and `bar` are the convention identifiers.

#### Legacy Convention

A legacy convention is a convention already in use that predates this ZEP.
Data conforming to legacy conventions will not have the `zarr_conventions` attribute.
The conventions document must therefore specify how software can identify the presence of the convention through a series of rules or tests.

For those comfortable with the terminology, legacy conventions can be thought of as a "conformance class" and a corresponding "conformance test".

An example of a legacy convention might be existing Zarr data written following [CF Conventions](https://cfconventions.org/Data/cf-conventions/cf-conventions-1.10/cf-conventions.html).
Such data will have the group attribute `conventions` set to the value `CF-1.10` (or perhaps a different version number).
This forms the basis for a test for whether the group conforms to the convention.

#### Namespacing

Conventions may choose to store their attributes on a specific namespace.
This ZEP does not specify how namespacing works; that is up to the convention.
For example, the namespace may be specified as a prefix on attributes, e.g.

```
{
"attributes": {"units-v1:units": "m^2"}
}
```

or via a nested JSON object, e.g.

```
{
"attributes": {"units-v1": {"units: "m^2"}}
}
```

The use of namespacing is optional and is up to the convention to decide.

#### Versioning

There may be multiple versions of a convention.
It is recommended for a convention to explicitly declare its version.
For an explicit convention, the version identifier may be encoded into the convention identifier string, but this is not required.
The convention document should specify how to identify the convention version.

### New Convention Process

New conventions are proposed via a pull-request to the `zarr-specs` repo which adds a new conventions document.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not clear what the pull-request contains (a domain specific convention document ?). If it contains a convention document, a template should describe its structure ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, a template is needed. I need to finish this PR.

If the convention is already documented elsewhere, the convention document can just contain a reference to the external documentation.
The author of the PR is expected to convene the relevant domain community to review and discuss the ZEP.
This includes posting a link to the PR on relevant forums, mailing lists, and social-media platforms.

The goal of the discussion is to reach a _consensus_ among the domain community regarding the convention.
The Zarr steering council, together with the PR author, will determine if a consensus has been reached, at which point the PR
can be merged and the convention published on the website.
If a consensus cannot be reached, the steering council may still decide to publish the convention, accompanied by a
disclaimer that it is not a consensus, and noting any objections that were raised during the discussion.

It is also possible that multiple, competing conventions exist in the same domain. While not ideal, it's not up to
the Zarr community to resolve such domain-specific debates.
These conventions should still be documented in a central location, which hopefully helps move towards alignment.

### Updating a Convention

Conventions should be versioned using incremental integers, starting from 1.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me, reading these two sentences leaves a certain tension. Can we clarify if the first is really a SHOULD or is it more a MAY?

Or, if the community already has an existing versioning system for their convention, that can be used instead (e.g. CF conventions).
The community is free to update their convention via a pull request using the same consensus process described above.
The conventions document should include a changelog.
Details of how to manage changes and backwards compatibility are left to the domain community.

## Related Work

This proposal is inspired heavily by the way Zarr is already used today in the weather and climate community.
The Xarray software package is built around the [NetCDF Data Model](https://docs.unidata.ucar.edu/netcdf-c/current/netcdf_data_model.html).
The original netCDF4 format is implemented on top of HDF5.
When implementing a Zarr reader / writer for Xarray, the developers found a way to map the NetCDF data model to Zarr without requiring any changes to Zarr itself. This is a de-facto convention.

The mapping between Zarr and NetCDF enables further layering of conventions. Specifically, the Climate-Forecast conventions
([CF Conventions](https://cfconventions.org/Data/cf-conventions/cf-conventions-1.10/cf-conventions.html)) sit on top of the NetCDF data model.
It's useful to quote from the CF conventions, as they provide a good summary of the value of conventions more broadly:

> The netCDF interface enables but does not require the creation of self-describing datasets. The purpose of the CF conventions is to require conforming datasets to contain sufficient metadata that they are self-describing in the sense that each variable in the file has an associated description of what it represents, including physical units if appropriate, and that each value can be located in space (relative to earth-based coordinates) and time.
>
> An important benefit of a convention is that it enables software tools to display data and perform operations on specified subsets of the data with minimal user intervention. It is possible to provide the metadata describing how a field is located in time and space in many different ways that a human would immediately recognize as equivalent. The purpose in restricting how the metadata is represented is to make it practical to write software that allows a machine to parse that metadata and to automatically associate each data value with its location in time and space. It is equally important that the metadata be easy for human users to write and to understand.

By separating the conventions from the underlying storage container, we enable more modularity in a domain's software ecosystem.
With conventions, Zarr can be substituted more easily for HDF5 with minimal changes to the rest of the software stack.

There are already many conventions operating today in Zarr. Prominent examples include

- [Open Microscopy Next Generation File Format](https://ngff.openmicroscopy.org/) (OME-NGFF) and its implementation in the [ome-zarr-py](https://ome-zarr.readthedocs.io/) package
- The [Xarray Zarr dimension encoding](https://docs.xarray.dev/en/stable/internals/zarr-encoding-spec.html) specification, implemented as part of the mapping of Zarr to NetCDF
- The [GDAL Zarr Driven](https://gdal.org/drivers/raster/zarr.html) and its encoding of spatial coordinate reference systems in user attributes
- The proposed [GeoZARR Spec](https://github.com/christophenoel/geozarr-spec). This spec is incompatible with GDAL's implementation
(see [github issue](https://github.com/christophenoel/geozarr-spec/issues/3)), which highlights the need for a central repository of conventions.

## Implementation

Implementation of this ZEP requires only changes to the zarr-specs website.

## Alternatives

The main alternative to the concept of conventions is the concept of an optional extension, discussed in ZEP0001.
All of the features described above could be implemented as spec extensions instead.
Then the responsibility for parsing the extensions and implementing the expected features
(e.g. creating geographical coordinate reference systems, decoding units) would have to be handled by the Zarr implementation.
This would lead to potentially unlimited scope increases for the zarr core libraries.

## Discussion

## License

<p xmlns:dct="http://purl.org/dc/terms/">
<a rel="license"
href="http://creativecommons.org/publicdomain/zero/1.0/">
<img src="https://licensebuttons.net/p/zero/1.0/80x15.png" style="border-style: none;" alt="CC0" />
</a>
<br />
To the extent possible under law,
<a rel="dct:publisher"
href="https://github.com/zarr-developers/zeps">
<span property="dct:title">the authors</span></a>
have waived all copyright and related or neighboring rights to
<span property="dct:title">ZEP 4</span>.
</p>