ZEP 4: Metadata Conventions #28
cc @briannapagan (in case this is of interest to you) |
This is directly relevant to the forthcoming geozarr work, so that's why I wanted to push it out in draft form |
Awesome, thanks a lot! As mentioned in zarr-developers/zarr-specs#169, this is also relevant for the issues zarr-developers/zarr-specs#139 and zarr-developers/zarr-specs#144. I like that this is becoming a separate ZEP, it never occurred to me to separate this from ZEP 1. |
The current proposal just allows a group/array to have a single convention. Perhaps for some use cases that makes sense. But the example in the proposal is "units", which could easily interoperate with numerous other possible conventions. Instead the naming of attributes could be done to allow multiple conventions to be used at once, for example: {"zarr_convention": {"units-v1": {"units": "m^2"}, ...} or {"units-v1": {"units": "m^2"}, ...} or {"units-v1": "m^2", ...} |
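The three layouts suggested above can be compared side by side. The following is only a minimal sketch: plain Python dicts stand in for a group's or array's attributes, and the helper `find_units` is hypothetical (not part of any Zarr API). It shows how a reader might look up the same value under each layout.

```python
# Three attribute layouts from the comment above, written as plain dicts
# (stand-ins for the attributes of a Zarr group or array).
nested_under_key = {"zarr_convention": {"units-v1": {"units": "m^2"}}}
convention_as_key = {"units-v1": {"units": "m^2"}}
convention_as_value = {"units-v1": "m^2"}


def find_units(attrs, convention="units-v1"):
    """Return the units string under any of the three layouts, else None.

    Hypothetical helper for illustration only.
    """
    # Layout 1: conventions grouped under a "zarr_convention" key.
    entry = attrs.get("zarr_convention", {})
    if isinstance(entry, dict) and convention in entry:
        return entry[convention].get("units")
    # Layouts 2 and 3: the convention name is itself a top-level key.
    value = attrs.get(convention)
    if isinstance(value, dict):
        return value.get("units")
    if isinstance(value, str):
        return value
    return None


for layout in (nested_under_key, convention_as_key, convention_as_value):
    print(find_units(layout))
```

In all three layouts several conventions can coexist in one set of attributes; they differ in whether the original attribute name (`units`) is preserved.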
Very good point Jeremy. TBH, I'm on the fence about whether the convention even needs to be explicitly identified. Like, maybe it could be enough to say
Of the proposals above, I definitely favor the first one because it doesn't touch the name of the actual attribute. I could also imagine
|
Strongly support this concept. Question: |
I do believe it's useful that, once a convention is listed and given a name, it is explicitly mentioned in the attributes of the data that uses it. |
I think that it's natural for conventions to arise organically. Once there is sufficient alignment and adoption, they can be proposed as conventions. For conventions created in the wild, or borrowed from other formats (e.g. CF Conventions), it could be hard to require the presence of the |
Recommended, but not required? |
Thanks for this, @rabernat! I very much look forward to having the conventions list on zarr-specs. 👍
### Updating a Convention

Conventions should be versioned using incremental integers, starting from 1.
For me, reading these two sentences leaves a certain tension. Can we clarify if the first is really a SHOULD or is it more a MAY?
draft/ZEP0004.md
'example_with_units.zarr', mode='w', shape=(10000, 10000), chunks=(1000, 1000), dtype='f4'
)
z.attrs['units'] = 'm^2'
z.attrs['zarr_convention'] = "units-v1"
For me, this is the most (or really only) potentially contentious line in the ZEP. Similar to the ZEP1 discussion last night, I could see discussing at least:
- if `zarr_` is a generally special prefix or if `zarr_convention` is a special string
- if it's possible to have more than one convention active in a zgroup/zarray at a time
- whether other values (like URLs) are acceptable as the values of the convention string
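To make these questions concrete, here is a rough sketch of how a reader might parse the value of `zarr_convention`. It assumes identifiers shaped like `<name>-v<integer>` (as in `units-v1`, with versions starting from 1) and assumes the value may be either one string or a list of strings. The list form and the function name `parse_conventions` are illustrative assumptions, not something the draft specifies.

```python
import re

# Assumed identifier shape "<name>-v<integer>", e.g. "units-v1".
_CONVENTION_ID = re.compile(r"^(?P<name>.+)-v(?P<version>[1-9][0-9]*)$")


def parse_conventions(value):
    """Parse a 'zarr_convention' value into a list of (name, version) pairs.

    Accepts a single string or a list of strings (the list form is an
    assumption). Strings that do not match the assumed shape, such as
    URLs, raise ValueError.
    """
    items = [value] if isinstance(value, str) else list(value)
    parsed = []
    for item in items:
        match = _CONVENTION_ID.match(item)
        if match is None:
            raise ValueError(f"not a '<name>-v<N>' identifier: {item!r}")
        parsed.append((match["name"], int(match["version"])))
    return parsed


print(parse_conventions("units-v1"))
print(parse_conventions(["units-v1", "labels-v2"]))
```

Under this sketch a URL-valued identifier is rejected, which is exactly the kind of decision the third bullet asks the ZEP to make explicit.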
I'm also wondering if this is needed at all. Instead, I'd rather just specify that the field `units` has a metadata convention. Nobody is forced to follow this convention, and existing arrays with such a field would potentially follow it automatically.
I'd propose to remove the
{
"units": "m^2",
"writer": "zarr-python",
"origin": [12300, 45600],
"convention-key": "valuenotfollowingtheconvention",
"some-other-key": "foo",
…
}
I see the conventions as a good place to discuss & establish standards between implementations, not as a strict mechanism that must be enforced. Also, IMO it can't be enforced, having the |
About specifying the convention: I think it's really quite useful to know which specific convention is being used and to version them. For example, if two groups want to use the In I'd also agree with the point made above that only allowing a single convention may be limiting. Maybe instead conventions could be stored like:
|
I tend to agree with Isaac. Maybe it helps to think through (and/or specify) what processing looks like. I've been debating whether to bring up my favorite soapbox (JSON-LD) as a way of specifying such metadata that already exists, has processing rules, etc. e.g.:
Obviously, there are a lot of different edge cases there, but I do like the idea of not building our own. |
From a practical point of view, it may be simply impossible to impose a hard requirement for convention identifiers. A big use case for us is transcoding NetCDF / HDF5 data that already exists into Zarr. This data was written 10+ years ago and the metadata is what it is. I think the way forward is for me to put together the template referred to above. This should have a section on "How to identify this convention". |
Is it not an option to convert the metadata at the same time as the data conversion happens? It seems that during this data conversion is when you would have the most context for decoding any metadata. |
Totally fair. Without any sort of identifier or format requirement this seems to me like a listing of conventions used with zarr. If this is the direction, I wonder if even the "consensus" requirement for new conventions could be softened to "noteworthy" or removed. If no namespace is being reserved, then I don't think the zarr team needs to take on the responsibility of figuring out if a field has consensus on a file format. Especially since it's so easy to break consensus. For instance, of the given examples aren't Xarray Zarr, GDAL, and GeoZarr competing? I think there's definitely value in collecting lists of conventions building on top of zarr, and making that visible. However, I wonder if doing more than that (like establishing credentials based on consensus/ use) is something better left to standards repositories like fairsharing.org? |
Really good points @ivirshup. Yes, we definitely don't want to give ourselves (zarr developers) the job of mediating standards in different scientific domains. The only intention here is to provide a means to document an existing convention, not do any sort of evaluation or approval. I'll modify to reflect that. |
On my end, I'm most interested in attributes that are relevant to general purpose tools like Neuroglancer, e.g. things like units, different types of labels. If there is no way to unambiguously identify the metadata then it is much more complicated to make use of it. |
I think namespacing would be a good idea. In the OME-Zarr context, we are thinking about wrapping all the OME-specific metadata under the |
I didn't realise it was a pull request, sorry :) :) |
My main concern is the lack of a concept for grouping conventions for a specific purpose, profile, or topic. This would provide greater flexibility for client applications to selectively support conventions, while still enabling interoperability with other Zarr implementations. In some domains (e.g. Earth Observation), there may be hundreds of conventions, and client applications may only address subsets of those conventions. Similar to how OGC APIs Implementation Standards are written, I suggest introducing "requirement-classes" (or "convention-classes") as a means of decoupling a set of domain conventions into groups that can be advertised in Zarr as supported or not. For example:
Furthermore, I do not see a clear indication in the process of how the conventions become listed on https://zarr-specs.readthedocs.io/ under a specific domain section. Regards, |
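One way to picture the "convention-classes" idea is sketched below. Everything in it is hypothetical: the attribute name `convention_classes`, the class names, and the helper `split_classes` are invented for illustration and do not come from the ZEP or from any OGC document. The sketch only shows a client checking what a dataset advertises against what it supports.

```python
# Hypothetical: a dataset advertises the convention classes it follows,
# and a client declares the subset of classes it can interpret.
dataset_attrs = {"convention_classes": ["eo-core", "eo-projection", "eo-quality"]}
CLIENT_SUPPORTED = {"eo-core", "eo-projection"}


def split_classes(attrs, supported):
    """Return (usable, ignored) class names, preserving the advertised order."""
    advertised = attrs.get("convention_classes", [])
    usable = [c for c in advertised if c in supported]
    ignored = [c for c in advertised if c not in supported]
    return usable, ignored


usable, ignored = split_classes(dataset_attrs, CLIENT_SUPPORTED)
print("usable:", usable)
print("ignored:", ignored)
```

A client written this way can still open the data with any Zarr implementation and simply skip the classes it does not understand.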
### New Convention Process

New conventions are proposed via a pull-request to the `zarr-specs` repo which adds a new conventions document.
It is not clear what the pull-request contains (a domain-specific convention document?). If it contains a convention document, should a template describe its structure?
Yes, a template is needed. I need to finish this PR.
Thanks for everyone's patience, and apologies for being slow to finish up this draft. I plan to prioritize this over the next few weeks. I'll respond to the comments above and push a new draft that incorporates the feedback. |
I have finally updated this ZEP. Thanks everyone for the patience. In my update, I incorporated the following changes
My goal here is to include the very good ideas that have been proposed in the discussion above as recommended best practices while retaining the ability to support legacy conventions and practices already in use in the community. |
Co-authored-by: Norman Rzepka <[email protected]>
Thanks for completing it, @rabernat, and everyone for reviewing this. ZEP0004 is live here: https://zarr.dev/zeps/draft/ZEP0004.html. |
@MSanKeys963, did a discussion get opened for this? |
Hi Isaac! I believe it's on me to open a PR where the discussion will happen. I will try to do that today. |
👍 Is it not meant to be a Discussion (as opposed to a PR/ Issue)? I think this kind of discussion heavily benefits from threading. Maybe this changed: #27 ? |
TBH I think the ZEP process still has a lot of details to be ironed out. I agree a discussion makes sense. |
This idea is excellent; I would love to help push this forward. What is the best way to collaborate? |
@tasansal - the discussion is continuing in zarr-developers/zarr-specs#262 The best way to collaborate would be to share your use cases there. |
This ZEP describes how communities can standardize conventions around metadata and layout of Zarr data
using user-defined attributes in order to meet domain-specific application needs without changes to the
core data model and specification, and without specification extensions.