RFC-2: Zarr v3 #227
This pull request has been mentioned on Image.sc Forum. There might be relevant details there: |
I think moving to zarr v3 is a great step. I would be happy to "endorse" this RFC. Thanks, @normanrz ! |
Does the sharding codec need to be detailed here or can we just name-drop it as an advantage of zarr v3 and then link to zarr's information about it? There are a couple of things which could change in NGFF to make it more zarr-y. They're not blockers to adoption, just something to reduce downstream users' headaches, which IMO will never be addressed if not now. One is the case convention: zarr uses The other is how enums are tagged: zarr uses adjacently tagged ( |
These are both very good points -- on the latter, you might want to weigh in over at #138, where a lot of new enums are being minted. |
I think sharding is a major motivation to move to v3. That is why it gets so much space in my proposal. It is intended as background information and, in the end, has no direct implications on the OME-Zarr spec.
Pinging @joshmoore for advice on whether that should be a separate RFC. I think it is possible to bundle multiple RFCs into one new version of the spec. |
Thanks @normanrz! I'm also happy to endorse this RFC! |
Also happy to endorse, this would be very beneficial for our use cases, huge thanks @normanrz ! |
Thanks @normanrz, I am also happy to endorse this RFC. |
Thanks a lot @normanrz , I am also happy to endorse this RFC! It will be great to get sharding for OME-Zarrs. |
I also endorse this RFC! |
I'm happy to endorse this RFC! 👍 |
Obviously I'm all in favour of supporting v3, but:
What is the motivation for this? Why should we couple ome-zarr and zarr so tightly? If someone has ome-zarr v0.5 (or whatever) metadata at the root of a v2 zarr folder, why should that be forbidden? |
Gaaaah, don't comment before reading the article! 😅 I just read:
which explains it. However, it does still seem lightweight to support both on the ome side? |
While it is easy to support both versions in the OME spec document, I'm concerned with the complexity burden for implementations. I'd rather not add one dimension to the compatibility matrix. From an OME-Zarr user's perspective, the hard cut also makes things simpler: ≤ 0.5 => Zarr v2; > 0.5 => Zarr v3 (or whatever the version number will be). If users wish to upgrade their data from one OME-Zarr version to another, it would be easy to also migrate the core Zarr metadata to v3. This is a fairly cheap operation, because only JSON files are touched. Zarr v2 and v3 metadata could even live side-by-side in the same hierarchy. There are functions available that can migrate the metadata automatically (e.g. in zarrita and soon zarr-python). |
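To illustrate why the migration only touches JSON, here is a minimal sketch of translating one array's v2 metadata (`.zarray` plus `.zattrs`) into a v3 `zarr.json` document. It is not the zarrita or zarr-python migration code. The function name `v2_array_to_v3` and the small dtype table are hypothetical, and the field names follow my reading of the Zarr v2 and v3 specifications. The sketch assumes C order, no filters, a non-null fill value, and either no compressor or gzip. It copies the attributes unchanged, so any relocation of the OME-Zarr metadata is out of scope here.

```python
# Hypothetical lookup table; a real tool would cover every data type.
V2_TO_V3_DTYPE = {"|u1": "uint8", "<u2": "uint16", "<f4": "float32"}


def v2_array_to_v3(zarray: dict, zattrs: dict) -> dict:
    """Build the contents of a v3 `zarr.json` from v2 `.zarray` and `.zattrs`."""
    if zarray.get("order", "C") != "C" or zarray.get("filters"):
        raise NotImplementedError("sketch covers C order without filters only")

    dtype = zarray["dtype"]
    # The `bytes` codec serializes the array; endianness is only
    # configured for multi-byte types.
    bytes_codec = {"name": "bytes"}
    if dtype.startswith("<"):
        bytes_codec["configuration"] = {"endian": "little"}
    codecs = [bytes_codec]

    compressor = zarray.get("compressor")
    if compressor is not None:
        if compressor.get("id") != "gzip":
            raise NotImplementedError("sketch covers gzip only")
        codecs.append(
            {"name": "gzip", "configuration": {"level": compressor["level"]}}
        )

    return {
        "zarr_format": 3,
        "node_type": "array",
        "shape": zarray["shape"],
        "data_type": V2_TO_V3_DTYPE[dtype],
        "chunk_grid": {
            "name": "regular",
            "configuration": {"chunk_shape": zarray["chunks"]},
        },
        # The "v2" key encoding keeps the existing chunk file names valid,
        # so no chunk data has to be rewritten.
        "chunk_key_encoding": {
            "name": "v2",
            "configuration": {"separator": zarray.get("dimension_separator", ".")},
        },
        "fill_value": zarray["fill_value"],
        "codecs": codecs,
        "attributes": zattrs,
    }


example = v2_array_to_v3(
    {
        "zarr_format": 2,
        "shape": [512, 512],
        "chunks": [64, 64],
        "dtype": "<u2",
        "compressor": {"id": "gzip", "level": 5},
        "fill_value": 0,
        "order": "C",
        "filters": None,
        "dimension_separator": "/",
    },
    {},
)
```

Writing the result as `zarr.json` next to the existing `.zarray` would leave both metadata versions side-by-side over the same chunk files.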
Sure, I guess it is indeed easier as a user to know if you have an ome-zarr v0.5 file that all readers would support it, rather than have to understand whether your reader supports zarr v2. It would indeed be easy to get into a situation like "does this USB cable support data transfer and at what speed?" 😅 Anyway, please take my question as more for my own information rather than as a blocker: I too am happy to endorse this plan. 😊 |
The current plan is definitely to collect multiple RFCs into a single spec version, but that being said, I personally don't think sharding needs a separate RFC. With the move to v3, we have the chance to decouple the NGFF specifics from the specifics of the backend (looking at you, dimension separator), so I agree with @clbarnes that we should be referencing an existing spec more (ZEP1 & ZEP2), but we do need to make users & implementers aware of the trade-offs that a given backend provides them. It's going to be a fine balance.
I still need to go through the text, but a 👍 for including any explanations you give here in the main text if they are not already there. |
@normanrz gently pointed out that I had misunderstood his question. The point was whether or not this RFC should include issues about camelCasing, etc. I would think not. Judging simply by the number of endorsers this has already received, adding more issues, especially ones that are prone to bikeshedding, can only make things less clear. (And for such topics, it likely makes sense to do some consensus building outside of the RFC before bringing it for review (D2 "gather support"), but that's beyond the scope of this PR. 😄) |
Thanks for the endorsements and feedback! I moved the proposal to draft state D3, which is intended to clarify questions before the review. Any feedback and questions are appreciated. More endorsements are also very welcome! Based on the feedback, I moved the sections about sharding to "Background" (because it is not really part of this RFC; just illustrates a motivation for adopting v3) and added the motivation for dropping v2 support in the text. |
Co-authored-by: Yaroslav Halchenko <[email protected]>
+1 to the chorus of endorsements. I am looking forward to using Zarr v3 with several of my hats. |
Thanks @normanrz! I endorse this RFC. |
rfc/2/index.md
> ## Proposal
>
> This RFC proposes to adopt version 3 of the Zarr format for OME-Zarr.
> Version 2 will no longer be supported.
In what sense? What will happen with data already in Zarr v2? Will it remain at least readable for the foreseeable future?
That depends on the implementations. If libraries continue to support OME-Zarr ≤ 0.5, Zarr 2 will remain accessible. As I noted in another section of the proposal, it is fairly easy to migrate Zarr v2 to v3 without rewriting the chunk data.
Wording remains unclear -- "no longer be supported" by what? By libraries? By the standard itself?
If it is "easy", why not instruct or recommend that libraries do it on the fly?
My motivation: on dandiarchive.org we already have many TBs of OME-Zarr v2. Even if it is "easy" to convert, once data is published and a DOI is minted, data must not mutate.
Ideally a standard should remain stable enough to allow access to data for years to come. Otherwise it would be hard(er) to convince people to place data in such a "standard form" which might no longer be accessible within just a handful of years.
> Wording remains unclear -- "no longer be supported" by what? By libraries? By the standard itself?

This means that data that uses OME-Zarr version 0.6 (or whatever the version number ends up being for this RFC) or higher will have to use Zarr v3 instead of v2. Data that uses earlier versions of OME-Zarr is not affected by this requirement.
> If it is "easy", why not instruct or recommend that libraries do it on the fly?
My hope is that implementations still continue to support a wide range of OME-Zarr versions, including the current 0.4. In the end that is up to each implementation to decide. I think this table should get another dimension for the OME-Zarr version to track implementation support. I don't think it is in scope for this RFC to make recommendations about support for earlier versions of the spec, though.
> My motivation: on dandiarchive.org we already have many TBs of OME-Zarr v2. Even if it is "easy" to convert, once data is published and a DOI is minted, data must not mutate.
> Ideally a standard should remain stable enough to allow access to data for years to come. Otherwise it would be hard(er) to convince people to place data in such a "standard form" which might no longer be accessible within just a handful of years.
Since the data in an archive is immutable, it wouldn't be upgraded to new versions of OME-Zarr. That would mean it is not affected by this RFC. If new data is published under the new version of OME-Zarr, it needs to use Zarr v3 under the hood.
There is a case to be made to abandon version numbers and require backwards compatibility on a spec level. Personally, I think that is too early for OME-Zarr. That is something a later RFC could pick up.
> While the adoption of Zarr v3 does not strictly require changes to the OME-Zarr metadata, this proposal contains changes to align with community conventions and ease implementation:
>
> - OME-Zarr metadata will be stored under a dedicated key in the Zarr array or group attributes. The key will be a well-known URI of the OME-NGFF specification with a version number, e.g. `https://ngff.openmicroscopy.org/0.6`.
"E.g." or that specific one forever? Or would it always start with that specific URI, with the version component possibly changing?
Why not just "ngff" and have a dedicated and clearly defined field "Version"? That would simplify accessing such a key (instead of matching by leading URI when accommodating multiple versions).
The idea was to reference the version in the key. Yes, that would require a prefix scan, which I don't think is a big problem because most implementations parse the entire JSON anyway. The upside is that you could have multiple OME metadata versions in the same document, e.g. https://ngff.openmicroscopy.org/0.6 and https://ngff.openmicroscopy.org/1.0.
@joshmoore suggested this approach rather than having a generic "ome" key with an additional "version" key in #206 (comment)
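The prefix scan mentioned above could look roughly like the following sketch. The prefix is taken from the example key in the RFC text. The function name `find_ngff_metadata` and the placeholder metadata contents are hypothetical, and the final key format may differ.

```python
# Prefix taken from the RFC's example key; the final URI may change.
NGFF_PREFIX = "https://ngff.openmicroscopy.org/"


def find_ngff_metadata(attributes: dict) -> dict:
    """Map version string -> metadata for every key under the NGFF prefix."""
    return {
        key[len(NGFF_PREFIX):]: value
        for key, value in attributes.items()
        if key.startswith(NGFF_PREFIX)
    }


# Placeholder attributes with two metadata versions side by side.
attrs = {
    "https://ngff.openmicroscopy.org/0.6": {"multiscales": []},
    "https://ngff.openmicroscopy.org/1.0": {"multiscales": []},
    "unrelated": 1,
}
found = find_ngff_metadata(attrs)
print(sorted(found))  # ['0.6', '1.0']
```

Since the attributes are already parsed into a dictionary, the scan is a single pass over the top-level keys.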
That's "cute" -- thanks for the explanation/reference! I wonder if use of a versioned URI to "allocate" a namespace here was ra- by a wider Zarr community? Not that it would not be permitted, but I believe it is the first/unique use there (IIRC it stood out whenever I saw it among other keys), and the wider Zarr community might either like to adopt it more widely or raise some concerns.
My immediate question is: what should be the "standard" behavior when multiple versions are encountered? It could be "the latest supported is taken". Such behavior should also be formalized at either the Zarr or NGFF level. It might be worth spelling this out ahead of time if such multi-version specification is allowed.
> That's "cute" -- thanks for the explanation/reference! I wonder if use of a versioned URI to "allocate" a namespace here was ra- by a wider Zarr community? Not that it would not be permitted, but I believe it is the first/unique use there (IIRC it stood out whenever I saw it among other keys), and the wider Zarr community might either like to adopt it more widely or raise some concerns.
There is an open discussion on the Zarr level about metadata conventions, of which NGFF/OME-Zarr is one. That discussion has not reached a conclusion yet. And, I don't think the OME community necessarily needs to wait for a conclusion.
Generally, URLs are used in other places in the Zarr spec, e.g. for names of extensions. So, it does not seem entirely foreign.
> My immediate question is: what should be the "standard" behavior when multiple versions are encountered? It could be "the latest supported is taken". Such behavior should also be formalized at either the Zarr or NGFF level. It might be worth spelling this out ahead of time if such multi-version specification is allowed.
That is currently not defined. I almost think that implementations should decide which metadata they should follow. My thinking is that image data and metadata must match. That is true for every version of the metadata that may exist. So, implementations can choose which version they understand and use that to interpret the image data. If implementations support multiple versions, it would be reasonable to use the most recent one.
For example, you could have an OME-Zarr image that has the 0.4 metadata (based on Zarr v2) and the 0.6 metadata (based on Zarr v3). That is possible, because the chunk data is mostly compatible and just the json files differ. Implementations that don't support Zarr v3 can still read the v2 version and vice versa.
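As a rough illustration of the side-by-side case, a reader could check which Zarr metadata files are present in a node's directory before deciding how to open it. This sketch relies on the file names defined by the Zarr specifications (`.zarray`/`.zgroup` for v2, `zarr.json` for v3). The function name `zarr_formats_present` is hypothetical, and a real implementation would go through a store abstraction rather than a local path.

```python
import tempfile
from pathlib import Path


def zarr_formats_present(node: Path) -> set:
    """Return the Zarr format versions whose metadata files exist in `node`."""
    formats = set()
    # Zarr v2 stores array and group metadata in separate dotfiles.
    if (node / ".zarray").is_file() or (node / ".zgroup").is_file():
        formats.add(2)
    # Zarr v3 uses a single zarr.json per node.
    if (node / "zarr.json").is_file():
        formats.add(3)
    return formats


# Simulate an array directory that carries both metadata versions.
with tempfile.TemporaryDirectory() as tmp:
    node = Path(tmp)
    (node / ".zarray").write_text("{}")
    (node / "zarr.json").write_text("{}")
    present = zarr_formats_present(node)
```

An implementation that only understands Zarr v2 would ignore `zarr.json`, and vice versa, which is what allows both to describe the same chunk files.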
Co-authored-by: Yaroslav Halchenko <[email protected]>
I endorse this RFC. Some comments:
|
While I would like that, I didn't intend to change this behavior as part of this RFC. |
This pull request has been mentioned on Image.sc Forum. There might be relevant details there: https://forum.image.sc/t/ome-ngff-update-postponing-transforms-previously-v0-5/95617/1 |
As described in https://forum.image.sc/t/ome-ngff-update-postponing-transforms-previously-v0-5/95617/2, merging this to move forward with the first round of reviews. Thanks all for your feedback & endorsements. |
RFC: Zarr v3 SHA: 569c905 Reason: push, by @joshmoore Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Commenting solely in an individual capacity, I endorse the premise that "OME-Zarr should adopt Zarr v3 as the storage format". I have recently deployed Zarr v3 shards within Janelia on behalf of Philip Keller's Lab for use with neuroglancer, tensorstore, and other compatible tools. The content of the RFC itself is confusing. It seems to mostly replicate parts of the Zarr v3 specification and highlight key changes. Rather than duplicate such specification content, the RFC should mainly reference the canonical Zarr v3 specification. Lacking from the RFC is content pertinent to OME-Zarr as a standard. I would particularly like to see the following.
|
This pull request has been mentioned on Image.sc Forum. There might be relevant details there: https://forum.image.sc/t/ome-ngff-update-postponing-transforms-previously-v0-5/95617/4 |
Thanks for your feedback @mkitti!
The OME metadata will be stored in the
OME-Zarr <0.5 only supports Zarr v2 and OME-Zarr ≥0.5 will only support Zarr v3. It is recommended that implementations support a number of OME-Zarr versions so that existing data remains readable. I think that recommendation is useful not only for this RFC. This is discussed in this section in the RFC.
There is some mention of migration scripts in this section of the RFC. The next iteration of the RFC after the first round of reviews will also contain the changes to the spec document, with updated examples.
Yes, Zarr v2 and v3 metadata files can exist side-by-side. As a consequence, OME-Zarr 0.4 and 0.5 metadata should also be able to exist side-by-side. With the new versioned namespace, this will be even easier for future versions. We should probably add an explicit recommendation that implementations prefer the newest version of the metadata that they support. |
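One possible reading of "prefer the newest version that they support" is sketched below. The RFC does not define this behavior, so this is only an assumption about how an implementation might do it. The helper names `parse_version` and `pick_version` are hypothetical, and the sketch assumes purely numeric dotted versions (a suffix such as `-dev` would need extra handling).

```python
def parse_version(text: str) -> tuple:
    """Turn "0.10" into (0, 10) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in text.split("."))


def pick_version(available, supported):
    """Return the newest version present in both collections, or None."""
    usable = set(available) & set(supported)
    return max(usable, key=parse_version) if usable else None


# Metadata exists for 0.4, 0.6 and 1.0; this reader understands 0.4 and 0.6.
chosen = pick_version(available=["0.4", "0.6", "1.0"], supported=["0.4", "0.6"])
print(chosen)  # 0.6
```

Comparing parsed tuples rather than raw strings avoids ranking "0.9" above "0.10".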
This is an RFC proposal for adopting Zarr v3 as the new storage format for OME-Zarr.
It is a followup to the discussions in #206 and on image.sc.
Briefly, this proposal aims to adopt Zarr v3 as the new format for the next version of OME-Zarr. That unlocks the new features of Zarr v3 including sharding. Zarr v2 would not be allowed anymore (only through older versions of OME-Zarr). Additionally, there are some small changes to the OME-Zarr metadata to improve namespacing and versioning.
This RFC is currently in draft status with the goal of clarifying questions before the full review. Additional endorsements are also welcome.
Check this link for a review: https://ngff--227.org.readthedocs.build/rfc/2/index.html