-
Notifications
You must be signed in to change notification settings - Fork 29
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
V3: node_type considered harmful #269
Comments
Dennis, it's great to finally have your feedback. This issue was discussed extensively nearly a year ago, see #177 You're very active today with many great suggestions. Unfortunately, the window for major changes to the V3 spec has passed. On May 8, @WardF voted yes on ZEP-1 on behalf of Unidata: #227 (comment). Now that you are actually trying to implement Zarr V3, you are finding many problems with the spec. I think this is very natural and to be expected. I feel we got the sequencing wrong by first writing this massive spec without really trying to implement it. But if we stick to the process we have defined for ourselves, the comment period is over. I'd welcome thoughts on how to move forward in a productive way. |
A key point that came up in the discussions was that typically when listing the contents of a zarr hierarchy, you will most likely want more information than just whether it is a group or an array, such as the shapes and data types of any arrays, and would need to read the metadata anyway. |
"When listing the contents of a zarr hierachy..." Agreed, |
I am in the process of implementing V3, and unfortunately, the only way I have of really understanding |
Can you say a bit more what "lazy loading" means? Is the issue that your API provides a list of all of the arrays within a group (but no other information)? |
Yes. Also with groups. We also do lazy attribute loading which is why I support storing attributes as a separate object. |
Could you add a new API function that returns the list of all members, sub-groups and arrays, and recommend that users call that instead for greater efficiency when using zarr v3? |
Not sure I follow. There seems -- to the client -- not a lot of difference in calling |
If there is one API function that lists both arrays and groups, and just returns a flat list of names without indicating which are arrays and which are groups, then you don't need to read the zarr.json files to handle calls to that API function. |
I see. Our API does differentiate. And this is important because it is possible to walk the group tree |
Yes, I am very optimistic that we can incorporate your suggestions and input. FWIW, I think it would be better to evolve the spec incrementally towards a "final" version as we work on implementing it, rather than deciding on the full spec up front and approving it through a big vote. The way STAC has developed their spec seems to be more effective.
I see this as a very important goal. The algorithm for walking a hierarchy is considerably more complex if you can't identify whether a node is a group or array based on the node name. As some point, the decision was made that the cost of reading the metadata files was not significant. However, I'd like to see some quantitative analysis to confirm this for realistic datasets. |
I believe that putting the node type (group or array) into the zarr.info object is not a good idea.
In Zarr v2, the node type was encoded in the object name: .zgroup or .zarray. Hence it is possible
to identify all the groups and all the arrays without having to read any object.
With Zarr V3, you can only identify the groups and arrays by reading each occurrence of the zarr.info object.
This is potentially a significant performance hit.
I would propose a change to something like "zgroup.json" and "zarray.json".
The text was updated successfully, but these errors were encountered: