v3: Define standard "URL" syntax for referencing a specific array, group, attribute within a zarr repository #132
Comments
Good idea! Pinging @martindurant for fsspec URI interoperability angle. |
From fsspec's point of view, URLs have a strict format, with the various pieces having meaning to the location of the artefact pointed to. If zarr were to use URLs to contain extra information, it had better not end up being passed to fsspec! Typical URL component delimiters like "/", ":", "#" and "?" are already overloaded by the various storage implementations. I find myself slightly puzzled over what advantage encoding an array's location in a single string brings. The example given, "zarr://...", obviously does not apply, since in this library, we are always dealing with zarr. Instead, we are wanting to encode something like "get subarray X from dataset at location Y", which sounds to me like it makes a lot of sense to maintain as two arguments, and so stick with python's "explicit is better than implicit" mantra. Probably there are other arguments as prominent as array path. From an Intake point of view, catalogs provide a place in which you can list a set of arguments to be passed to a backend like zarr, and name and describe that dataset - in a declarative way. There are other ways to do this too... |
A lot of different tools operate on arrays, and it is convenient to be able to specify the location of an array in a form that facilitates easy communication/interoperability between different people and tools. For example, it is convenient to have a "specifier" (be it a "URL" or something else) that can be copy-and-pasted directly between email messages, chat messages, command-line arguments, UI text entry boxes in various tools, etc. Certainly in principle this specifier does not have to have "URL-like" syntax, e.g. could be a JSON representation such as:
or
But I would say there are a lot of advantages to using URL-like syntax:
In general the same reasons why URL syntax is convenient for web browsers and fsspec itself also apply to specifying a zarr array. As for the "zarr://" part (i.e. specifying that the format is zarr), obviously that isn't necessary for software that only supports zarr, and such software could omit the "zarr+" or "zarr://" prefix. But a lot of tools (e.g. Neuroglancer) support multiple formats, and therefore need a way to specify the format as part of the URL. I would actually suggest that "zarr://" be reserved for zarr v2 and "zarr3://" or "zarr3+" be used for zarr 3. |
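To make the format-prefix idea concrete, here is a minimal sketch of how a multi-format tool might separate a "zarr3+"-style prefix from the underlying storage URL. The `<format>+<scheme>://` grammar and the `split_format` helper are assumptions for illustration only; nothing in this thread or in the zarr spec defines them.

```python
import re
from typing import Optional, Tuple

# Hypothetical grammar: "<format>+<storage-url>", e.g. "zarr3+https://host/path".
# The storage URL must itself start with "<scheme>://" so that plain paths
# containing "+" are not mistaken for a format prefix.
_FORMAT_PREFIX = re.compile(
    r"^(?P<format>[a-z][a-z0-9_]*)\+(?P<rest>[a-z][a-z0-9+.-]*://.+)$"
)


def split_format(url: str) -> Tuple[Optional[str], str]:
    """Return (format, storage_url); format is None when no prefix is present."""
    match = _FORMAT_PREFIX.match(url)
    if match is None:
        return None, url
    return match.group("format"), match.group("rest")


fmt, storage_url = split_format("zarr3+https://example.com/data")
# fmt is "zarr3"; storage_url is the plain "https://example.com/data",
# which could then be handed to a storage layer unchanged.
```

A zarr-only library could skip this step entirely and accept the storage URL directly, which matches the point that the prefix matters mainly for tools that support several formats.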
There is an additional advantage to URL-like syntax that I forgot to mention: Neuroglancer will automatically list directories and supports tab completion when interactively typing in a data source URL. That is quite natural with URL syntax but is much more awkward and complicated with a JSON-like specifier. |
That seems to me very unlikely for any part of a URL that is format specific or storage specific. This is sort of the point I am getting at: URLs are really useful when they explicitly refer to a storage location. |
In web browsers, there is the As an aside, in Neuroglancer every data source (such as zarr) can define its own data source completion function, which indeed allows format-specific completion: File-backed data sources like zarr, n5, neuroglancer_precomputed, nifti currently just defer to a generic "file"-based completion implementation, but other data sources like bossdb and dvid use custom logic, and it would be straightforward to add support for completing the array path within a zarr v3 repository. |
For a specific case: what happens if your zarr requires query parameters on an HTTP URL? Or perhaps it lives in a local directory with "#" in the name. How do you separate the parts? |
Query parameters are fine --- they would work exactly the same as in web browsers, they would come before the fragment part. As for local directories with a "#" in the name, that can also be handled in the same way as web browsers, by percent-encoding the "#" and any other characters not allowed within the path component of the URL. |
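As a rough illustration of the two points above, the following sketch uses only Python's standard `urllib.parse`. The host, token, directory name, and the use of the fragment to carry an array path are all made-up assumptions; the thread has not settled on any such syntax.

```python
from urllib.parse import quote, unquote, urlsplit

# A query string and a fragment coexist: the query comes first.
# Treating the fragment as "path within the zarr repository" is an assumption.
http_parts = urlsplit("https://example.com/data.zarr?token=abc#path/to/array")
print(http_parts.query)     # token=abc
print(http_parts.fragment)  # path/to/array

# A local directory with "#" in its name: percent-encode it when building
# the file:// URL. quote() leaves "/" alone and turns "#" into "%23".
local_dir = "/data/run#1/volume.zarr"
file_url = "file://" + quote(local_dir)
print(file_url)             # file:///data/run%231/volume.zarr

# Decoding the path component recovers the original directory name.
assert unquote(urlsplit(file_url).path) == local_dir
```

The encoded "#" never reaches the fragment parser, so the directory name and any later fragment stay unambiguous.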
OK, so defining internal path and any other arguments via (query) parameters is out?
Is the user doing this? Is the last "#" the only one that matters? Can you tell the difference between a normal ascii path with parameters and a path containing "#" without parameters? |
I'm not sure I fully understand what you are asking. If we didn't have any other need for query string parameters, then it would indeed work to use them to specify the path to the array within the zarr v3 repository. But if, as you gave as an example, the user wants to specify some query string parameters to send in the HTTP requests, then it is problematic if we are also using the query string parameters for a different purpose.
Are you thinking of the case where we want to have a single string that could either refer to a URL, or to a local path but without using a "file://" prefix or other prefix to disambiguate? Since there are almost no restrictions on the allowed characters in local paths, I would say that it is problematic to allow such syntax, and instead a "file://" prefix should always be required to refer to local paths. The user would indeed have to ensure any characters not valid in a URL are escaped when specifying the "file://" path. But I think it is reasonable for paths with unusual characters to cause the user a bit more trouble as long as there is a way to specify them, since such paths are rare in my experience and it is reasonable to discourage them. The most common special character is probably space, but that could be supported anyway since it is not ambiguous. As for multiple "#" characters in a URL --- in web browsers / normal URL syntax, everything after the first "#" is considered the fragment. But since we haven't yet defined what the zarr URL syntax would be it is hard to say what the meaning would be there. As an extreme example, you might want to be able to use a URL to refer to a zarr array that is found by:
I'm not sure what makes sense for this --- the usual |
Note that fsspec can already access a zip-in-a-zip-in-a-http with a URL formatted something like
I include the query and part just for illustrative purposes (I don't think I've ever seen a URL quite this complex). I wouldn't know where to add extra zarr-specific arguments to this, and I really don't think that zarr itself should be concerned with delving into nested storage layers. |
In your example URL:
what does the I see that fsspec uses |
One minor drawback I see with the inner-to-outer order used by fsspec is that it would be challenging to support completion since the user would need to type the parts in the reverse order that they occur. |
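The ordering concern can be seen with plain string handling. fsspec's documented URL chaining separates layers with "::"; the chained URL below is a made-up example, and `split_chain` is a hypothetical helper rather than part of fsspec.

```python
from typing import List


def split_chain(url: str) -> List[str]:
    """Split a '::'-chained URL into its layers, in the order written."""
    return url.split("::")


# Hypothetical chained URL: a member of a zip, inside another zip, over HTTPS.
chained = "zip://inner/array::zip://nested.zip::https://example.com/outer.zip"

layers = split_chain(chained)              # innermost layer first, as typed
outer_to_inner = list(reversed(layers))    # the order in which data is resolved

print(outer_to_inner[0])   # https://example.com/outer.zip
```

Interactive completion naturally proceeds from the outermost layer (the HTTPS location) inward, which is the reverse of the order the user types in this form.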
Just an example thing a URL might reasonably contain.
Yes, true - but I've never seen it. Note that percent encoding is only valid for HTTP(s) and not other possible backends. fsspec supports many backends.
True, but this is no longer easy to change. |
For HTTP by itself there would be no reason for a Perhaps there are other protocols supported by fsspec that do require a |
For example, GCS uses "#" to specify which version of a remote file is requested, and it must not be stripped from the string you send. |
Regarding GCS use of "#version" syntax, I see that is documented for the gsutil tool, though I'm struggling to find any documentation of that for the gcsfs fsspec driver. That does seem like an unfortunate choice, since it is inconsistent with the normal URL convention that the Perhaps an alternative way to specify the generation could be added, such as I will note that in the context of zarr it would not be useful to specify a generation number since a zarr repository consists of many individual files, each of which will have separate generation numbers. However a generation number could in principle be useful in the case of a zarr repository nested inside of a zip file stored on GCS. |
Again, you are talking only about HTTP, but there are many URL schemes. |
I'm not talking specifically about HTTP, but I am thinking in terms of how web browsers behave, since they are the most prominent user of "URL" syntax. There is RFC 3986 that defines a generic syntax for URIs, and it specifically describes the fragment portion as follows: In particular, it states:
Of course it is not necessary to actually follow this RFC, and indeed lots of software uses URL-like syntax for various purposes without adhering to this RFC. Additionally, this RFC specifically states that it is defining a common "generic syntax" for some URI schemes, but does not purport to define the syntax for all URI schemes. Still, I think when defining a new application for "URL-like" syntax, it is wise to adhere to common URI standards unless there is a compelling reason not to, as that allows users to leverage their existing knowledge and intuitions regarding "URLs". |
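For reference, this is how a generic RFC 3986 style parser (here Python's standard `urllib.parse.urlsplit`) treats the cases raised in this thread. The URLs, bucket name, and generation number are invented for illustration.

```python
from urllib.parse import urlsplit

# Everything after the FIRST "#" is the fragment, including later "#" characters.
multi = urlsplit("https://example.com/repo?x=1#group#array")
print(multi.query)      # x=1
print(multi.fragment)   # group#array

# The same rule is applied regardless of scheme, so a gsutil-style
# "#<generation>" suffix is split off the object path by a generic parser.
gcs = urlsplit("gs://bucket/object#123")
print(gcs.path)         # /object
print(gcs.fragment)     # 123
```

This shows both sides of the disagreement: following the generic syntax gives predictable splitting, while a backend that gives "#" its own meaning needs the string passed through without generic fragment stripping.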
Lots of related discussion happening in #177 |
Since the entrypoint semantic might be removed (see #192), there's no need to have a separate part of a URL for the entrypoint and the array anymore. Instead, we could simply use a URL with a path which points to the array or group in question, e.g. This would be quite similar to v2. However, I think it's useful to standardize this to some extent, so that URLs pointing to an array/group might be re-used between different zarr-compatible tools. I'd propose to put this into the spec as a recommendation for now. |
Agreed. I think in the browser it should also default to file:// for the virtual local filesystem. I would never expect an HTTP URL to be without its protocol specifier. |
I think there are strong advantages to having the url point to an actual file, not a directory / prefix. For example, I'd much rather have the canonical URL be It's trivial for implementations to strip the |
Fine for me as well, no strong opinion here.
I'm wondering how this would work with #192 in place, where a zarr.json is not necessarily found. For example, there could be a v2 compat extension/storage transformer, where a v2 group gets an additional |
I was assuming that this URL syntax would only work for V3, in which every array or group will have a |
We don't have any specific examples of group-level storage transformers yet, but with any group-level storage transformer, we do need the URL to separately indicate the location to at least the outermost group with a storage transformer (basically the "root"), and then the path from there; otherwise we are back to searching upwards to find the "root". To avoid confusion it would be better that all group-level storage transformers also transform the zarr.json key to something unique to that group, to ensure it is not accidentally accessed without the storage transformer. |
I agree. Probably it is ok to defer the specification of that format until we have such an extension?
Which seems ok to me as well.
👍 For the moment I'd just specify a simple canonical URL form to open groups or arrays, which is Later, an additional form can append a string to the URL to specify a child which should be opened via a parent group, e.g. something like However, I think the simple format should stay the default for all of the different alternatives, and is enough in the context of ZEP 1. |
It is convenient to be able to reference an array (or group or attribute) with a single string.
Existing zarr v2 implementations have invented various URL syntaxes. E.g. in Neuroglancer:
zarr://https://whatever/path/to/array
(Arguably a better syntax would have been:
zarr+https://whatever/path/to/array
)
With zarr v3 it is necessary to separately specify the root and the path within the repository. To allow for better interoperability, it would be helpful for zarr v3 to specify a URL syntax to avoid divergence between implementations. Of course the zarr specification itself does not concern itself with the underlying storage mechanism such as https, but it would still be helpful to standardize a URL syntax.
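One possible reason to prefer the "zarr+https://" form is how generic URL parsers decompose the two spellings. The sketch below feeds both example URLs from this issue to Python's standard `urllib.parse.urlsplit`; it only illustrates parser behavior and is not a claim about how Neuroglancer or any zarr implementation parses them.

```python
from urllib.parse import urlsplit

nested = urlsplit("zarr://https://whatever/path/to/array")
plus = urlsplit("zarr+https://whatever/path/to/array")

# Nested form: the inner scheme is mistaken for the authority component.
print(nested.scheme)  # zarr
print(nested.netloc)  # https:
print(nested.path)    # //whatever/path/to/array

# "+" form: scheme, host, and path come out where a generic parser expects them.
print(plus.scheme)    # zarr+https
print(plus.netloc)    # whatever
print(plus.path)      # /path/to/array
```

With the "+" form, the format and transport can be recovered by splitting the scheme on "+", and the rest of the URL needs no special handling.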