v3: Define standard "URL" syntax for referencing a specific array, group, attribute within a zarr repository #132

Open
jbms opened this issue Feb 8, 2022 · 27 comments

Comments

@jbms
Contributor

jbms commented Feb 8, 2022

It is convenient to be able to reference an array (or group or attribute) with a single string.

Existing zarr v2 implementations have invented various URL syntaxes. E.g. in Neuroglancer:

zarr://https://whatever/path/to/array

(Arguably a better syntax would have been: zarr+https://whatever/path/to/array)

With zarr v3 it is necessary to separately specify the root and the path within the repository. To allow for better interoperability, it would be helpful for zarr v3 to specify a URL syntax to avoid divergence between implementations. Of course the zarr specification itself does not concern itself with the underlying storage mechanism such as https, but it would still be helpful to standardize a URL syntax.

@rabernat
Contributor

rabernat commented Feb 8, 2022

Good idea! Pinging @martindurant for the fsspec URI interoperability angle.

@martindurant
Member

From fsspec's point of view, URLs have a strict format, with the various pieces having meaning to the location of the artefact pointed to. If zarr were to use URLs to contain extra information, it had better not end up being passed to fsspec! Typical URL component delimiters like "/", ":", "#" and "?" are already overloaded by the various storage implementations.

I find myself slightly puzzled over what advantage encoding an array's location in a single string brings. The example given, "zarr://...", obviously does not apply, since in this library, we are always dealing with zarr. Instead, we are wanting to encode something like "get subarray X from dataset at location Y", which sounds to me like it makes a lot of sense to maintain as two arguments, and so stick with python's "explicit is better than implicit" mantra. Probably there are other arguments as prominent as array path.

From an Intake point of view, catalogs provide a place in which you can list a set of arguments to be passed to a backend like zarr, and name and describe that dataset - in a declarative way. There are other ways to do this too...

@jbms
Contributor Author

jbms commented Feb 8, 2022

A lot of different tools operate on arrays, and it is convenient to be able to specify the location of an array in a form that facilitates easy communication/interoperability between different people and tools.

For example, it is convenient to have a "specifier" (be it a "URL" or something else) that can be copy-and-pasted directly between email messages, chat messages, command-line arguments, UI text entry boxes in various tools, etc.

Certainly in principle this specifier does not have to have "URL-like" syntax, e.g. could be a JSON representation such as:

["https://host/path/to/zarr_repo", "/path/to/array"]

or

{"driver": "zarr3", "kvstore": "https://host/path/to/zarr_repo", "array": "/path/to/array"}

But I would say there are a lot of advantages to using URL-like syntax:

  • Some existing software like Neuroglancer already uses URL syntax to identify data sources, and if there is no standardized URL syntax for zarr, one will be invented anyway.
  • Users are already accustomed to URLs and will likely be more readily able to copy and paste them without accidentally munging them.
  • A single string without quotes, spaces or other problematic characters is much easier to specify as a command-line argument.

In general the same reasons why URL syntax is convenient for web browsers and fsspec itself also apply to specifying a zarr array.

As for the "zarr://" part (i.e. specifying that the format is zarr), obviously that isn't necessary for software that only supports zarr, and perhaps they would exclude the "zarr+" or "zarr://" prefix. But a lot of tools (e.g. Neuroglancer) support multiple formats, and therefore need a way to specify the format as part of the URL.

I would actually suggest that "zarr://" be reserved for zarr v2 and "zarr3://" or "zarr3+" be used for zarr v3.

@jbms
Contributor Author

jbms commented Feb 8, 2022

There is an additional advantage to URL-like syntax that I forgot to mention: Neuroglancer will automatically list directories and supports tab completion when interactively typing in a data source URL. That is quite natural with URL syntax but is much more awkward and complicated with a JSON-like specifier.

@martindurant
Member

> Neuroglancer will automatically list directories and supports tab completion when interactively typing in a data source URL.

That seems to me very unlikely for any part of a URL that is format specific or storage specific. This is sort of the point I am getting at: URLs are really useful when they explicitly refer to a storage location.

@jbms
Contributor Author

jbms commented Feb 8, 2022

In web browsers, there is the scheme://host/path?query part which identifies the resource to fetch, and then there is the optional #fragment part that just identifies something within that resource, by default an anchor tag within the webpage. This seems to me to be quite similar to the situation we have with zarr, where we have the repository storage location and a sub-resource within that. Using "#/path/to/array" would seem to be the most natural syntax choice for zarr, but you have said that fsspec already uses "#" for a different purpose, so perhaps a different syntax needs to be used, or perhaps it is possible to have multiple #fragment components in order to indicate multiple layers of nested resources. Can you provide some examples of how fsspec is already using #fragment syntax?
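A minimal Python sketch of the fragment idea described above. It assumes, purely for illustration, that the fragment carries the path to the array; `split_zarr_url` is a hypothetical helper, and nothing here is a syntax defined by zarr.

```python
from urllib.parse import urlsplit, urlunsplit


def split_zarr_url(url: str) -> tuple[str, str]:
    """Split a URL into (store location, path within the repository).

    Hypothetical helper: the fragment is assumed to hold the array path.
    """
    parts = urlsplit(url)
    # Rebuild the URL without its fragment; this is what the store would see.
    store = urlunsplit((parts.scheme, parts.netloc, parts.path, parts.query, ""))
    return store, parts.fragment


store, array_path = split_zarr_url(
    "https://host/path/to/zarr_repo?token=abc#/path/to/array"
)
print(store)       # https://host/path/to/zarr_repo?token=abc
print(array_path)  # /path/to/array
```

The query string stays with the store location, so it would remain available for the HTTP request itself.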

As an aside, in Neuroglancer every data source (such as zarr) can define its own data source completion function, which indeed allows format-specific completion:

https://github.com/google/neuroglancer/blob/60c8c036d202cecc7bc09e18b07dc86757bef200/src/neuroglancer/datasource/index.ts#L218

File-backed data sources like zarr, n5, neuroglancer_precomputed, nifti currently just defer to a generic "file"-based completion implementation, but other data sources like bossdb and dvid use custom logic, and it would be straightforward to add support for completing the array path within a zarr v3 repository.

@martindurant
Member

> In web browsers, there is the scheme://host/path?query part which identifies the resource to fetch, and then there is the optional #fragment part that just identifies something within that resource, by default an anchor tag within the webpage.

For a specific case: what happens if your zarr requires query parameters on an HTTP URL? Or perhaps it lives in a local directory with "#" in the name. How do you separate the parts?

@jbms
Contributor Author

jbms commented Feb 8, 2022

Query parameters are fine --- they would work exactly the same as in web browsers, they would come before the fragment part.

As for local directories with a "#" in the name, that can also be handled in the same way as web browsers, by percent-encoding the "#" and any other characters not allowed within the path component of the URL.
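As a rough illustration of the percent-encoding approach, the following Python sketch builds a `file://` URL for a directory whose name contains "#". The directory name and the use of the fragment for the array path are assumptions made for the example.

```python
from urllib.parse import quote, unquote, urlsplit

# Hypothetical local directory with "#" in its name.
local_dir = "/data/run#1/zarr_repo"

# quote() leaves "/" alone by default and encodes "#" as %23.
url = "file://" + quote(local_dir) + "#/path/to/array"
print(url)  # file:///data/run%231/zarr_repo#/path/to/array

parts = urlsplit(url)
decoded_dir = unquote(parts.path)  # back to the original directory name
print(decoded_dir, parts.fragment)
```

Because the "#" inside the directory name is encoded, the only literal "#" left in the URL is the one that starts the fragment.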

@martindurant
Member

martindurant commented Feb 8, 2022

> Query parameters are fine --- they would work exactly the same as in web browsers, they would come before the fragment part.

OK, so defining internal path and any other arguments via (query) parameters is out?

> As for local directories with a "#" in the name, that can also be handled in the same way as web browsers, by percent-encoding the "#" and any other characters not allowed within the path component of the URL.

Is the user doing this? Is the last "#" the only one that matters? Can you tell the difference between a normal ascii path with parameters and a path containing "#" without parameters?

@jbms
Contributor Author

jbms commented Feb 8, 2022

> > Query parameters are fine --- they would work exactly the same as in web browsers, they would come before the fragment part.

> OK, so defining internal path and any other parameters via parameters is out?

I'm not sure I fully understand what you are asking. If we didn't have any other need for query string parameters, then it would indeed work to use them to specify the path to the array within the zarr v3 repository. But if, as you gave as an example, the user wants to specify some query string parameters to send in the HTTP requests, then it is problematic if we are also using the query string parameters for a different purpose.

> > As for local directories with a "#" in the name, that can also be handled in the same way as web browsers, by percent-encoding the "#" and any other characters not allowed within the path component of the URL.

> Is the user doing this? Is the last "#" the only one that matters? Can you tell the difference between a normal ascii path with parameters and a path containing "#" without parameters?

Are you thinking of the case where we want to have a single string that could either refer to a URL, or to a local path but without using a "file://" prefix or other prefix to disambiguate?

Since there are almost no restrictions on the allowed characters in local paths, I would say that it is problematic to allow such syntax, and instead a "file://" prefix should always be required to refer to local paths.

The user would indeed have to ensure any characters not valid in a URL are escaped when specifying the "file://" path. But I think it is reasonable for paths with unusual characters to cause the user a bit more trouble as long as there is a way to specify them, since such paths are rare in my experience and it is reasonable to discourage them. The most common special character is probably space, but that could be supported anyway since it is not ambiguous.

As for multiple "#" characters in a URL --- in web browsers / normal URL syntax, everything after the first "#" is considered the fragment. But since we haven't yet defined what the zarr URL syntax would be it is hard to say what the meaning would be there.
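For reference, Python's `urllib.parse.urlsplit` follows the same first-"#" rule. This short sketch only shows how one common parser behaves; the URL is made up, and it says nothing about what a zarr URL syntax would mean.

```python
from urllib.parse import urlsplit

# Hypothetical URL containing two "#" characters.
parts = urlsplit("https://host/path/to/zarr_repo#/outer#inner")

print(parts.path)      # /path/to/zarr_repo
print(parts.fragment)  # /outer#inner
```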

As an extreme example, you might want to be able to use a URL to refer to a zarr array that is found by:

  • Using http to access a given path with some query parameters: https://user:password@host:port/path/to/file.zip?some=parameter
  • Interpreting that as a zip file and accessing a given file within that zip file: inner/zip/file.zip
  • Interpreting the inner file as a zip file, and accessing a zarr repository at a given path within it: path/to/zarr_repo
  • Accessing a given array within the zarr repo: /path/to/array

I'm not sure what makes sense for this --- the usual format+protocol:// convention for layering multiple formats kind of breaks down here. But it is something we could keep in mind.

@martindurant
Member

Note that fsspec can already access a zip-in-a-zip-in-a-http with a URL formatted something like

zip://inner/dir::zip://inner_file.zip::http://server:port/file/path.zip?query=1#part

I include the query and part just for illustrative purposes (I don't think I've ever seen a URL quite this complex). I wouldn't know where to add extra zarr-specific arguments to this, and I really don't think that zarr itself should be concerned with delving into nested storage layers.
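To make the layering visible, here is a naive Python sketch that splits the chained URL above on "::". It is not fsspec's actual parser (which handles more cases); `split_chain` is a hypothetical helper used only to show the inner-to-outer ordering.

```python
def split_chain(url: str) -> list[str]:
    """Return the layers of a "::"-chained URL, outermost storage first."""
    # The chained form is written inner-to-outer, so reverse the pieces.
    return list(reversed(url.split("::")))


layers = split_chain(
    "zip://inner/dir::zip://inner_file.zip::"
    "http://server:port/file/path.zip?query=1#part"
)
for layer in layers:
    print(layer)
```

The HTTP layer, including its query and "#part", comes out as a single untouched string, which is why any zarr-specific suffix would have to avoid colliding with it.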

@jbms
Contributor Author

jbms commented Feb 8, 2022

In your example URL:

zip://inner/dir::zip://inner_file.zip::http://server:port/file/path.zip?query=1#part

what does the #part suffix indicate?

I see that fsspec uses :: as a way of nesting protocols. I suppose you just require that double colons in a path have to be percent-encoded?

@jbms
Contributor Author

jbms commented Feb 8, 2022

One minor drawback I see with the inner-to-outer order used by fsspec is that it would be challenging to support completion since the user would need to type the parts in the reverse order that they occur.

@martindurant
Member

> what does the #part suffix indicate?

Just an example thing a URL might reasonably contain.

> I see that fsspec uses :: as a way of nesting protocols. I suppose you just require that double colons in a path have to be percent-encoded?

Yes, true - but I've never seen it. Note that percent encoding is only valid for HTTP(s) and not other possible backends. fsspec supports many backends.

> One minor drawback I see with the inner-to-outer order used by fsspec is that it would be challenging to support completion

True, but this is no longer easy to change.

@jbms
Contributor Author

jbms commented Feb 8, 2022

For HTTP by itself there would be no reason for a #part suffix --- the #fragment part isn't sent to the server --- it is just for interpretation by the local client. So in this case we would be free to use it for the path to the zarr array. Only in some non-standard use of HTTP (that I have never seen) would you send a request that includes a #part suffix.

Perhaps there are other protocols supported by fsspec that do require a #fragment suffix for something else, though?

@martindurant
Member

For example, GCS uses "#" to specify which version of a remote file is requested, and it must not be stripped from the string you send.

@jbms
Contributor Author

jbms commented Feb 8, 2022

Regarding GCS use of "#version" syntax, I see that is documented for the gsutil tool, though I'm struggling to find any documentation of that for the gcsfs fsspec driver.

That does seem like an unfortunate choice, since it is inconsistent with the normal URL convention that the #fragment is for local use only.

Perhaps an alternative way to specify the generation could be added, such as ?generation=xxx; in direct use of fsspec both syntaxes could be supported, but when using a "zarr" URL the fragment could be used by zarr and then the ?generation=xxx form would have to be used.

I will note that in the context of zarr it would not be useful to specify a generation number since a zarr repository consists of many individual files, each of which will have separate generation numbers. However a generation number could in principle be useful in the case of a zarr repository nested inside of a zip file stored on GCS.

@martindurant
Member

> it is inconsistent with the normal URL convention that the #fragment is for local use only.

Again, you are talking only about HTTP, but there are many URL schemes.

@jbms
Contributor Author

jbms commented Feb 8, 2022

I'm not talking specifically about HTTP, but I am thinking in terms of how web browsers behave, since they are the most prominent user of "URL" syntax.

There is RFC 3986 that defines a generic syntax for URIs, and it specifically describes the fragment portion as follows:
https://datatracker.ietf.org/doc/html/rfc3986#section-3.5

In particular, it states:

> Fragment identifier semantics are independent of the URI scheme and thus cannot be redefined by scheme specifications.

Of course it is not necessary to actually follow this RFC, and indeed lots of software uses URL-like syntax for various purposes without adhering to this RFC. Additionally, this RFC specifically states that it is defining a common "generic syntax" for some URI schemes, but does not purport to define the syntax for all URI schemes.

Still, I think when defining a new application for "URL-like" syntax, it is wise to adhere to common URI standards unless there is a compelling reason not to, as that allows users to leverage their existing knowledge and intuitions regarding "URLs".

@rabernat
Contributor

rabernat commented Dec 1, 2022

Lots of related discussion happening in #177

@jstriebel
Member

Since the entrypoint semantic might be removed (see #192), there's no need to have a separate part of a URL for the entrypoint and the array anymore. Instead, we could simply use a URL with a path which points to the array or group in question, e.g. file:///my/path/to/a/group/inner_array. If no scheme is set, the implementation can fall back to a useful default (e.g. file:// for local programs or https:// in the browser). Stores can then define a default URL scheme that's used to load from the specific store.
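A small Python sketch of the scheme fallback, under stated assumptions: `with_default_scheme` is a hypothetical helper, the `"://"` check is a simplification, and only absolute POSIX-style paths are handled for the `file` default.

```python
from urllib.parse import quote


def with_default_scheme(location: str, default_scheme: str = "file") -> str:
    """Return `location` as a URL, adding `default_scheme` if none is given."""
    if "://" in location:
        return location  # a scheme is already present
    if default_scheme == "file":
        if not location.startswith("/"):
            raise ValueError("file URLs need an absolute path")
        # Percent-encode characters that are not valid in a URL path.
        return "file://" + quote(location)
    return f"{default_scheme}://{location}"


print(with_default_scheme("/my/path/to/a/group/inner_array"))
print(with_default_scheme("host/path/to/a/group", default_scheme="https"))
```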

This would be quite similar to v2. However, I think it's useful to standardize this to some extent, so that URLs pointing to an array/group might be re-used between different zarr-compatible tools. I'd propose to put this into the spec as a recommendation for now.

@martindurant
Member

> This would be quite similar to v2. However, I think it's useful to standardize this to some extent

Agreed.

I think in the browser it should also default to file:// for the virtual local filesystem. I would never expect an HTTP URL to be without its protocol specifier.

@rabernat
Contributor

I think there are strong advantages to having the url point to an actual file, not a directory / prefix. For example, I'd much rather have the canonical URL be file:///my/path/to/a/group/inner_array/zarr.json.

It's trivial for implementations to strip the zarr.json from the URL. On the other hand, without zarr.json, stores without the concept of a directory (e.g. S3) have no way to ask "does this path exist?" without performing a list-prefix type operation.
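A minimal Python sketch of that stripping step and its inverse. The helper names `node_url` and `metadata_url` are hypothetical; the point is only that a URL ending in zarr.json names a single key that a store can check with one request.

```python
from urllib.parse import urlsplit, urlunsplit

METADATA_NAME = "zarr.json"


def node_url(url: str) -> str:
    """Strip a trailing zarr.json so the URL names the array or group itself."""
    parts = urlsplit(url)
    path = parts.path
    suffix = "/" + METADATA_NAME
    if path.endswith(suffix):
        path = path[: -len(suffix)]
    return urlunsplit(parts._replace(path=path))


def metadata_url(url: str) -> str:
    """Return the URL of the zarr.json document for an array or group URL."""
    parts = urlsplit(node_url(url))
    path = parts.path.rstrip("/") + "/" + METADATA_NAME
    return urlunsplit(parts._replace(path=path))


canonical = "file:///my/path/to/a/group/inner_array/zarr.json"
print(node_url(canonical))      # file:///my/path/to/a/group/inner_array
print(metadata_url(node_url(canonical)))
```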

@jstriebel
Member

> For example, I'd much rather have the canonical URL be file:///my/path/to/a/group/inner_array/zarr.json.

Fine for me as well, no strong opinion here.

> It's trivial for implementations to strip the zarr.json from the URL. On the other hand, without zarr.json, stores without the concept of a directory (e.g. S3) have no way to ask "does this path exist?" without performing a list-prefix type operation.

I'm wondering how this would work with #192 in place, where a zarr.json is not necessarily found. For example, there could be a v2 compat extension/storage transformer, where a v2 group gets an additional zarr.json, which could make everything in the group v3 compatible. However, an array wouldn't have the zarr.json, and a client would need to traverse the parents until the group with the zarr.json is found (at least that's the current proposal in #192). In this case the URL to a non-existent zarr.json might be counter-intuitive.
Also, a client might always add the zarr.json suffix to check if the file exists. The question is mainly what the standard URL should look like that the user sees and copies between tools.

@rabernat
Contributor

I was assuming that this URL syntax would only work for V3, in which every array or group will have a zarr.json. Addressing V2 in a backwards-compatible way is definitely more tricky. Implicit groups (with no zarr.json, just a bare directory) would also be a problem.

@jbms
Contributor Author

jbms commented Jan 11, 2023

> I was assuming that this URL syntax would only work for V3, in which every array or group will have a zarr.json. Addressing V2 in a backwards-compatible way is definitely more tricky. Implicit groups (with no zarr.json, just a bare directory) would also be a problem.

We don't have any specific examples of group-level storage transformers yet, but with any group-level storage transformer, we do need the URL to separately indicate the location to at least the outermost group with a storage transformer (basically the "root"), and then the path from there; otherwise we are back to searching upwards to find the "root". To avoid confusion it would be better that all group-level storage transformers also transform the zarr.json key to something unique to that group, to ensure it is not accidentally accessed without the storage transformer.

@jstriebel
Member

> We don't have any specific examples of group-level storage transformers yet, but with any group-level storage transformer, we do need the URL to separately indicate the location to at least the outermost group with a storage transformer (basically the "root"), and then the path from there;

I agree. Probably it is ok to defer the specification of that format until we have such an extension?

> otherwise we are back to searching upwards to find the "root".

Which seems ok to me as well.

> To avoid confusion it would be better that all group-level storage transformers also transform the zarr.json key to something unique to that group, to ensure it is not accidentally accessed without the storage transformer.

👍


For the moment I'd just specify a simple canonical URL form to open groups or arrays, which is file:///my/path/to/a/group/inner_array/zarr.json for file stores.

Later, an additional form can append a string to the URL to specify a child which should be opened via a parent group, e.g. something like file:///my/path/to/a/group/zarr.json#inner_array or file:///my/path/to/a/group/zarr.json?child=inner_array, or we settle on the "searching upwards" strategy. A third alternative might be to give such groups a specific name which marks them as an entrypoint that must be loaded, e.g. file:///my/path/to/a/group.zarr_entrypoint/inner_array.zarr.json.

However, I think the simple format should stay the default for all of the different alternatives, and is enough in the context of ZEP 1.

@jstriebel moved this from In Discussion to In Review in ZEP1 Feb 22, 2023
@jstriebel moved this from In Review to Done in ZEP1 Mar 13, 2023
@jstriebel added the core-protocol-v3.1 label and removed the core-protocol-v3.0 label Mar 13, 2023