Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Zarr compatibility functions #478

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

ghidalgo3
Copy link
Contributor

This is a continuation of #475 because I identified that there are several more behavior changes in zarr v3. Going forward, the utils.py file will abstract away the differences between ZarrV2 and ZarrV3 depending on what the user has installed.

Note that @jhamman's PR #292 may now be outdated, though I'll merge those commits into this PR if its desired to expose a zarr_version to kerchunk callers.

@ghidalgo3 ghidalgo3 marked this pull request as ready for review July 16, 2024 00:53
kerchunk/fits.py Show resolved Hide resolved
@@ -116,9 +140,32 @@ def rename_target_files(
ujson.dump(new, f)


def _encode_for_JSON(store):
def zarr_init_group_and_store(store=None, zarr_version=None, overwrite=True):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth checking whether these new are appropriate for the public API of kerchunk (IMO, no. I'd recommend prefixing the function names with underscores or putting them in a _utils.py.

@@ -116,9 +140,32 @@ def rename_target_files(
ujson.dump(new, f)


def _encode_for_JSON(store):
def zarr_init_group_and_store(store=None, zarr_version=None, overwrite=True):
zarr_version = zarr_version or 2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you clarify the semantics of zarr_version here (and in zarr_open)? What's the behavior of each of these cases?

zarr-python-library-version zarr_version behavior
2.x None write zarr v2
2.x 2 write zarr v2
2.x 3 error
3.x None write zarr v2 or v3?
3.x 2 write zarr v2
3.x 3 write zarr v3

Really it's just the case of zarr_version=None with zarr-python 3.x that I'm unsure about. i.e. what's the default behavior: write zarr v2, or write whatever version is the default for that version of zarr-python?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also worth confirming that we do error for zarr_version=3 with zarr-python=2, to not silently ignore that keyword argument.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Surprisingly, if you have zarrv2 installed and you set zarr_version=3 then zarr will accept that, issue a warning, and give you a valid group. I need to re-run these tests with zarrv3 installed and see what happens in those cases.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Zarr2 (i.e., mainline released version) has had v3 internally for quite a while, but the implementation is different. It should still conform to the v3 spec, though!

@martindurant
Copy link
Member

Thank you for putting this together, you have clearly done a careful job (and thanks to @jhamman for the earlier attempt). This will take some time to go through.

I would encourage you to also look at the implementation in fsspec.implementations.reference, where I expect there will be more zarr2-specific assumptions, particularly in the parquet implementation.

@martindurant
Copy link
Member

Following @TomAugspurger 's comments, it may well make sense to have None mean V2 for now, whichever zarr version is installed

@ghidalgo3
Copy link
Contributor Author

Thank you for putting this together, you have clearly done a careful job (and thanks to @jhamman for the earlier attempt). This will take some time to go through.

I would encourage you to also look at the implementation in fsspec.implementations.reference, where I expect there will be more zarr2-specific assumptions, particularly in the parquet implementation.

In fsspec I found that fsspec.mapping.FSMap or LazyReferenceMapper objects are passed to zarr.open and zarr.group calls. In ZarrV2, that worked because both of those types are MutableMappings and ZarrV2 knew how to work with MutableMapping stores.

However in ZarrV3, you need to pass a StoreLike object to any of the functions that previously expected MutableMappings. That means fsspec isn't compatible with zarrv3 either :) I'm not sure if these compatibility functions should also be defined at the fsspec layer or if zarr should provide a way to turn MutableMappings into one of its Store objects and preserve API compatibility. I think the latter is preferrable because that allows more users (not just fsspec and kerchunk) to easily adopt zarrv3.

@martindurant
Copy link
Member

I found that fsspec.mapping.FSMap or LazyReferenceMapper objects are passed to zarr.open and zarr.group calls.

No, this is not the model.

  • LazyReferenceMapper provides references to referenceFS. It is also map-like, but further work must be done to turn references to bytes
  • FSMap can be passed directly to zarr2, but it wraps it in a Store too, made for generic dict-like input
  • there is also an FSspecStore, which is what you get if you pass a URL to zarr.open
  • v3 has its own version of fsspecstore Basic working FsspecStore zarr-developers/zarr-python#1785

@ghidalgo3
Copy link
Contributor Author

Sorry I didn't grok that. Here I see fsspec tests creating LazyReferenceMappers and passing them to zarr.open. In ZarrV3, all the object creation functions run the StoreLike through this function. I don't see how a RemoteStore would be created unless a user explicitly passed it to the zarr.open call like this:

store = RemoteStore(url="abfs://...")
zarr.open(store=store)

there is also an FSspecStore, which is what you get if you pass a URL to zarr.open

Unless I'm missing it, passing a URL to zarr.open in zarrV3 will return a local store after it's run through make_store_path:

def make_store_path(store_like: StoreLike | None, *, mode: OpenMode | None = None) -> StorePath:
    if isinstance(store_like, StorePath):
        if mode is not None:
            assert mode == store_like.store.mode
        return store_like
    elif isinstance(store_like, Store):
        if mode is not None:
            assert mode == store_like.mode
        return StorePath(store_like)
    elif store_like is None:
        if mode is None:
            mode = "w"  # exception to the default mode = 'r'
        return StorePath(MemoryStore(mode=mode))
    elif isinstance(store_like, str):
        return StorePath(LocalStore(Path(store_like), mode=mode or "r"))
    raise TypeError

Did this work in ZarrV2?

zarr.open(store="abfs://...")?

@martindurant
Copy link
Member

passing a URL to zarr.open in zarrV3 will return a local store after it's run through make_store_path

zarr.open(store="abfs://...")?

Yes, this most certainly worked, and if it no longer does, I consider this a major regression and worse user experience. The optional argument storage_options gets passed to fsspec.

I see fsspec tests creating LazyReferenceMappers and passing them to zarr.open

I suppose you are right and this works too - the lazy mapper is also dict-like. It's probably a case of my lazyness in the tests; it only works where the "files" are stored as raw binary, not as references (which is the case for the tests).

@martindurant
Copy link
Member

cc @jhamman

@ghidalgo3
Copy link
Contributor Author

Yes, this most certainly worked, and if it no longer does, I consider this a major regression and worse user experience. The optional argument storage_options gets passed to fsspec.

It doesn't work in ZarrV3 yet:

# Missing storage_options but it's irrelevant
# this eventually tries to construct a StorePath(LocalStore, 'file://abfs:/daymet-zarr/daily/hi.zarr')
# and will throw because it's not writable
zarr.open(store="abfs://daymet-zarr/daily/hi.zarr")

I see fsspec tests creating LazyReferenceMappers and passing them to zarr.open

I suppose you are right and this works too - the lazy mapper is also dict-like. It's probably a case of my lazyness in the tests; it only works where the "files" are stored as raw binary, not as references (which is the case for the tests).

Laziness can be good, and your tests could probably use the MemoryStore at some point, but ZarrV3 shouldn't require you to change your call sites IMO.

@martindurant
Copy link
Member

Here probably - explicitly expects .zarray in every path containing data.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants