Support concat and merge dimensions with HDFReferenceRecipe #277

Open
TomAugspurger opened this issue Feb 4, 2022 · 5 comments
Comments

@TomAugspurger (Contributor) commented Feb 4, 2022

Currently, HDFReferenceRecipe doesn't work properly when a concat dimension is combined with merging multiple variables.

I suspect that this is blocked by the upstream issue in Kerchunk: fsspec/kerchunk#106 (comment). Filing this to make sure we check back here when unblocked.

Here's the code for the recipe (it runs in ~8s in the West Europe Azure region):

import planetary_computer
import adlfs
import datetime
import pangeo_forge_recipes.patterns
import pangeo_forge_recipes.recipes
import tempfile
from fsspec.implementations.local import LocalFileSystem
from pangeo_forge_recipes.storage import FSSpecTarget, MetadataTarget


fs = adlfs.AzureBlobFileSystem("ukmeteuwest")
credential = planetary_computer.sas.get_token("ukmeteuwest", "ukcp18").token
fs = adlfs.AzureBlobFileSystem("ukmeteuwest", credential=credential)
files = [x for x in fs.find("ukcp18/") if x.endswith(".nc")]
daily_files = [x for x in files if "/day/" in x]

# limit to 3 for testing
time_ranges = [f.split("_")[-1].split(".")[0] for f in daily_files][:3]
time_ranges = {i: r for i, r in enumerate(dict.fromkeys(time_ranges))}


def make_path(variable, time):
    return f"az://ukcp18/badc/ukcp18/data/land-gcm/global/60km/rcp26/01/{variable}/day/v20200302/{variable}_rcp26_land-gcm_global_60km_01_day_{time_ranges[time]}.nc"

# variables = ['clt', 'hurs', 'huss', 'pr', 'psl', 'rls', 'rss', 'tas', 'tasmax', 'tasmin']
# limit to 2 for testing
variables = ["clt", "hurs"]
variable_merge_dim = pangeo_forge_recipes.patterns.MergeDim("variable", variables)
time_concat_dim = pangeo_forge_recipes.patterns.ConcatDim("time", list(time_ranges))

pattern = pangeo_forge_recipes.patterns.FilePattern(make_path, variable_merge_dim, time_concat_dim)
recipe = pangeo_forge_recipes.recipes.reference_hdf_zarr.HDFReferenceRecipe(
    pattern,
    netcdf_storage_options=dict(account_name="ukmeteuwest", credential=credential),
    template_count=1,
)

fs_local = LocalFileSystem()

cache_dir = tempfile.TemporaryDirectory()
metadata_cache = MetadataTarget(fs_local, cache_dir.name)

target_dir = tempfile.TemporaryDirectory()
target = FSSpecTarget(fs_local, target_dir.name)

recipe.target = target
recipe.metadata_cache = metadata_cache

recipe.to_dask().compute()

And here's the code to load the result, along with the expected output from xarray.open_mfdataset.

import pathlib
import json
import itertools
import planetary_computer
import fsspec
import copy
import xarray as xr


refs = json.loads(pathlib.Path(target._full_path("reference.json")).read_text())
# Workaround a separate issue
refs["refs"] = {k: v for k, v in refs["refs"].items() if "yyymmdd" not in k}
remote_options = dict(account_name="ukmeteuwest", credential=credential)
store = fsspec.filesystem("reference", fo=refs, remote_options=remote_options).get_mapper("")
ds = xr.open_dataset(store, engine="zarr", chunks={}, consolidated=False, decode_times=False)

# compare against the expected
clt_files = [x for x in daily_files if "/clt/" in x][:3]
ds2 = xr.open_mfdataset([fs.open(f) for f in clt_files], concat_dim="time", combine="nested", join="exact")
ds2.clt

Checking the output

>>> assert ds.clt.shape == ds2.clt.shape, (ds.clt.shape, "!=", ds2.clt.shape)
AssertionError                            Traceback (most recent call last)
Input In [28], in <module>
----> 1 assert ds.clt.shape == ds2.clt.shape, (ds.clt.shape, "!=", ds2.clt.shape)

AssertionError: ((1, 21600, 324, 432), '!=', (1, 10800, 324, 432))

It seems that only the first variable, clt, is included; the second, hurs, is not present in the store / Dataset. Additionally, clt is twice as long in the time dimension as it should be.

Possibly related: the offsets discovered by kerchunk don't seem to be correct. I think they're loading the whole file rather than a single chunk, so ds.clt[0, 0].compute() is slow. That could be user error, though.
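
As a quick sanity check (a hypothetical snippet, reusing the refs dict loaded above), you can print the byte ranges kerchunk recorded for the clt chunks; if each entry's length is close to the full file size, the references are effectively whole-file reads:

# Hypothetical check: inspect the [url, offset, length] triples kerchunk wrote for clt.
# Metadata keys like "clt/.zarray" hold JSON strings, so filter to 3-element lists.
for key, ref in refs["refs"].items():
    if key.startswith("clt/") and isinstance(ref, list) and len(ref) == 3:
        url, offset, length = ref
        print(key, offset, length)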

@rabernat (Contributor) commented Feb 4, 2022

Thanks for posting, Tom. This is indeed an important and very common use case we need to support.

@martindurant (Contributor) commented:
fsspec/kerchunk#122 allows merging variables in kerchunk at the same time as concatenating dimensions, but I'm not certain you need it here.
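
For reference, a rough sketch of what driving that combine step directly might look like (the class and keyword names follow kerchunk.combine.MultiZarrToZarr as I understand it; single_file_refs and the dimension names are assumptions, not something verified against the PR):

from kerchunk.combine import MultiZarrToZarr

# single_file_refs: assumed list of per-file reference dicts,
# e.g. produced by kerchunk.hdf.SingleHdf5ToZarr for each netCDF file.
mzz = MultiZarrToZarr(
    single_file_refs,
    remote_protocol="az",                        # Azure Blob via adlfs
    remote_options={"account_name": "ukmeteuwest"},
    concat_dims=["time"],                        # concatenate along time
    identical_dims=["latitude", "longitude"],    # dims shared by the merged variables
)
combined_refs = mzz.translate()                  # one reference set for the whole store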

@TomAugspurger (Contributor, Author) commented Feb 4, 2022

[Martin] fsspec/kerchunk#122 allows merging variables in kerchunk at the same time as concatenating dimensions

Wasn't sure if this was a request to try it out or not, but an initial test using that branch failed with MultiZarrToZarr not taking xarray_open_kwargs anymore. I didn't investigate any further.


[Tom] Possibly related, but the offsets discovered by kerchunk don't seem to be correct. I think they're loading the whole file, rather than a single chunk, so ds.clt[0, 0].compute() is slow. That could be user error though.

This is looking more like user error / misunderstanding. It turns out that this dataset isn't chunked:

>>> import h5py

>>> # fsspec, pattern, and remote_options are defined in the snippets above
>>> f = h5py.File(fsspec.open(list(pattern.items())[0][1], **remote_options).open())
>>> clt = f["clt"]
>>> clt.chunks  # None: the data are stored contiguously, not chunked

And yet somehow some combination of xarray / fsspec / h5netcdf / h5py is smart enough to not request all the data when you do a slice:

%time clt[0, 0]
CPU times: user 16.4 ms, sys: 5.05 ms, total: 21.5 ms
Wall time: 64.6 ms

The full dataset is ~2GiB, which takes closer to 24s to read. We might want a separate feature for splitting chunks (should be doable, if we know the original buffer is contiguous?), but that's unrelated to this issue.
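
To illustrate the idea (purely a sketch; nothing here is existing kerchunk API, and it assumes an uncompressed, C-contiguous dataset whose starting byte offset is known, e.g. from h5py's dset.id.get_offset()):

import numpy as np

def split_contiguous(url, base_offset, shape, dtype, rows_per_chunk):
    # Hypothetical helper: split one contiguous (uncompressed) array into
    # row-block references in kerchunk's {key: [url, offset, length]} format.
    # Assumes C order, so each block of leading-axis rows is a contiguous byte range.
    itemsize = np.dtype(dtype).itemsize
    row_bytes = itemsize * int(np.prod(shape[1:]))
    refs = {}
    for i, start in enumerate(range(0, shape[0], rows_per_chunk)):
        nrows = min(rows_per_chunk, shape[0] - start)
        key = ".".join([str(i)] + ["0"] * (len(shape) - 1))   # e.g. "3.0.0.0"
        refs[key] = [url, base_offset + start * row_bytes, nrows * row_bytes]
    return refs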

@martindurant (Contributor) commented:
Wasn't sure if this was a request to try it out or not, but an initial test using that branch failed with MultiZarrToZarr not taking xarary_open_kwargs anymore. I didn't investigate any further.

Yes, the signature changed a lot (sorry) and the code no longer uses xarray at all, in favour of direct JSON mangling. The docstring is up to date, though, and I think I know the specific things that don't work, listed in the description of the PR.

We might want a separate feature for splitting chunks (should be doable, if we know the original buffer is contiguous?) but that's unrelated to this issue.

Certainly on the cards, and this is actively discussed for FITS, where uncompressed single chunks are common, but I didn't think that would be the case for HDF5.

@TomAugspurger (Contributor, Author) commented:
👍, opened fsspec/kerchunk#124 for that.
