Support concat and merge dimensions with HDFReferenceRecipe #277

Open
TomAugspurger opened this issue Feb 4, 2022 · 5 comments
Comments

@TomAugspurger (Contributor) commented Feb 4, 2022

Currently, HDFReferenceRecipe doesn't work properly when a concat dimension is combined with merging multiple variables.

I suspect that this is blocked by the upstream issue in Kerchunk: fsspec/kerchunk#106 (comment). Filing this to make sure we check back here when unblocked.

Here's the code for the recipe (it runs in ~8s in the West Europe Azure region):

import planetary_computer
import adlfs
import datetime
import pangeo_forge_recipes.patterns
import pangeo_forge_recipes.recipes
import tempfile
from fsspec.implementations.local import LocalFileSystem
from pangeo_forge_recipes.storage import FSSpecTarget, MetadataTarget


fs = adlfs.AzureBlobFileSystem("ukmeteuwest")
credential = planetary_computer.sas.get_token("ukmeteuwest", "ukcp18").token
fs = adlfs.AzureBlobFileSystem("ukmeteuwest", credential=credential)
files = [x for x in fs.find("ukcp18/") if x.endswith(".nc")]
daily_files = [x for x in files if "/day/" in x]

# limit to 3 for testing
time_ranges = [f.split("_")[-1].split(".")[0] for f in daily_files][:3]
time_ranges = {i: r for i, r in enumerate(dict.fromkeys(time_ranges))}


def make_path(variable, time):
    return f"az://ukcp18/badc/ukcp18/data/land-gcm/global/60km/rcp26/01/{variable}/day/v20200302/{variable}_rcp26_land-gcm_global_60km_01_day_{time_ranges[time]}.nc"

# variables = ['clt', 'hurs', 'huss', 'pr', 'psl', 'rls', 'rss', 'tas', 'tasmax', 'tasmin']
# limit to 2 for testing
variables = ["clt", "hurs"]
variable_merge_dim = pangeo_forge_recipes.patterns.MergeDim("variable", variables)
time_concat_dim = pangeo_forge_recipes.patterns.ConcatDim("time", list(time_ranges))

pattern = pangeo_forge_recipes.patterns.FilePattern(make_path, variable_merge_dim, time_concat_dim)
recipe = pangeo_forge_recipes.recipes.reference_hdf_zarr.HDFReferenceRecipe(
    pattern,
    netcdf_storage_options=dict(account_name="ukmeteuwest", credential=credential),
    template_count=1,
)

fs_local = LocalFileSystem()

cache_dir = tempfile.TemporaryDirectory()
metadata_cache = MetadataTarget(fs_local, cache_dir.name)

target_dir = tempfile.TemporaryDirectory()
target = FSSpecTarget(fs_local, target_dir.name)

recipe.target = target
recipe.metadata_cache = metadata_cache

recipe.to_dask().compute()

And here's the code to load the result, along with the expected output from xarray.open_mfdataset.

import pathlib
import json
import itertools
import planetary_computer
import fsspec
import copy
import xarray as xr


refs = json.loads(pathlib.Path(target._full_path("reference.json")).read_text())
# Workaround a separate issue
refs["refs"] = {k: v for k, v in refs["refs"].items() if "yyymmdd" not in k}
remote_options = dict(account_name="ukmeteuwest", credential=credential)
store = fsspec.filesystem("reference", fo=refs, remote_options=remote_options).get_mapper("")
ds = xr.open_dataset(store, engine="zarr", chunks={}, consolidated=False, decode_times=False)

# compare against the expected
clt_files = [x for x in daily_files if "/clt/" in x][:3]
ds2 = xr.open_mfdataset([fs.open(f) for f in clt_files], concat_dim="time", combine="nested", join="exact")
ds2.clt

Checking the output

>>> assert ds.clt.shape == ds2.clt.shape, (ds.clt.shape, "!=", ds2.clt.shape)
AssertionError                            Traceback (most recent call last)
Input In [28], in <module>
----> 1 assert ds.clt.shape == ds2.clt.shape, (ds.clt.shape, "!=", ds2.clt.shape)

AssertionError: ((1, 21600, 324, 432), '!=', (1, 10800, 324, 432))

It seems that only the first variable, clt, is included; the second, hurs, is not present in the store / Dataset. Additionally, clt is twice as long in the time dimension as it should be.

Possibly related: the offsets discovered by kerchunk don't seem to be correct. I think they're loading the whole file rather than a single chunk, so ds.clt[0, 0].compute() is slow. That could be user error, though.
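
As a quick sanity check (a hypothetical snippet, reusing the refs dict loaded above), you can print the byte ranges kerchunk recorded for the clt chunks; if each entry's length is close to the full file size, the references are effectively whole-file reads:

# Hypothetical check: inspect the [url, offset, length] triples kerchunk wrote for clt.
# Metadata keys like "clt/.zarray" hold JSON strings, so filter to 3-element lists.
for key, ref in refs["refs"].items():
    if key.startswith("clt/") and isinstance(ref, list) and len(ref) == 3:
        url, offset, length = ref
        print(key, offset, length)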

@rabernat (Contributor) commented Feb 4, 2022

Thanks for posting, Tom. This is indeed an important and very common use case we need to support.

@martindurant (Contributor) commented:
fsspec/kerchunk#122 allows merging variables in kerchunk at the same time as concatenating dimensions, but I'm not certain you need it here.
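
For reference, a rough sketch of what driving that combine step directly might look like (the class and keyword names follow kerchunk.combine.MultiZarrToZarr as I understand it; single_file_refs and the dimension names are assumptions, not something verified against the PR):

from kerchunk.combine import MultiZarrToZarr

# single_file_refs: assumed list of per-file reference dicts,
# e.g. produced by kerchunk.hdf.SingleHdf5ToZarr for each netCDF file.
mzz = MultiZarrToZarr(
    single_file_refs,
    remote_protocol="az",                        # Azure Blob via adlfs
    remote_options={"account_name": "ukmeteuwest"},
    concat_dims=["time"],                        # concatenate along time
    identical_dims=["latitude", "longitude"],    # dims shared by the merged variables
)
combined_refs = mzz.translate()                  # one reference set for the whole store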

@TomAugspurger (Contributor, Author) commented Feb 4, 2022

[Martin] fsspec/kerchunk#122 allows merging variables in kerchunk at the same time as concatenating dimensions

Wasn't sure if this was a request to try it out or not, but an initial test using that branch failed with MultiZarrToZarr not taking xarray_open_kwargs anymore. I didn't investigate any further.


[Tom] Possibly related, but the offsets discovered by kerchunk don't seem to be correct. I think they're loading the whole file, rather than a single chunk, so ds.clt[0, 0].compute() is slow. That could be user error though.

This is looking more like user error / misunderstanding. It turns out that this dataset isn't chunked:

>>> import h5py

>>> # fsspec, pattern, and remote_options are defined in the snippets above
>>> f = h5py.File(fsspec.open(list(pattern.items())[0][1], **remote_options).open())
>>> clt = f["clt"]
>>> clt.chunks  # None: the data are stored contiguously, not chunked

And yet somehow some combination of xarray / fsspec / h5netcdf / h5py is smart enough to not request all the data when you do a slice:

%time clt[0, 0]
CPU times: user 16.4 ms, sys: 5.05 ms, total: 21.5 ms
Wall time: 64.6 ms

The full dataset is ~2GiB, which takes closer to 24s to read. We might want a separate feature for splitting chunks (should be doable, if we know the original buffer is contiguous?), but that's unrelated to this issue.
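
To illustrate the idea (purely a sketch; nothing here is existing kerchunk API, and it assumes an uncompressed, C-contiguous dataset whose starting byte offset is known, e.g. from h5py's dset.id.get_offset()):

import numpy as np

def split_contiguous(url, base_offset, shape, dtype, rows_per_chunk):
    # Hypothetical helper: split one contiguous (uncompressed) array into
    # row-block references in kerchunk's {key: [url, offset, length]} format.
    # Assumes C order, so each block of leading-axis rows is a contiguous byte range.
    itemsize = np.dtype(dtype).itemsize
    row_bytes = itemsize * int(np.prod(shape[1:]))
    refs = {}
    for i, start in enumerate(range(0, shape[0], rows_per_chunk)):
        nrows = min(rows_per_chunk, shape[0] - start)
        key = ".".join([str(i)] + ["0"] * (len(shape) - 1))   # e.g. "3.0.0.0"
        refs[key] = [url, base_offset + start * row_bytes, nrows * row_bytes]
    return refs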

@martindurant (Contributor) commented:
Wasn't sure if this was a request to try it out or not, but an initial test using that branch failed with MultiZarrToZarr not taking xarary_open_kwargs anymore. I didn't investigate any further.

Yes, the signature changed a lot (sorry) and the code no longer uses xarray at all, in favour of direct JSON mangling. The docstring is up to date, though, and I think I know the specific things that don't work, listed in the description of the PR.

We might want a separate feature for splitting chunks (should be doable, if we know the original buffer is contiguous?) but that's unrelated to this issue.

Certainly on the cards, and this is actively discussed for FITS, where uncompressed single chunks are common, but I didn't think that would be the case for HDF5.

@TomAugspurger (Contributor, Author) commented:
👍, opened fsspec/kerchunk#124 for that.
