
Add the ability to split large chunks #124

Open
TomAugspurger opened this issue Feb 4, 2022 · 2 comments
TomAugspurger (Contributor) commented Feb 4, 2022
Currently, translating HDF5 to Zarr results in a Zarr store with the same chunking as the source. If the source isn't chunked, this hurts performance when you slice a subset of the original data, since fsspec must request the full byte range of the single large chunk.
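For context, a kerchunk reference set maps each Zarr chunk key to a (url, offset, length) triple, so an unchunked HDF5 dataset becomes one enormous entry, and any slice pays for the whole range. A minimal sketch of the idea (the path, offset, and length here are invented for illustration):

```python
# Sketch of a kerchunk-style reference set. Every value under "refs" is
# either inline metadata or a [url, byte_offset, byte_length] triple.
# The URL, offset, and length below are invented for illustration only.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        # An unchunked HDF5 dataset translates to a single huge chunk:
        "clt/0.0.0": ["az://ukcp18/example.nc", 29056, 1_500_000_000],
    },
}

url, offset, length = refs["refs"]["clt/0.0.0"]
# Reading even one element of "clt" makes fsspec request all `length` bytes.
print(f"any slice costs a {length / 1e9:.1f} GB range request")
```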

Here's a kerchunked file:

import planetary_computer
import adlfs
import kerchunk.hdf
import xarray as xr
import fsspec

credential = planetary_computer.sas.get_token("ukmeteuwest", "ukcp18").token
file = "az://ukcp18/badc/ukcp18/data/land-gcm/global/60km/rcp26/01/clt/day/v20200302/clt_rcp26_land-gcm_global_60km_01_day_18991201-19091130.nc"

storage_options = dict(account_name="ukmeteuwest", credential=credential)
with fsspec.open(file, **storage_options) as f:
    d = kerchunk.hdf.SingleHdf5ToZarr(f, file).translate()

store = fsspec.filesystem("reference", fo=d, remote_options=storage_options).get_mapper("")

ds = xr.open_zarr(store, consolidated=False, chunks={})
ds

Timing small reads

%time ds.clt[0, 0].compute()
CPU times: user 5.54 s, sys: 1.5 s, total: 7.04 s
Wall time: 23.3 s

Compared with the non-kerchunked version

ds2 = xr.open_dataset(fsspec.open(file, **storage_options).open(), engine="h5netcdf")

%time ds2.clt[0, 0].compute()
CPU times: user 22.2 ms, sys: 8.51 ms, total: 30.7 ms
Wall time: 70.2 ms

Having the flexibility to make smaller requests by splitting large ranges into separate chunks would be helpful, if it's feasible for the backend (which it should be for these large, contiguous buffers from HDF5).
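As a sketch of what such splitting could look like, here is a hypothetical helper (not part of kerchunk's API) that divides one reference into contiguous byte-range sub-references. This only works for uncompressed chunks, where any byte range is independently readable:

```python
def split_reference(url, offset, length, n_parts):
    """Split one [url, offset, length] reference into n_parts contiguous
    byte-range sub-references. Hypothetical helper for illustration; only
    valid for uncompressed chunks, where byte ranges map directly onto
    array elements."""
    step = -(-length // n_parts)  # ceiling division
    parts = []
    start = 0
    while start < length:
        size = min(step, length - start)
        parts.append([url, offset + start, size])
        start += size
    return parts

# 10,000 bytes starting at offset 4096, split four ways:
subrefs = split_reference("az://container/file.nc", 4096, 10_000, 4)
assert [r[1] for r in subrefs] == [4096, 6596, 9096, 11596]
assert sum(r[2] for r in subrefs) == 10_000
```

Each sub-reference could then back a smaller Zarr chunk, so a small slice only triggers a small range request.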

TomAugspurger changed the title from "Add the ability to split chunks" to "Add the ability to split large chunks" Feb 4, 2022
martindurant (Member) commented Feb 4, 2022

if it's feasible for the backend

NOT when that chunk is compressed with something that lacks clear internal blocks (e.g., gzip). Zarr does not support streaming of any sort; it only knows which blocks you want, so you need to be able to cleanly subdivide the whole thing. That is easy for uncompressed buffers and possible for block-compressed buffers (e.g., Zstd).
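The gzip constraint is easy to demonstrate with the standard library: a DEFLATE stream can only be decoded from its start, so a byte range cut from the middle of a compressed chunk is not independently decodable:

```python
import zlib

data = bytes(range(256)) * 64  # 16 KiB of sample data
compressed = zlib.compress(data)

# The full stream round-trips fine.
assert zlib.decompress(compressed) == data

# But a byte range taken from the middle is not decodable on its own:
# the stream header and back-references to earlier bytes are missing.
try:
    zlib.decompress(compressed[len(compressed) // 2 :])
except zlib.error as e:
    print("cannot decode a mid-stream byte range:", e)
```

Block-compressed formats avoid this by making each internal block decodable on its own, at the cost of recording the block boundaries.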

I wonder if in your example it is as fast to get the last point of the array?

martindurant (Member) commented

cf. #134, which is a similar concept (@d70-t)
