Support concat and merge dimensions with HDFReferenceRecipe #277
Comments
Thanks for posting, Tom. This is indeed an important and very common use case we need to support.
fsspec/kerchunk#122 allows merging variables in kerchunk at the same time as concatenating dimensions, but I'm not certain you need it here.
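For context, a rough sketch of what merging variables while concatenating looks like with kerchunk's `MultiZarrToZarr`; the exact keywords on that branch may have differed at the time, and `urls` / `remote_options` are assumptions, not names from this issue:

```python
import fsspec
import kerchunk.hdf
from kerchunk.combine import MultiZarrToZarr

# Build a single-file reference set for each NetCDF file (URL list is hypothetical).
refs = []
for url in urls:
    with fsspec.open(url, **remote_options) as f:
        refs.append(kerchunk.hdf.SingleHdf5ToZarr(f, url).translate())

# Concatenate along "time"; files holding different variables are merged,
# since each one contributes its own arrays to the combined store.
combined = MultiZarrToZarr(
    refs,
    concat_dims=["time"],
    remote_options=remote_options,
).translate()
```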
Wasn't sure if this was a request to try it out or not, but an initial test using that branch failed with
This is looking more like user error / misunderstanding. It turns out that this dataset isn't chunked:

```python
>>> import fsspec
>>> import h5py
>>> f = h5py.File(fsspec.open(list(pattern.items())[0][1], **remote_options).open())
>>> clt = f["clt"]
>>> clt.chunks  # None
```

And yet somehow some combination of xarray / fsspec / h5netcdf / h5py is smart enough to not request all the data when you do a slice:

```python
%time clt[0, 0]
CPU times: user 16.4 ms, sys: 5.05 ms, total: 21.5 ms
Wall time: 64.6 ms
```

The full dataset is ~2 GiB, which takes closer to 24 s to read. We might want a separate feature for splitting chunks (should be doable, if we know the original buffer is contiguous? a sketch of the idea follows), but that's unrelated to this issue.
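To make the splitting idea concrete, a rough sketch under the stated assumption of a contiguous buffer; this is illustrative byte arithmetic, not kerchunk's actual API:

```python
import numpy as np

def split_contiguous(base_offset, shape, dtype, n_splits):
    """Yield (key, offset, length) reference entries that split one contiguous
    buffer into n_splits chunks along the leading axis."""
    itemsize = np.dtype(dtype).itemsize
    rows = shape[0] // n_splits                      # assume it divides evenly
    row_bytes = itemsize * int(np.prod(shape[1:]))   # bytes per leading-axis slice
    for i in range(n_splits):
        # Key format assumes a 3-D variable, as for clt(time, lat, lon).
        yield f"{i}.0.0", base_offset + i * rows * row_bytes, rows * row_bytes
```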
Yes, the signature changed a lot (sorry) and the code no longer uses xarray at all, in favour of direct JSON mangling. The docstring is up to date, though, and I think I know the specific things that don't work, listed in the description of the PR.
Certainly on the cards, and this is actively discussed for FITS, where uncompressed single chunks are common, but I didn't think that would be the case for HDF5.
👍, opened fsspec/kerchunk#124 for that.
Currently, `HDFReferenceRecipe` doesn't work properly when combining a concat dim with merging multiple variables. I suspect this is blocked by the upstream issue in kerchunk: fsspec/kerchunk#106 (comment). Filing this to make sure we check back here when unblocked.
Here's the code for the recipe (it runs in ~8 s in the West Europe Azure region):
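A hedged sketch of what such a recipe looks like; the URL template, dimension keys, and the `netcdf_storage_options` keyword are assumptions, not the code from this issue:

```python
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern, MergeDim
from pangeo_forge_recipes.recipes import HDFReferenceRecipe

def make_url(variable, time):
    # Hypothetical URL template; the real recipe points at NetCDF files
    # hosted in Azure Blob Storage (West Europe).
    return f"https://example.blob.core.windows.net/data/{variable}_{time}.nc"

pattern = FilePattern(
    make_url,
    MergeDim("variable", keys=["clt", "hurs"]),          # variables to merge
    ConcatDim("time", keys=["1850-1949", "1950-2014"]),  # pieces to concatenate
)

recipe = HDFReferenceRecipe(
    pattern,
    netcdf_storage_options=remote_options,  # assumed kwarg for fsspec credentials
)
```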
And here's how to load it, along with the expected output from `xarray.open_mfdataset`:
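A minimal sketch of the loading step, assuming the recipe wrote its references to `reference.json` (hypothetical path) and that `urls` is the file list behind the pattern:

```python
import fsspec
import xarray as xr

# Open the kerchunk reference set produced by the recipe.
fs = fsspec.filesystem(
    "reference",
    fo="reference.json",  # assumed output path
    remote_options=remote_options,
)
ds = xr.open_zarr(fs.get_mapper(""), consolidated=False)

# The expected result, built directly from the source files for comparison.
files = [fsspec.open(url, **remote_options).open() for url in urls]
expected = xr.open_mfdataset(files, combine="by_coords", engine="h5netcdf")
```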
**Checking the output**
It seems that the first variable, `clt`, is included. The second, `hurs`, is not present in the store / Dataset. Additionally, the one variable is twice as long in the time dimension as it should be.

Possibly related, but the offsets discovered by kerchunk don't seem to be correct. I think they're loading the whole file rather than a single chunk, so `ds.clt[0, 0].compute()` is slow. That could be user error, though.