Apply kerchunk to one-timestep-per-file (climate model) output #123
Comments
I do not see the variable/coordinate "time" in the input data.
I got the same error even with …
This just adds "time" to the list of dims but doesn't assign any values, so there's no way to determine what the output "time" coordinate should be.
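One way to supply concrete values for the output "time" coordinate is `MultiZarrToZarr`'s `coo_map` argument, which appears commented out in the script below. A minimal sketch; the choice of `"INDEX"` (and the `"cf:time"` alternative) is illustrative, not what the thread settled on:

```python
from kerchunk.combine import MultiZarrToZarr

# "INDEX" simply numbers the concatenated slices 0..N-1; "cf:time" would
# instead decode each file's CF-encoded time variable (illustrative values).
mzz = MultiZarrToZarr(
    "zip://*.json::out.zip",
    remote_protocol="memory",
    concat_dims=["time"],
    coo_map={"time": "INDEX"},
)
refs = mzz.translate()
```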
… where we use the …
Great, #122 works for me; only the final step fails.
Yep, for now it returns the references for you to deal with, but it should write directly as the previous version did.
After discussing with @d70-t, I found out that my climate model output is usually stored with no spatial chunks and the unlimited …
That's OK! See this branch, where I've made a start on netCDF3:
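The branch link itself is not preserved here, but a rough sketch of what scanning a classic-format file looks like with the `kerchunk.netCDF3` module (the module and class names are assumptions about that branch's API):

```python
from kerchunk.netCDF3 import NetCDF3ToZarr  # assumed module/class from the branch

# Same translate() pattern as SingleHdf5ToZarr, but for netCDF3/classic files.
refs = NetCDF3ToZarr("test_2000.nc", inline_threshold=100).translate()
```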
Can you share?
I create the input `.nc` files from a dummy dataset. I would have hoped the … Script:

```python
import os
import zipfile
import json
from timeit import default_timer as timer

import fsspec
import numpy as np
import xarray as xr

import kerchunk.hdf
from kerchunk.combine import MultiZarrToZarr

recreate = True
use_dummy = True  # False
nvar = 20
ntime = 50
nlev = 1

if use_dummy:
    # dummy dataset: nvar variables, 12 time steps per file
    ds = xr.DataArray(
        name='var',
        data=np.random.random((220, 256, 12, nlev, nvar)),
        dims=['x', 'y', 'time', 'depth', 'var'],
        coords={
            'x': range(220),
            'y': range(256),
            'depth': range(nlev),
            'var': [f'var{i}' for i in range(nvar)],
        },
    ).to_dataset('var')
    urls = [f'test_{y}.nc' for y in range(2000, 2000 + ntime)]
    if recreate:
        for i, u in enumerate(urls):
            # ds = ds.assign_coords(time=xr.cftime_range(start=str(2000 + i), freq='MS', periods=ds.time.size))
            ds = ds.assign_coords(time=range(2000 + i * 12, 2000 + i * 12 + 12))
            ds.astype('float32').to_netcdf(u)
else:
    # use MPIESM output
    urls = [i for i in os.listdir('.') if 'asp_esmControl' in i]

start = timer()
# storage options (meant for remote access; harmless for local files)
so = dict(anon=True, default_fill_cache=False, default_cache_type='first')

# one kerchunk reference set per input file, collected into a zip
with zipfile.ZipFile("out.zip", mode="w") as zf:
    for u in urls:
        print(u)
        with fsspec.open(u, **so) as inf:
            h5chunks = kerchunk.hdf.SingleHdf5ToZarr(inf, u, inline_threshold=100)
            with zf.open(os.path.basename(u) + ".json", 'w') as outf:
                outf.write(json.dumps(h5chunks.translate()).encode())

remote_protocol = "memory"
mzz = MultiZarrToZarr(
    "zip://*.json::out.zip",
    remote_protocol=remote_protocol,
    concat_dims=["time"],
    # coo_map={"time": "INDEX"},
)
print(mzz)
test_dict = mzz.translate()
end = timer()
print('create json ' + str(round(end - start, 3)) + 's')
# print(test_dict)
# mzz.translate("output.zarr")
# This can also be written as a json:
# mzz.translate("output.json")

combine = 'nested'
concat_dim = 'time'
m = fsspec.get_mapper('reference://', fo=test_dict, remote_protocol=remote_protocol)
ker = xr.open_dataset(m, engine='zarr', backend_kwargs={"consolidated": False})
x = xr.open_mfdataset(urls, concat_dim=concat_dim, combine=combine)
print(f'{ker.sizes} {ker.nbytes * 1e-9} GB')
print(f'{x.sizes} {x.nbytes * 1e-9} GB')


def shuffle(ds):
    # random-order access along time, then a reduction, to force reads
    idx = np.arange(ds.time.size)
    np.random.shuffle(idx)
    return (ds.isel(time=idx) * ds).mean('time')


# time the kerchunk/zarr route
print('start shuffle(xr.open_dataset(json, engine="zarr")).compute()')
start = timer()
shuffle(xr.open_dataset(
    m,
    engine='zarr',
    backend_kwargs={
        "consolidated": False,
        "storage_options": {"fo": test_dict, "remote_protocol": remote_protocol},
    },
)).compute()
end = timer()
print('took ' + str(round(end - start, 3)) + 's')

# time plain xr.open_mfdataset(urls)
print('start shuffle(xr.open_mfdataset("*")).compute()')
start = timer()
shuffle(xr.open_mfdataset(urls, concat_dim=concat_dim, combine=combine)).compute()
end = timer()
print('took ' + str(round(end - start, 3)) + 's')

print('start shuffle(xr.open_mfdataset("*", parallel=True, coords="minimal", data_vars="minimal", compat="override")).compute()')
start = timer()
shuffle(xr.open_mfdataset(urls, combine=combine, concat_dim=concat_dim, parallel=True,
                          coords="minimal", data_vars="minimal", compat="override")).compute()
end = timer()
print('took ' + str(round(end - start, 3)) + 's')

print('start shuffle(xr.open_mfdataset("*", parallel=False, coords="minimal", data_vars="minimal", compat="override")).compute()')
start = timer()
shuffle(xr.open_mfdataset(urls, concat_dim=concat_dim, combine=combine, parallel=False,
                          coords="minimal", data_vars="minimal", compat="override")).compute()
end = timer()
print('took ' + str(round(end - start, 3)) + 's')

print("")
print("end")
```

Output:
The biggest difference here is that the kerchunk-zarr dataset is not being opened with dask. You need to add …
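Presumably this refers to xarray's `chunks` argument; a minimal sketch, assuming `test_dict` is the reference dict from `mzz.translate()`:

```python
import fsspec
import xarray as xr

m = fsspec.get_mapper("reference://", fo=test_dict, remote_protocol="memory")

# chunks={} wraps each zarr chunk in a dask array instead of loading the
# data eagerly into numpy, so the comparison with open_mfdataset is fair.
ker = xr.open_dataset(m, engine="zarr", chunks={},
                      backend_kwargs={"consolidated": False})
```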
So with default threaded Dask, kerchunk is somewhat slower (I suspect xarray is doing some caching of its own here, cc @rabernat ?), but with distributed it is somewhat faster (I suspect distributed's cache is efficient here). Kerchunk is always much faster to open the dataset, which is a small part of the overall cost in this case. Note that kerchunk allows you to choose chunk sizes bigger than the native ones in the data (e.g., …).
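A sketch of the two setups being compared; the distributed client and the coarser-than-native chunk size are illustrative assumptions, not values from the thread:

```python
import fsspec
import xarray as xr
from dask.distributed import Client

client = Client()  # distributed scheduler instead of the default threaded one

m = fsspec.get_mapper("reference://", fo=test_dict, remote_protocol="memory")
# A time chunk spanning several 12-step files: fewer, larger read tasks.
ker = xr.open_dataset(m, engine="zarr", chunks={"time": 48},
                      backend_kwargs={"consolidated": False})
```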
`MultiZarrToZarr.translate` raises `TypeError: shape is None` from zarr.
On second thoughts, since the data is uncompressed, it may well be that h5py can read it directly without copy and without the GIL, whereas kerchunk/referenceFS deals with bytes. We could implement opening as mmap and passing around …
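As a toy illustration of that idea (pure stdlib, not a kerchunk API; the offset and size are hypothetical):

```python
import mmap
import numpy as np

with open("test_2000.nc", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    offset, size = 4096, 220 * 256 * 4            # hypothetical chunk location
    view = memoryview(mm)[offset:offset + size]   # zero-copy slice of the map
    arr = np.frombuffer(view, dtype="<f4")        # wraps the buffer, no copy
```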
I want to use `kerchunk` to open climate model netcdf output, stored 1 or 12 timestep/s per file across many files, as one huge `zarr` dataset. I do not want to convert the `.nc` files to `zarr`, because I don't have the storage to keep both and need to keep the `nc` files; `kerchunk` could help me. I wrote a small example below, but it fails with a `zarr` shape error which I do not understand. I get `SingleHdf5ToZarr` running, but `MultiZarrToZarr` fails. Does anyone see a fix for this? Is this example supposed to work?

Output:

Versions: