How to increase the speed of saving ERA5 chunks data? #65

Open

yangxianke opened this issue Dec 7, 2023 · 3 comments
@yangxianke

yangxianke commented Dec 7, 2023

Hi, everyone. It is really convenient to access ERA5 data from cloud storage. However, saving the processed data in netCDF format is very slow: it has been running for 40 minutes so far and still has not finished. How can I speed up saving ERA5 chunked data? This is my code.

import xarray as xr
reanalysis = xr.open_zarr(
    'gs://gcp-public-data-arco-era5/ar/1959-2022-full_37-1h-0p25deg-chunk-1.zarr-v2', 
    chunks={'time': 48},
    consolidated=True)

## data_CN Size: 119G
data_CN = reanalysis["2m_temperature"].loc["1961-01-01":'2020-12-31',60:0,70:130]  

## data_CN_daily Size: 159M  (note: resample(time="M") gives monthly means, not daily)
data_CN_daily = data_CN.resample(time="M").mean()

data_CN_daily = data_CN_daily.compute()    ## This step takes a long time (at least 50min+)

data_CN_daily.to_netcdf("data_ch.cn")
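
For context, the chunk layout of the store can be checked directly. A minimal sketch; whatever it prints is what the store reports, but if each chunk spans the whole globe, any spatial subset still has to download full-globe chunks:

import xarray as xr

reanalysis = xr.open_zarr(
    'gs://gcp-public-data-arco-era5/ar/1959-2022-full_37-1h-0p25deg-chunk-1.zarr-v2',
    consolidated=True)

# On-disk chunk shape recorded by the zarr backend, and the dask chunking
# xarray actually uses after opening the store.
print(reanalysis['2m_temperature'].encoding.get('chunks'))
print(reanalysis['2m_temperature'].chunks)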
@yangxianke yangxianke changed the title How to increase the speed of storing ERA5 chunks data? How to increase the speed of saving ERA5 chunks data? Dec 7, 2023
@thorinf-orca

import os
import os.path as osp

subset = ds[var].sel(latitude=latitude_slice, longitude=longitude_slice)
path = osp.join(out_dir, f"{var}_{region}.zarr")
mode = 'a' if osp.exists(path) and os.listdir(path) else 'w'
subset.to_zarr(path, mode=mode, compute=True)

I'm trying something similar, but the footprint of my output zarr is growing very slowly, 0.5 MB/s at best.
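
If the bottleneck is many tiny writes to the target store, one thing worth trying is coarsening the output chunks and running under a dask.distributed client. A rough sketch, reusing the names from the snippet above (ds, var, latitude_slice, longitude_slice, out_dir, region); the chunk size is illustrative:

import os.path as osp
from dask.distributed import Client

client = Client()  # local cluster; the dashboard helps spot straggling tasks

subset = ds[var].sel(latitude=latitude_slice, longitude=longitude_slice)
# Fewer, larger chunks mean fewer small writes to the target store.
subset = subset.chunk({'time': 24 * 30})
subset.to_zarr(osp.join(out_dir, f"{var}_{region}.zarr"), mode='w')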

@shoyer
Collaborator

shoyer commented Jun 14, 2024

The problem is that the data is stored in a way that only makes it efficient to access all locations at once. If you slice out a small area and load all times, you are effectively loading data for the entire globe.

To fix this, you could use a tool like "rechunker" to convert these arrays into a format that allows for efficient queries across time: https://medium.com/pangeo/rechunker-the-missing-link-for-chunked-array-analytics-5b2359e9dc11
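
A rough sketch of what that could look like with rechunker; the target chunk sizes and store paths are illustrative assumptions, not a tested recipe, and rechunking the full record is itself a large batch job:

import xarray as xr
from rechunker import rechunk

source = xr.open_zarr(
    'gs://gcp-public-data-arco-era5/ar/1959-2022-full_37-1h-0p25deg-chunk-1.zarr-v2',
    consolidated=True)[['2m_temperature']]

# Flip the layout: long runs of time per chunk, small spatial tiles.
# None means "leave that coordinate's chunking alone".
target_chunks = {
    '2m_temperature': {'time': 8760, 'latitude': 8, 'longitude': 8},
    'time': None, 'latitude': None, 'longitude': None,
}

plan = rechunk(source, target_chunks, max_mem='2GB',
               target_store='era5_t2m_timewise.zarr',
               temp_store='rechunk_tmp.zarr')
plan.execute()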

@chudlerk

> The problem is that the data is stored in a way that only makes it efficient to access all locations at once. If you slice out a small area and load all times, you are effectively loading data for the entire globe.
>
> To fix this, you could use a tool like "rechunker" to convert these arrays into a format that allows for efficient queries across time: https://medium.com/pangeo/rechunker-the-missing-link-for-chunked-array-analytics-5b2359e9dc11

Could you provide an example of the optimal way to do this? Let's say I just need data at one latitude/longitude/level, but for the entire record.
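
One possible shape for such a query, assuming a time-wise rechunked copy like the hypothetical output of the rechunker sketch above; against the original store, the same .sel() would still pull full-globe chunks:

import xarray as xr

ds = xr.open_zarr('era5_t2m_timewise.zarr')  # hypothetical rechunked copy

point = ds['2m_temperature'].sel(latitude=40.0, longitude=105.0,
                                 method='nearest')
series = point.compute()  # reads only the small spatial tile, for all times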
