How to increase the speed of saving ERA5 chunks data? #65

Open

yangxianke opened this issue Dec 7, 2023 · 3 comments
@yangxianke

yangxianke commented Dec 7, 2023

Hi, everyone. It is really convenient to access ERA5 data from cloud storage. However, saving the processed data in netCDF format is very slow: it has been running for 40 minutes so far and still has not finished. How can I speed up saving ERA5 chunked data? This is my code.

import xarray as xr
reanalysis = xr.open_zarr(
    'gs://gcp-public-data-arco-era5/ar/1959-2022-full_37-1h-0p25deg-chunk-1.zarr-v2', 
    chunks={'time': 48},
    consolidated=True)

## data_CN Size: 119G
data_CN = reanalysis["2m_temperature"].loc["1961-01-01":'2020-12-31',60:0,70:130]  

## data_CN_daily Size: 159M  (note: resample(time="M") gives monthly means, not daily)
data_CN_daily = data_CN.resample(time="M").mean()

data_CN_daily = data_CN_daily.compute()    ## This step takes a long time (at least 50min+)

data_CN_daily.to_netcdf("data_ch.cn")
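
For context, the chunk layout of the store can be checked directly. A minimal sketch; whatever it prints is what the store reports, but if each chunk spans the whole globe, any spatial subset still has to download full-globe chunks:

import xarray as xr

reanalysis = xr.open_zarr(
    'gs://gcp-public-data-arco-era5/ar/1959-2022-full_37-1h-0p25deg-chunk-1.zarr-v2',
    consolidated=True)

# On-disk chunk shape recorded by the zarr backend, and the dask chunking
# xarray actually uses after opening the store.
print(reanalysis['2m_temperature'].encoding.get('chunks'))
print(reanalysis['2m_temperature'].chunks)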
@yangxianke yangxianke changed the title How to increase the speed of storing ERA5 chunks data? How to increase the speed of saving ERA5 chunks data? Dec 7, 2023
@thorinf-orca

import os
import os.path as osp

subset = ds[var].sel(latitude=latitude_slice, longitude=longitude_slice)
path = osp.join(out_dir, f"{var}_{region}.zarr")
mode = 'a' if osp.exists(path) and os.listdir(path) else 'w'
subset.to_zarr(path, mode=mode, compute=True)

I'm trying something similar, but the footprint of my output zarr is growing very slowly, 0.5 MB/s at best.
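
If the bottleneck is many tiny writes to the target store, one thing worth trying is coarsening the output chunks and running under a dask.distributed client. A rough sketch, reusing the names from the snippet above (ds, var, latitude_slice, longitude_slice, out_dir, region); the chunk size is illustrative:

import os.path as osp
from dask.distributed import Client

client = Client()  # local cluster; the dashboard helps spot straggling tasks

subset = ds[var].sel(latitude=latitude_slice, longitude=longitude_slice)
# Fewer, larger chunks mean fewer small writes to the target store.
subset = subset.chunk({'time': 24 * 30})
subset.to_zarr(osp.join(out_dir, f"{var}_{region}.zarr"), mode='w')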

@shoyer
Collaborator

shoyer commented Jun 14, 2024

The problem is that the data is stored in a way that only makes it efficient to access all locations at once. If you slice out a small area and load all times, you are effectively loading data for the entire globe.

To fix this, you could use a tool like "rechunker" to convert these arrays into a format that allows for efficient queries across time: https://medium.com/pangeo/rechunker-the-missing-link-for-chunked-array-analytics-5b2359e9dc11
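
A rough sketch of what that could look like with rechunker; the target chunk sizes and store paths are illustrative assumptions, not a tested recipe, and rechunking the full record is itself a large batch job:

import xarray as xr
from rechunker import rechunk

source = xr.open_zarr(
    'gs://gcp-public-data-arco-era5/ar/1959-2022-full_37-1h-0p25deg-chunk-1.zarr-v2',
    consolidated=True)[['2m_temperature']]

# Flip the layout: long runs of time per chunk, small spatial tiles.
# None means "leave that coordinate's chunking alone".
target_chunks = {
    '2m_temperature': {'time': 8760, 'latitude': 8, 'longitude': 8},
    'time': None, 'latitude': None, 'longitude': None,
}

plan = rechunk(source, target_chunks, max_mem='2GB',
               target_store='era5_t2m_timewise.zarr',
               temp_store='rechunk_tmp.zarr')
plan.execute()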

@chudlerk

> The problem is that the data is stored in a way that only makes it efficient to access all locations at once. If you slice out a small area and load all times, you are effectively loading data for the entire globe.
>
> To fix this, you could use a tool like "rechunker" to convert these arrays into a format that allows for efficient queries across time: https://medium.com/pangeo/rechunker-the-missing-link-for-chunked-array-analytics-5b2359e9dc11

Could you provide an example of the optimal way to do this? Let's say I just need data at one latitude/longitude/level, but for the entire record.
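
One possible shape for such a query, assuming a time-wise rechunked copy like the hypothetical output of the rechunker sketch above; against the original store, the same .sel() would still pull full-globe chunks:

import xarray as xr

ds = xr.open_zarr('era5_t2m_timewise.zarr')  # hypothetical rechunked copy

point = ds['2m_temperature'].sel(latitude=40.0, longitude=105.0,
                                 method='nearest')
series = point.compute()  # reads only the small spatial tile, for all times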
