Improve CMEMS download performance #1033

Open · veenstrajelmer opened this issue Oct 23, 2024 · 0 comments

veenstrajelmer (Collaborator) commented Oct 23, 2024

Downloading long time series from CMEMS with dfm_tools is slow, even though the actual download happens at a daily frequency. This is probably because, by default, the entire requested dataset is opened first, and the daily subsets are then retrieved from it:

```python
dataset = copernicusmarine.open_dataset(
    dataset_id=dataset_id,
    variables=[varkey],
    minimum_longitude=longitude_min,
    maximum_longitude=longitude_max,
    minimum_latitude=latitude_min,
    maximum_latitude=latitude_max,
    start_datetime=date_min,
    end_datetime=date_max,
)
Path(dir_output).mkdir(parents=True, exist_ok=True)
if freq is None:
    # write the full requested period to a single file
    date_str = f"{date_min.strftime('%Y%m%d')}_{date_max.strftime('%Y%m%d')}"
    name_output = f'{file_prefix}{varkey}_{date_str}.nc'
    output_filename = Path(dir_output, name_output)
    if output_filename.is_file() and not overwrite:
        print(f'"{name_output}" found and overwrite=False, returning.')
        return
    print(f'xarray writing netcdf file: {name_output}')
    dataset.to_netcdf(output_filename)
else:
    # write one file per period (daily by default), subsetted from the single opened dataset
    period_range = pd.period_range(date_min, date_max, freq=freq)
    for date in period_range:
        date_str = str(date)
        name_output = f'{file_prefix}{varkey}_{date_str}.nc'
        output_filename = Path(dir_output, name_output)
        if output_filename.is_file() and not overwrite:
            print(f'"{name_output}" found and overwrite=False, continuing.')
            continue
        dataset_perperiod = dataset.sel(time=slice(date_str, date_str))
        print(f'xarray writing netcdf file: {name_output}')
        dataset_perperiod.to_netcdf(output_filename)
```
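A possible direction for the improvement (a minimal sketch, not the current dfm_tools implementation) would be to open a small per-period dataset inside the loop instead of subsetting one dataset that spans the full request. The variable names (`dataset_id`, `varkey`, `longitude_min`, `dir_output`, `file_prefix`, `overwrite`, etc.) follow the excerpt above, and the monthly `freq="M"` is only an example; whether this actually reduces the per-call overhead would still need to be verified against the copernicusmarine client:

```python
# Sketch only: per-period open_dataset calls instead of one call for the full range.
period_range = pd.period_range(date_min, date_max, freq="M")  # monthly chunks as an example
for period in period_range:
    name_output = f"{file_prefix}{varkey}_{period}.nc"
    output_filename = Path(dir_output, name_output)
    if output_filename.is_file() and not overwrite:
        print(f'"{name_output}" found and overwrite=False, continuing.')
        continue
    dataset_period = copernicusmarine.open_dataset(
        dataset_id=dataset_id,
        variables=[varkey],
        minimum_longitude=longitude_min,
        maximum_longitude=longitude_max,
        minimum_latitude=latitude_min,
        maximum_latitude=latitude_max,
        # restrict the opened dataset to this period only
        start_datetime=period.to_timestamp(),
        end_datetime=period.to_timestamp(how="end"),
    )
    print(f'xarray writing netcdf file: {name_output}')
    dataset_period.to_netcdf(output_filename)
```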

The example below shows that splitting the request into monthly chunks makes the download much faster than requesting the full period at once:

```python
import dfm_tools as dfmt
import pandas as pd

# spatial extents
lon_min, lon_max, lat_min, lat_max = 12.5, 16.5, 34.5, 37

# time extents
date_min = '2017-12-01'
date_max = '2022-07-31'

# option 1: list of start/stop times (tuples) with monthly frequency
# this approach improves performance significantly
date_range_start = pd.date_range(start=date_min, end=date_max, freq='MS')
date_range_end = pd.date_range(start=date_min, end=date_max, freq='ME')
monthly_periods = [(start, end) for start, end in zip(date_range_start, date_range_end)]

# option 2: a single start/stop tuple to download all at once (but still per day)
# this is the default behaviour and is slow; comment this line out to use option 1
monthly_periods = [(date_min, date_max)]

for period in monthly_periods:
    dfmt.download_CMEMS(varkey='uo',
                        longitude_min=lon_min, longitude_max=lon_max, latitude_min=lat_min, latitude_max=lat_max,
                        date_min=period[0], date_max=period[1],
                        dir_output=".", overwrite=True, dataset_id='med-cmcc-cur-rean-d')
```

Todo:
