Ingesting ERA5

Here, we document the steps for acquiring and pre-processing raw ERA5 data for cloud optimization. This directory includes configuration files that describe the data and let you acquire it expediently.

Downloading raw data from Copernicus

All data can be ingested from Copernicus with google-weather-tools, specifically weather-dl (see weather-tools.readthedocs.io).

Pre-requisites:

  1. Install the weather tools, version 0.3.1 or later:

    pip install "google-weather-tools>=0.3.1"
  2. Acquire one or more licenses from Copernicus.

    Recommended: Download configs allow users to specify multiple API keys in a single data request via "parameter subsections" (see the sketch after this list). We highly recommend that institutions pool licenses together for faster downloads.

  3. Set up a cloud project with sufficient permissions to use cloud storage (such as GCS) and a Beam runner (such as Dataflow).

    Note: Other cloud systems, such as Amazon S3 and Elastic MapReduce (EMR), should work too. However, these are untested. If you experience an error here, please let us know by filing an issue.
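
For orientation, the parameters section of a download config looks roughly like the minimal sketch below. The dataset-specific keys are already filled in by the config files in this directory, so you normally only touch target_path and the license subsections; the subsection names, bucket path, and credentials shown here are all placeholders, and the exact schema is defined in the weather-dl docs.

    [parameters]
    # Point downloads at your own bucket (placeholder path).
    target_path=gs://<your-bucket>/era5/...

    # "Parameter subsections": one per CDS license. The names are arbitrary placeholders.
    [parameters.license-a]
    api_url=<CDS-API-URL>
    api_key=<CDS-API-KEY>

    [parameters.license-b]
    api_url=<CDS-API-URL>
    api_key=<CDS-API-KEY>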

Steps:

  1. Update the parameters section of the desired config file, e.g. raw/era5_ml_dv.cfg, with the appropriate information.

    1. First, update the target_path to point to the right cloud bucket.
    2. Add one or more CDS API keys, as described in the weather-dl documentation.
  2. (optional, recommended) Preview the download with a dry run:

    weather-dl raw/era5_ml_dv.cfg --dry-run 
  3. Once the config looks sound, execute the download on your preferred Beam runner, for example, the Apache Spark runner. We ingested the data with GCP's Dataflow runner, like so:

    export PROJECT=<your-project-id>
    export BUCKET=<your-gcs-bucket>
    export REGION=us-central1
    weather-dl raw/era5_ml_dv.cfg \
     --runner DataflowRunner \
     --project $PROJECT \
     --region $REGION \
     --temp_location "gs://$BUCKET/tmp/" \
     --disk_size_gb 75 \
     --job_name era5-ml-dv

    If you'd like to download the data locally, you can run the following, though this isn't recommended (the data is large!):

    weather-dl raw/era5_ml_dv.cfg --local-run

    Check out the weather-dl docs for more information.

  4. Repeat for the rest of the config files; one way to script this is sketched after this list.
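
Here is a minimal sketch of step 4, assuming the remaining download configs all match raw/era5_*.cfg and that the PROJECT, BUCKET, and REGION variables from step 3 are still exported; the job-name derivation is just one convenient convention:

    # Hypothetical loop over every download config in raw/; flags mirror step 3.
    for cfg in raw/era5_*.cfg; do
      job_name="$(basename "$cfg" .cfg | tr '_' '-')"  # e.g. era5_ml_dv.cfg -> era5-ml-dv
      weather-dl "$cfg" \
        --runner DataflowRunner \
        --project $PROJECT \
        --region $REGION \
        --temp_location "gs://$BUCKET/tmp/" \
        --disk_size_gb 75 \
        --job_name "$job_name"
    done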

Preparing grib data for conversion

Grib is an idiosyncratic format. For example, a single grib file can contain multiple level types, standard table versions, or grids. This often makes grib files difficult to open. The system we've employed to convert data to Zarr, Pangeo Forge Recipes, is not (yet) able to handle this complexity. Thus, to prepare the raw data for conversion, we need to perform one additional processing step: splitting grib files by variable. This can be done with google-weather-tools, specifically weather-sp (see weather-tools.readthedocs.io).

The only datasets we needed to split by variable are soil and pcp, since they mix levels and table versions. These steps will prepare the data for conversion by scripts in the src/ directory.
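
If you want to see this mixing directly, the sketch below is one way to do so; it assumes you have gsutil and ecCodes (for grib_ls) installed locally, plus read access to the raw bucket:

    # Grab one raw soil file; the pattern matches the splitter's --input-pattern below.
    sample="$(gsutil ls 'gs://gcp-public-data-arco-era5/raw/ERA5GRIB/HRES/Month/**/*_hres_soil.grb2' | head -n 1)"
    gsutil cp "$sample" ./sample_soil.grb2
    # Print the variable, level type, and GRIB edition of every message in the file.
    grib_ls -p shortName,typeOfLevel,editionNumber sample_soil.grb2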

Pre-requisites:

  1. Install the weather tools, version 0.3.0 or later:
    pip install "google-weather-tools>=0.3.0"
  2. Acquire read access to the datasets (e.g., those downloaded via era5_sfc_soil.cfg) in some cloud storage bucket; a quick access check is sketched after this list.
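
The following sketch confirms the raw inputs are readable and gives a sense of scale; it assumes gsutil and the bucket layout used in the steps below, so adjust the paths if your data lives elsewhere:

    # Count readable raw files per dataset; the patterns mirror --input-pattern below.
    for DATASET in soil pcp; do
      echo "$DATASET: $(gsutil ls "gs://gcp-public-data-arco-era5/raw/ERA5GRIB/HRES/Month/**/*_hres_$DATASET.grb2" | wc -l) files"
    done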

Steps:

  1. Preview the data split by running the following command. Make sure to change the file paths if the data locations differ.
    export DATASET=soil
    weather-sp --input-pattern "gs://gcp-public-data-arco-era5/raw/ERA5GRIB/HRES/Month/**/*_hres_$DATASET.grb2" \
      --output-template "gs://gcp-public-data-arco-era5/raw/ERA5GRIB/HRES/Month/{1}/{0}.grb2_{typeOfLevel}_{shortName}.grib" \
      --dry-run
  2. Execute the data split on your preferred Beam runner. For example, here are the arguments to run the splitter on Dataflow:
    export DATASET=soil
    
    export PROJECT=<your-project>
    export BUCKET=<your-bucket>
    export REGION=us-central1
    
    weather-sp --input-pattern "gs://gcp-public-data-arco-era5/raw/ERA5GRIB/HRES/Month/**/*_hres_$DATASET.grb2" \
      --output-template "gs://gcp-public-data-arco-era5/raw/ERA5GRIB/HRES/Month/{1}/{0}.grb2_{typeOfLevel}_{shortName}.grib" \
      --runner DataflowRunner \
      --project $PROJECT \
      --region $REGION \
      --temp_location gs://$BUCKET/tmp \
      --disk_size_gb 100 \
      --job_name split-soil-data
  3. Repeat this process, except change the dataset to pcp (a spot-check of the split outputs is sketched after this list):
    export DATASET=pcp
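
Once both jobs finish, you can spot-check the results by listing a few of the files produced by the output template; this sketch assumes gsutil and the output location shown above, so adjust the bucket if you wrote the splits elsewhere:

    # The pattern is derived from the output template, which appends _{typeOfLevel}_{shortName}.grib
    gsutil ls "gs://gcp-public-data-arco-era5/raw/ERA5GRIB/HRES/Month/**/*.grb2_*.grib" | head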
    

Data Validation Script

This script is designed to validate data files in the "gcp-public-data-arco-era5" Google Cloud Storage bucket. It checks that the required files exist for the specified years and reports any extra files present in the bucket.

Pre-requisites:

  • Python 3.x installed
  • Required Python packages installed (e.g., fsspec, pandas); see the install sketch after this list
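
A minimal install sketch; gcsfs is an assumption here, included because fsspec typically needs it to read gs:// paths:

    # fsspec and pandas are named above; gcsfs is assumed for gs:// support via fsspec.
    pip install fsspec gcsfs pandas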

Configuration:

You can modify the configuration in the script:

  • BUCKET: The name of the Google Cloud Storage bucket containing the data.
  • START_YEAR and END_YEAR: Define the range of years to validate.

Steps:

  1. Run the script with Python:

    python raw/gcs_data_consistency_checker.py