directory search order for CMIP data sets #3162

Closed
rswamina opened this issue May 2, 2023 · 5 comments · Fixed by #3322


@rswamina
Contributor

rswamina commented May 2, 2023

Hi,

I was testing the recipe recipe_python.yml for the proposed ESMValTool tutorial this week on the JASMIN HPC and noticed that one of the datasets this recipe needs is possibly corrupted on CEDA's badc archive. I have reported this to CEDA but have a question about how to work around the error. Normally, if we find issues with the data, we download the data we need and run our recipes. In this case, however, because the file exists on the badc archive, ESMValTool repeatedly finds that file on badc first instead of the one I downloaded, and quits with the error message.

The error message is :

OSError: [Errno -51] NetCDF: Unknown file format: b'/badc/cmip5/data/cmip5/output1/BCC/bcc-csm1-1/historical/fx/atmos/fx/r0i0p0/v20110101/areacella/areacella_fx_bcc-csm1-1_historical_r0i0p0.nc'

I could not add both my download directory and the badc directory to the list of CMIP5 paths because they use different path formats (ESGF and BADC). The following changes to the CMIP5 entries in the config-user.yml file on JASMIN work around the error:

# Site-specific entries: JASMIN
# Uncomment the lines below to locate data on JASMIN.
auxiliary_data_dir: /gws/nopw/j04/esmeval/aux_data/AUX
rootpath:
  CMIP6: /badc/cmip6/data/CMIP6
  CMIP5: /gws/nopw/j04/esmeval/esmvaltool_tutorial_test_May_2023/ #/badc/cmip5/data/cmip5/output1
  CMIP3: /badc/cmip3_drs/data/cmip3/output
  OBS: /gws/nopw/j04/esmeval/obsdata-v2
  OBS6: /gws/nopw/j04/esmeval/obsdata-v2
  obs4MIPs: /gws/nopw/j04/esmeval/obsdata-v2
  ana4mips: /gws/nopw/j04/esmeval/obsdata-v2
  CORDEX: /badc/cordex/data/CORDEX/output
drs:
  CMIP6: BADC
  CMIP5: ESGF #BADC
  CMIP3: BADC
  CORDEX: BADC
  OBS: default
  OBS6: default
  obs4MIPs: default
  ana4mips: default

I suspect that this happens because the JASMIN path is searched by default. If the specified file is not found there, ESMValTool looks in the download directory, but if the file exists and is corrupted, it does not look further. Is my understanding correct, and is this the only workaround? The tutorial is on Thursday, and it would be good to know how best to handle this by then if possible.
Tagging both @valeriupredoi and @bouweandela for a response due to the time constraint. Apologies for the rush.
Thanks!
-Ranjini

@bouweandela
Member

bouweandela commented May 2, 2023

If the person in charge of administering the CMIP5 data on JASMIN does not have time to remove the corrupted file before the tutorial, I would recommend creating a copy of the recipe specifically for the tutorial and changing it so that it uses a different dataset.

@rswamina
Contributor Author

rswamina commented May 3, 2023

@valeriupredoi - I am posting the update from the CEDA help desk:

It seems that the v20110101 version of the dataset contains empty files for all fields:

-rw-r----- 1 badc open 0 Jan 13 2012 areacella/areacella_fx_bcc-csm1-1_historical_r0i0p0.nc
-rw-r----- 1 badc open 0 Jan 13 2012 orog/orog_fx_bcc-csm1-1_historical_r0i0p0.nc
-rw-r----- 1 badc open 0 Jan 13 2012 sftlf/sftlf_fx_bcc-csm1-1_historical_r0i0p0.nc

That version also does not exist on any other ESGF sites, so there is nowhere to get a good copy from. Also, this dataset is marked as complete in the CEDA archive so can't be removed or changed.

Instead, the version called v1 is in fact the latest version, and you will see this from the latest symlink:

/badc/cmip5/data/cmip5/output1/BCC/bcc-csm1-1/historical/fx/atmos/fx/r0i0p0/latest -> v1

Looking inside the v1 directory, the checksums match the copies on other ESGF sites.

So the full path to the areacella file you need is:

/badc/cmip5/data/cmip5/output1/BCC/bcc-csm1-1/historical/fx/atmos/fx/r0i0p0/v1/areacella/areacella_fx_bcc-csm1-1_historical_r0i0p0.nc
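For anyone who wants to check another directory tree for the same problem, here is a minimal sketch using only the Python standard library. It lists zero-byte `.nc` files below a directory and reports which version directory a `latest` symlink points to. The function names `find_empty_netcdf` and `resolve_latest` are made up for this illustration and are not part of ESMValTool or any CEDA tooling.

```python
from pathlib import Path


def find_empty_netcdf(root):
    """Return the sorted paths of zero-byte .nc files below root."""
    root = Path(root)
    return sorted(
        path for path in root.rglob("*.nc")
        if path.is_file() and path.stat().st_size == 0
    )


def resolve_latest(ensemble_dir):
    """Return the name of the version directory that 'latest' points to.

    Returns None when ensemble_dir has no 'latest' symlink.
    """
    link = Path(ensemble_dir) / "latest"
    if not link.is_symlink():
        return None
    return link.resolve().name
```

Calling `find_empty_netcdf` on the `r0i0p0` directory above and comparing the result with `resolve_latest` for the same directory would show whether the empty files sit in a version other than the one marked as latest.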

@bouweandela
Member

bouweandela commented May 4, 2023

In that case, you may be able to solve the problem by specifying the version in the recipe. Can you try replacing

datasets:
  - {dataset: bcc-csm1-1, project: CMIP5, exp: historical, ensemble: r1i1p1}

by

datasets:
  - {dataset: bcc-csm1-1, version: v1, project: CMIP5, exp: historical, ensemble: r1i1p1}

@rswamina
Contributor Author

rswamina commented May 4, 2023

Hi @bouweandela - This works too. Thanks!

However, in the future if a file gets corrupted on a machine or is empty for whatever reason, should we have the option of choosing our local copy? Can we force the option of picking the downloaded copy or is that not a desirable feature/option to have?

@bouweandela
Member

The idea of being able to blacklist files that are somehow wrong but cannot be controlled by the user has been discussed before, though I cannot find the issue now. It seems useful, and maybe it could be implemented as part of a wider effort to improve how the data finding is configured. See ESMValGroup/ESMValCore#1894 (comment) for a proposal and some previous discussion.
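To make the blacklist idea concrete, here is a rough sketch of how such a filter could behave. It is purely hypothetical: `select_usable_files` and its `ignored` argument are invented for this illustration and do not exist in ESMValCore. It assumes the data finder has already produced a list of candidate file paths.

```python
from pathlib import Path


def select_usable_files(candidates, ignored=()):
    """Keep candidates that exist, are non-empty, and are not ignored.

    Hypothetical helper: 'ignored' stands in for a user-supplied
    blacklist of files known to be bad.
    """
    ignored = {Path(p) for p in ignored}
    usable = []
    for candidate in map(Path, candidates):
        if candidate in ignored:
            continue  # explicitly blacklisted by the user
        if not candidate.is_file() or candidate.stat().st_size == 0:
            continue  # missing or empty, such as the v20110101 files
        usable.append(candidate)
    return usable
```

With a filter like this, an empty file on the archive would be skipped and a valid downloaded copy later in the candidate list would still be picked up, in the original order.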
