Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Address "flaky" ts subtask in complete_run test #549

Closed
3 tasks done
forsyth2 opened this issue Feb 17, 2024 · 39 comments
Closed
3 tasks done

Address "flaky" ts subtask in complete_run test #549

forsyth2 opened this issue Feb 17, 2024 · 39 comments
Labels
Testing Files in `tests` modified

Comments

@forsyth2
Copy link
Collaborator

Request criteria

  • I searched the zppy GitHub Discussions to find a similar question and didn't find it.
  • I searched the zppy documentation.
  • This issue does not match the other templates (i.e., it is not a bug report, documentation request, feature request, or a question.)

Issue description

The complete_run integration test is "flaky" (e.g., as in https://docs.gitlab.com/ee/development/testing_guide/flaky_tests.html). Specifically, the [[ ts ]] > [[ land_monthly ]] subtask.

I was able to get the test passing while testing #424. However, it appears I never ran the test straight-through and instead did several re-runs, which actually had the effect of creating a different final output than if it had been run straight through.

I came across this while testing #548. The subtask was suddenly failing despite no relevant changes being made. I then tested on the main branch (specifically at a902549) and encountered the same errors. I found the following.

Run 1

Relevant sections from tests/integration/generated/test_complete_run_chrysalis.cfg:

[default]
case = v2.LR.historical_0201
constraint = ""
dry_run = "False"
environment_commands = ""
input = "/lcrc/group/e3sm/ac.forsyth2//E3SMv2/v2.LR.historical_0201"
input_subdir = archive/atm/hist
mapping_file = "map_ne30pg2_to_cmip6_180x360_aave.20200201.nc"
# To run this test, edit `output` and `www` in this file, along with `actual_images_dir` in test_complete_run.py
output = "/lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/test-main2-20240216/v2.LR.historical_0201"
partition = "debug"
qos = "regular"
www = "/lcrc/group/e3sm/public_html/diagnostic_output/ac.forsyth2/zppy_test_complete_run_www/test-main2-20240216"

[ts]
active = True
walltime = "00:30:00"
years = "1850:1854:2",

  [[ land_monthly ]]
  extra_vars = "landfrac"
  frequency = "monthly"
  input_files = "elm.h0"
  input_subdir = "archive/lnd/hist"
  vars = "FSH,RH2M"
  ts_fmt = "cmip"

[ilamb]
active = True
grid = '180x360_aave'
short_name = 'v2.LR.historical_0201'
ts_num_years = 2
years = "1850:1854:2",

Then:

$ cd /lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/test-main2-20240216/v2.LR.historical_0201/post/scripts
$ grep -v "OK" *status
ilamb_1850-1851.status:WAITING 472320
ilamb_1852-1853.status:WAITING 472321
ts_land_monthly_1850-1851-0002.status:ERROR (4)
ts_land_monthly_1852-1853-0002.status:ERROR (4)
# The ILAMB tasks are waiting for the failed `ts_land_monthly` jobs.
$ tail -n 16 ts_land_monthly_1850-1851-0002.o472316 
2024-02-17 00:43:34,975 [WARNING]: utils.py(derive_handlers:218) >> No handlers could be derived for the variables: ['mrsos', 'mrso', 'mrfso', 'mrros', 'mrro', 'prveg', 'evspsblveg', 'evspsblsoi', 'tran', 'tsl', 'lai', 'cLitter', 'cProduct', 'cSoilFast', 'cSoilMedium', 'cSoilSlow', 'fFire', 'fHarvest', 'cVeg', 'nbp', 'gpp', 'ra', 'rh']. Make sure the input E3SM datasets have the variables needed derivation.
2024-02-17 00:43:34,975 [WARNING]: utils.py(derive_handlers:218) >> No handlers could be derived for the variables: ['mrsos', 'mrso', 'mrfso', 'mrros', 'mrro', 'prveg', 'evspsblveg', 'evspsblsoi', 'tran', 'tsl', 'lai', 'cLitter', 'cProduct', 'cSoilFast', 'cSoilMedium', 'cSoilSlow', 'fFire', 'fHarvest', 'cVeg', 'nbp', 'gpp', 'ra', 'rh']. Make sure the input E3SM datasets have the variables needed derivation.
2024-02-17 00:43:34,975_975:WARNING:derive_handlers:No handlers could be derived for the variables: ['mrsos', 'mrso', 'mrfso', 'mrros', 'mrro', 'prveg', 'evspsblveg', 'evspsblsoi', 'tran', 'tsl', 'lai', 'cLitter', 'cProduct', 'cSoilFast', 'cSoilMedium', 'cSoilSlow', 'fFire', 'fHarvest', 'cVeg', 'nbp', 'gpp', 'ra', 'rh']. Make sure the input E3SM datasets have the variables needed derivation.
2024-02-17 00:43:34,975 [INFO]: __main__.py(_get_handlers:220) >> --------------------------------------
2024-02-17 00:43:34,975 [INFO]: __main__.py(_get_handlers:220) >> --------------------------------------
2024-02-17 00:43:34,975_975:INFO:_get_handlers:--------------------------------------
2024-02-17 00:43:34,975 [INFO]: __main__.py(_get_handlers:221) >> | Derived CMIP6 Variable Handlers
2024-02-17 00:43:34,975 [INFO]: __main__.py(_get_handlers:221) >> | Derived CMIP6 Variable Handlers
2024-02-17 00:43:34,975_975:INFO:_get_handlers:| Derived CMIP6 Variable Handlers
2024-02-17 00:43:34,975 [INFO]: __main__.py(_get_handlers:222) >> --------------------------------------
2024-02-17 00:43:34,975 [INFO]: __main__.py(_get_handlers:222) >> --------------------------------------
2024-02-17 00:43:34,975_975:INFO:_get_handlers:--------------------------------------
2024-02-17 00:43:34,975 [ERROR]: __main__.py(_get_handlers:230) >> No CMIP6 variable handlers were derived from the variables found in using the E3SM input datasets.
2024-02-17 00:43:34,975 [ERROR]: __main__.py(_get_handlers:230) >> No CMIP6 variable handlers were derived from the variables found in using the E3SM input datasets.
2024-02-17 00:43:34,975_975:ERROR:_get_handlers:No CMIP6 variable handlers were derived from the variables found in using the E3SM input datasets.
srun: error: chr-0503: task 0: Exited with exit code 1
$ tail -n 16 ts_land_monthly_1852-1853-0002.o472317 
2024-02-17 00:43:34,975 [WARNING]: utils.py(derive_handlers:218) >> No handlers could be derived for the variables: ['mrsos', 'mrso', 'mrfso', 'mrros', 'mrro', 'prveg', 'evspsblveg', 'evspsblsoi', 'tran', 'tsl', 'lai', 'cLitter', 'cProduct', 'cSoilFast', 'cSoilMedium', 'cSoilSlow', 'fFire', 'fHarvest', 'cVeg', 'nbp', 'gpp', 'ra', 'rh']. Make sure the input E3SM datasets have the variables needed derivation.
2024-02-17 00:43:34,975 [WARNING]: utils.py(derive_handlers:218) >> No handlers could be derived for the variables: ['mrsos', 'mrso', 'mrfso', 'mrros', 'mrro', 'prveg', 'evspsblveg', 'evspsblsoi', 'tran', 'tsl', 'lai', 'cLitter', 'cProduct', 'cSoilFast', 'cSoilMedium', 'cSoilSlow', 'fFire', 'fHarvest', 'cVeg', 'nbp', 'gpp', 'ra', 'rh']. Make sure the input E3SM datasets have the variables needed derivation.
2024-02-17 00:43:34,975_975:WARNING:derive_handlers:No handlers could be derived for the variables: ['mrsos', 'mrso', 'mrfso', 'mrros', 'mrro', 'prveg', 'evspsblveg', 'evspsblsoi', 'tran', 'tsl', 'lai', 'cLitter', 'cProduct', 'cSoilFast', 'cSoilMedium', 'cSoilSlow', 'fFire', 'fHarvest', 'cVeg', 'nbp', 'gpp', 'ra', 'rh']. Make sure the input E3SM datasets have the variables needed derivation.
2024-02-17 00:43:34,975 [INFO]: __main__.py(_get_handlers:220) >> --------------------------------------
2024-02-17 00:43:34,975 [INFO]: __main__.py(_get_handlers:220) >> --------------------------------------
2024-02-17 00:43:34,975_975:INFO:_get_handlers:--------------------------------------
2024-02-17 00:43:34,975 [INFO]: __main__.py(_get_handlers:221) >> | Derived CMIP6 Variable Handlers
2024-02-17 00:43:34,975 [INFO]: __main__.py(_get_handlers:221) >> | Derived CMIP6 Variable Handlers
2024-02-17 00:43:34,975_975:INFO:_get_handlers:| Derived CMIP6 Variable Handlers
2024-02-17 00:43:34,975 [INFO]: __main__.py(_get_handlers:222) >> --------------------------------------
2024-02-17 00:43:34,975 [INFO]: __main__.py(_get_handlers:222) >> --------------------------------------
2024-02-17 00:43:34,975_975:INFO:_get_handlers:--------------------------------------
2024-02-17 00:43:34,975 [ERROR]: __main__.py(_get_handlers:230) >> No CMIP6 variable handlers were derived from the variables found in using the E3SM input datasets.
2024-02-17 00:43:34,975 [ERROR]: __main__.py(_get_handlers:230) >> No CMIP6 variable handlers were derived from the variables found in using the E3SM input datasets.
2024-02-17 00:43:34,975_975:ERROR:_get_handlers:No CMIP6 variable handlers were derived from the variables found in using the E3SM input datasets.
srun: error: chr-0504: task 0: Exited with exit code 1

Run 2

I changed vars = "FSH,RH2M" to vars = "LAISHA,LAISUN" and re-ran (without deleting the output directory, meaning only failing jobs were re-ran).

$ cd /lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/test-main2-20240216/v2.LR.historical_0201/post/scripts
$ grep -v "OK" *status
ilamb_1850-1851.status:WAITING 472320
ilamb_1852-1853.status:WAITING 472321
ts_land_monthly_1850-1851-0002.status:ERROR (5)
# The ILAMB tasks didn't run because `zppy` ignores "WAITING" tasks.
# ts_land_monthly_1850-1851-0002 succeeds now!
$ tail -n 30 ts_land_monthly_1850-1851-0002.o472322
2024-02-17 00:51:37,787 [INFO]: handler.py(cmorize:209) >> lai: Starting CMORizing
2024-02-17 00:51:37,787 [INFO]: handler.py(cmorize:209) >> lai: Starting CMORizing
2024-02-17 00:51:37,787_787:INFO:cmorize:lai: Starting CMORizing
  0%|          | 0/1 [00:00<?, ?it/s]2024-02-17 00:51:37,974 [INFO]: handler.py(_setup_cmor_module:333) >> lai: CMOR setup complete
2024-02-17 00:51:37,974 [INFO]: handler.py(_setup_cmor_module:333) >> lai: CMOR setup complete
2024-02-17 00:51:37,974_974:INFO:_setup_cmor_module:lai: CMOR setup complete
2024-02-17 00:51:37,975 [INFO]: handler.py(cmorize:239) >> lai: loading E3SM variables dict_keys(['LAISHA', 'LAISUN'])
2024-02-17 00:51:37,975 [INFO]: handler.py(cmorize:239) >> lai: loading E3SM variables dict_keys(['LAISHA', 'LAISUN'])
2024-02-17 00:51:37,975_975:INFO:cmorize:lai: loading E3SM variables dict_keys(['LAISHA', 'LAISUN'])
2024-02-17 00:51:38,180 [INFO]: handler.py(cmorize:247) >> lai: creating CMOR variable with CMOR axis objects.
2024-02-17 00:51:38,180 [INFO]: handler.py(cmorize:247) >> lai: creating CMOR variable with CMOR axis objects.
2024-02-17 00:51:38,180_180:INFO:cmorize:lai: creating CMOR variable with CMOR axis objects.
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2_chrysalis/lib/python3.10/site-packages/e3sm_to_cmip/__main__.py", line 912, in _run_parallel
    out = res.result()
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2_chrysalis/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2_chrysalis/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
100%|██████████| 1/1 [00:00<00:00,  2.03it/s]
2024-02-17 00:51:38,329 [INFO]: __main__.py(_run_parallel:930) >> 0 of 1 handlers complete
2024-02-17 00:51:38,329 [INFO]: __main__.py(_run_parallel:930) >> 0 of 1 handlers complete
2024-02-17 00:51:38,329_329:INFO:_run_parallel:0 of 1 handlers complete
2024-02-17 00:51:38,329 [ERROR]: __main__.py(_run_parallel:934) >> lai failed to complete
2024-02-17 00:51:38,329 [ERROR]: __main__.py(_run_parallel:934) >> lai failed to complete
2024-02-17 00:51:38,329_329:ERROR:_run_parallel:lai failed to complete
2024-02-17 00:51:38,329 [ERROR]: __main__.py(_run_parallel:935) >> 0 of 1 handlers complete
2024-02-17 00:51:38,329 [ERROR]: __main__.py(_run_parallel:935) >> 0 of 1 handlers complete
2024-02-17 00:51:38,329_329:ERROR:_run_parallel:0 of 1 handlers complete
'LAISUN'
mv: cannot stat '/lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/test-main2-20240216/v2.LR.historical_0201/post/lnd/180x360_aave/cmip_ts/monthly/tmp_ts_land_monthly_1850-1851-0002/CMIP6/CMIP/*/*/*/*/*/*/*/*/*.nc': No such file or directory

A similar cannot stat with CMIP6/CMIP line appears in issues #523 (comment), #543 and in PRs #424 (comment), #533, #534 (comment).

Run 3

I re-ran (without deleting the output directory, meaning only failing jobs were re-ran), without changing anything.

$ cd /lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/test-main2-20240216/v2.LR.historical_0201/post/scripts
$ grep -v "OK" *status
ilamb_1850-1851.status:WAITING 472320
ilamb_1852-1853.status:WAITING 472321
# The second `ts_land_monthly` subtask worked this time!

Run 4

I deleted the ilamb status files and re-ran (without deleting the output directory, meaning only the status-file-missing ILAMB jobs were re-ran)

$ cd /lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/test-main2-20240216/v2.LR.historical_0201/post/scripts
$ grep -v "OK" *status
# No failures! ILAMB worked too, now that the ts_land_monthly dependencies succeeded.

Conclusions

The ts_land_monthly subtask is flaky. Looking at https://docs.gitlab.com/ee/development/testing_guide/flaky_tests.html, it is flaky in the sense that there is a state leak.

=> While it is useful to have zppy only re-run the failed jobs, this does not in fact guarantee that all jobs would have succeeded had the latest cfg been used from the start!

=> All pre-PR-merging zppy runs should use a clean output directory, in order to produce valid integration test results.

Remaining questions

  1. So, what went wrong with testing Native EAMxx support #424? Setting vars = "FSH,LAISHA,LAISUN,RH2M" caused an error (Native EAMxx support #424 (comment)) which was actually equivalent to the error in Run 2 above. Rather than re-running, I removed the erroring variables LAISHA,LAISUN (Native EAMxx support #424 (comment)) and got the job to work. But that doesn't make sense because that means Native EAMxx support #424 testing worked with vars = "FSH,RH2M", which clearly causes errors in Run 1 above.

  2. Why did Run 3 work simply by re-running without changing anything? I have a vague memory of there potentially being some sort of race or lock condition in accessing the files in post/.... Could that be what's going on? Could the two ts_land_monthly jobs possibly be interacting with each other in any way on Run 1?

@forsyth2 forsyth2 added the Testing Files in `tests` modified label Feb 17, 2024
@forsyth2
Copy link
Collaborator Author

@tomvothecoder Do you happen to have any thoughts on this? Unfortunately, integration tests are more state-dependent than unit tests, and on top of that, zppy itself is a highly state-dependent tool.

@forsyth2
Copy link
Collaborator Author

forsyth2 commented Feb 17, 2024

Re: remaining question (2), seeing concurrent/futures/_base.py in the Run 2 stack trace indeed seems to point to signs of a race condition somehow...

@chengzhuzhang
Copy link
Collaborator

chengzhuzhang commented Feb 20, 2024

@forsyth2 I'm trying to study what is a flaky test. It is for repeating a same test many times, but get different results. In your cases, among the 4 runs, which are the same tests? I might be a little confused.

@forsyth2
Copy link
Collaborator Author

I suppose I'm using the term somewhat loosely. My point is that the complete_run test can be made to complete successfully if the cfg is run multiple times, but will fail if run only once. Notably, nothing changes in the cfg at all for run 2->3 and run 3->4.

@chengzhuzhang
Copy link
Collaborator

@forsyth2 I think I found a bug when testing the configuration file copied from #549 (comment).
With this configuration file, I got ilamb jobs skipped, because ts_atm_monthly_180x360_aave_1850-1851-0002.status status files are missing. By including an atm ts task would bypass this error. But Ilamb run should not require atm variables.

Configuration file validation passed.
ts_land_monthly_1850-1851-0002
...Submitted batch job 474065
ts_land_monthly_1852-1853-0002
...Submitted batch job 474066
ilamb_1850-1851
...skipping because of dependency status file missing
   /lcrc/group/e3sm/ac.zhang40/zppy_test_complete_run_output/test-main2-20240216/v2.LR.historical_0201_try4/post/scripts/ts_atm_monthly_180x360_aave_1850-1851-0002.status
URL: https://web.lcrc.anl.gov/public/e3sm/diagnostic_output/ac.zhang40/zppy_test_complete_run_www/try4/v2.LR.historical_0201/ilamb
ilamb_1852-1853
...skipping because of dependency status file missing
   /lcrc/group/e3sm/ac.zhang40/zppy_test_complete_run_output/test-main2-20240216/v2.LR.historical_0201_try4/post/scripts/ts_atm_monthly_180x360_aave_1850-1851-0002.status
URL: https://web.lcrc.anl.gov/public/e3sm/diagnostic_output/ac.zhang40/zppy_test_complete_run_www/try4/v2.LR.historical_0201/ilamb

@chengzhuzhang
Copy link
Collaborator

chengzhuzhang commented Feb 20, 2024

@forsyth2 I think I figured what caused the flaky test documented here and potentially the bundle run test documented in
#543.

This is a bug in setting up e3sm_to_cmip runs in zppy. For e3sm_to_cmip it expects a single set of time series variables. While in this test, two sets of time series are presented in one file directory, EX

LAISUN_185201_185312.nc
LAISHA_185201_185312.nc
LAISUN_185001_185112.nc
LAISHA_185001_185112.nc

Whether the e3sm_to_cmip tasks fail or not, depends on the timing of the time series being generated. This bug is only revealed in the complete tests with this type of setting: years = "1850:1854:2". Running with a single time series will work just fine. ex years = "1850:1854:4"`.

@forsyth2
Copy link
Collaborator Author

forsyth2 commented Feb 21, 2024

@chengzhuzhang Thank you for identifying the problem. We'll have to set up separate subdirectories to avoid this.

$ git grep -n "cmip_ts" *
zppy/templates/ilamb.bash:49:lnd_ts_for_ilamb={{ output }}/post/lnd/180x360_aave/cmip_ts/monthly/
zppy/templates/ilamb.bash:50:atm_ts_for_ilamb={{ output }}/post/atm/180x360_aave/cmip_ts/monthly/
zppy/templates/ts.bash:145:  dest_cmip={{ output }}/post/{{ component }}/{{ grid }}/cmip_ts/{{ frequency }}

I think we would add a directory after cmip_ts, maybe the years, e.g., "1850-1851". Then the ILAMB task would be able to pick out the correct subdirectory too.

Note: another option is to force ts tasks using ts_fmt = "cmip" to run sequentially. But I don't think that's an ideal option. It's what we did with tc_analysis in

zppy/zppy/tc_analysis.py

Lines 88 to 91 in a902549

# Note that this line should still be executed even if jobid == -1
# The later tc_analysis tasks still depend on this task (and thus will also fail).
# Add to the dependency list
dependencies.append(statusFile)

and

zppy/zppy/e3sm_diags.py

Lines 244 to 250 in a902549

# Due to a `socket.gaierror: [Errno -2] Name or service not known` error when running e3sm_diags with tc_analysis
# on multiple year_sets, if tc_analysis is in sets, then e3sm_diags should be run sequentially.
if "tc_analysis" in c["sets"]:
# Note that this line should still be executed even if jobid == -1
# The later tc_analysis-using e3sm_diags tasks still depend on this task (and thus will also fail).
# Add to the dependency list
dependencies.append(statusFile)

@chengzhuzhang
Copy link
Collaborator

I tried out the solution to save data separately, unfortunately the problem persist...

@chengzhuzhang
Copy link
Collaborator

And I can provide a minimum example that runs e3sm_to_cmip standalone, but won't be able to reproduce the errors.

e3sm_to_cmip --output-path ~/test -v 'lai' --realm lnd --input-path /lcrc/group/e3sm/ac.zhang40/zppy_test_complete_run_output/test-main2-20240216/v2.LR.historical_0201_try7/post/lnd/180x360_aave/ts/monthly/2yr --user-metadata ~/zppy/zppy/templates/e3sm_to_cmip/default_metadata.json --tables-path /lcrc/group/e3sm/diagnostics/cmip6-cmor-tables/Tables --num-proc 12

@forsyth2
Copy link
Collaborator Author

I tried out the solution to save data separately, unfortunately the problem persist...

I tried as well and indeed it persists.

Unfortunately, the concurrency error message from e3sm_to_cmip is not terribly helpful. Would we be ok with forcing the ts tasks to run sequentially? The ts tasks are not a significant time sink, unlike say E3SM Diags, so I don't think it would greatly increase zppy's run time.

@forsyth2
Copy link
Collaborator Author

forsyth2 commented Feb 21, 2024

Actually, I'm not even sure running sequentially will help. I tried rerunning zppy and still get the same error. Then, I tried running with sbatch directly (which worked in #548 (comment)), but it failed this time.

Is it possible the error is in e3sm_to_cmip's handling of concurrency, rather than how zppy calls e3sm_to_cmip?

@chengzhuzhang
Copy link
Collaborator

@tomvothecoder could you please help look at this issue. At this point, we think the zppy issue is caused by how e3sm_to_cmip handles concurrency. The error can't be reproduced with single e3sm_to_cmip runs.

@tomvothecoder
Copy link
Collaborator

@tomvothecoder could you please help look at this issue. At this point, we think the zppy issue is caused by how e3sm_to_cmip handles concurrency. The error can't be reproduced with single e3sm_to_cmip runs.

Sure I can take a look. What is the priority for this issue and did we want to get it fixed before E3SM-Unified 1.9.3? If I understand correctly, E3SM-Unified 1.9.3 will probably be delayed for further zppy testing and debugging.

@chengzhuzhang
Copy link
Collaborator

chengzhuzhang commented Feb 21, 2024

@tomvothecoder thanks. Yes, addressing this issue would be a high priority for the E3SM-Unified 1.9.3 release. Ryan and I have minimal examples to test e3sm_to_cmip standalone or through zppy. We can chat off line for speeding up testing.

@forsyth2
Copy link
Collaborator Author

Thanks @tomvothecoder!!

@tomvothecoder
Copy link
Collaborator

tomvothecoder commented Feb 21, 2024

@tomvothecoder thanks. Yes, addressing this issue would be a high priority for the E3SM-Unified 1.9.3 release. Ryan and I have minimal examples to test e3sm_to_cmip standalone or through zppy. We can chat off line for speeding up testing.

Got it, I'll focus on it tomorrow and let you know how it goes. I do have AWS on Friday, so hopefully I make some good progress by then.

@forsyth2
Copy link
Collaborator Author

@tomvothecoder If it's useful, I tried doing some more debugging in my testing-updates PR (#554), but it's mostly just giving the same errors I've already described in this issue.

@tomvothecoder
Copy link
Collaborator

tomvothecoder commented Feb 22, 2024

And I can provide a minimum example that runs e3sm_to_cmip standalone, but won't be able to reproduce the errors.

e3sm_to_cmip --output-path ~/test -v 'lai' --realm lnd --input-path /lcrc/group/e3sm/ac.zhang40/zppy_test_complete_run_output/test-main2-20240216/v2.LR.historical_0201_try7/post/lnd/180x360_aave/ts/monthly/2yr --user-metadata ~/zppy/zppy/templates/e3sm_to_cmip/default_metadata.json --tables-path /lcrc/group/e3sm/diagnostics/cmip6-cmor-tables/Tables --num-proc 12

@chengzhuzhang What is the absolute directory for ~/zppy/zppy/templates/e3sm_to_cmip/default_metadata.json? I'm making an e3sm_to_cmip .py script for debugging.

EDIT: Nevermind that's just the file in the zppy repo: https://github.com/E3SM-Project/zppy/blob/main/zppy/templates/e3sm_to_cmip/default_metadata.json, which looks like the same thing as what is in the e3sm_to_cmip repo https://github.com/E3SM-Project/e3sm_to_cmip/blob/master/e3sm_to_cmip/resources/default_metadata.json

@chengzhuzhang
Copy link
Collaborator

@tomvothecoder yes! It's my home directory where I git cloned zppy. And the copy was originally taken from e3sm_to_cmip.

@tomvothecoder
Copy link
Collaborator

tomvothecoder commented Feb 22, 2024

And I can provide a minimum example that runs e3sm_to_cmip standalone, but won't be able to reproduce the errors.

e3sm_to_cmip --output-path ~/test -v 'lai' --realm lnd --input-path /lcrc/group/e3sm/ac.zhang40/zppy_test_complete_run_output/test-main2-20240216/v2.LR.historical_0201_try7/post/lnd/180x360_aave/ts/monthly/2yr --user-metadata ~/zppy/zppy/templates/e3sm_to_cmip/default_metadata.json --tables-path /lcrc/group/e3sm/diagnostics/cmip6-cmor-tables/Tables --num-proc 12

@chengzhuzhang just like you, I can't reproduce this issue running e3sm_to_cmip directly. Both the serial and parallel runs of my test script succeed. I can't find the root cause since this issue happens through zppy. I might have a theory and will push something in e3sm_to_cmip to see if it fixes anything.

My Analysis

I noticed that e3sm_to_cmip failures in zppy occur after this log line: 2024-02-20 22:04:45,724_724:INFO:cmorize:lai: creating CMOR variable with CMOR axis objects.. This specific log line occurs after opening multi-file input datasets (.nc files). Here are the related lines of code in e3sm_to_cmip.

The errors mentioned below are related to I/O. Specifically, one of them is raised by Xarray.

RE: NetCDF: Not a valid ID error from this comment

2024-02-20 22:04:45,965_965:ERROR:_run_parallel:0 of 1 handlers complete
NetCDF: Not a valid ID
mv: cannot stat '/lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/pr-548-rebased-20240220/v2.LR.historical_0201/post/lnd/180x360_aave/cmip_ts/monthly/tmp_ts_land_monthly_1850-1851-0002/CMIP6/CMIP/*/*/*/*/*/*/*/*/*.nc': No such file or directory

I've seen this error before while testing xCDAT by looping over large multi-file datasets and applying computations on them.
This error seemed to appear at random (similar to how this test is "flaky"). Jason mentioned it could be Xarray not behaving well with the filesystem.

Possible cause: Maybe there are multiple processes trying to access the same input .nc file(s) at the same time and/or the input file(s) are closed too early or not at all?

Related Xarray GitHub comment about this issue:

When you use a with block with an xarray object, it closes any associated netCDF file when you leave the block. When you type dataset.mean(dim='round') the file is already closed.

For interactive use, it's usually preferable to just write dataset = xarray.open_dataset('snellman.nc') and close it explicitly later when/if desired.

Possible solution: Explicitly closing the xr.Dataset object after each loop in this code block

RE: "Run 2" in the PR description

2024-02-17 00:51:38,329_329:ERROR:_run_parallel:0 of 1 handlers complete
'LAISUN'
mv: cannot stat '/lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/test-main2-20240216/v2.LR.historical_0201/post/lnd/180x360_aave/cmip_ts/monthly/tmp_ts_land_monthly_1850-1851-0002/CMIP6/CMIP/*/*/*/*/*/*/*/*/*.nc': No such file or directory

Possible cause: I think this error happens because e3sm_to_cmip fails to call cmor.write() to write the output .nc file(s) needed for this command.

Possible solution: Maybe same as above solution

@tomvothecoder
Copy link
Collaborator

I just merged E3SM-Project/e3sm_to_cmip#244. Is there any way to test zppy with the latest master version of e3sm_to_cmip? That way we don't need to make a release for an unconfirmed fix.

@forsyth2
Copy link
Collaborator Author

Yes, it's possible but annoying. (Because e3sm_to_cmip is not called directly as a task, but rather through the ts task, environment_commands can't be used directly to just run a later version of e3sm_to_cmip. Rather, the bash files have to be manually edited. This tech debt is tracked as #467, but unfortunately I don't expect that to be addressed any time soon).

In any case, I will try to test it out. Thanks for working on this.

@forsyth2
Copy link
Collaborator Author

Actually looking at c939071, I suppose the easier-to-implement alternative is to just keep e3sm_to_cmip_environment_commands as a parameter. That somewhat pollutes the parameter space, but I guess that's ok.

@forsyth2
Copy link
Collaborator Author

@tomvothecoder I'm testing your changes over on #556, but I'm getting the following cdms2 error.

$ tail -n 40 ts_land_monthly_1850-1851-0002.o475528 
The pfull variable handler requires both cdms2 and cdutil
Traceback (most recent call last):
  File "/home/ac.forsyth2/miniconda3/envs/e3sm_to_cmip_20240222/bin/e3sm_to_cmip", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/ac.forsyth2/miniconda3/envs/e3sm_to_cmip_20240222/lib/python3.11/site-packages/e3sm_to_cmip/__main__.py", line 964, in main
    app = E3SMtoCMIP(args)
          ^^^^^^^^^^^^^^^^
  File "/home/ac.forsyth2/miniconda3/envs/e3sm_to_cmip_20240222/lib/python3.11/site-packages/e3sm_to_cmip/__main__.py", line 154, in __init__
    self.handlers = self._get_handlers()
                    ^^^^^^^^^^^^^^^^^^^^
  File "/home/ac.forsyth2/miniconda3/envs/e3sm_to_cmip_20240222/lib/python3.11/site-packages/e3sm_to_cmip/__main__.py", line 210, in _get_handlers
    handlers = derive_handlers(
               ^^^^^^^^^^^^^^^^
  File "/home/ac.forsyth2/miniconda3/envs/e3sm_to_cmip_20240222/lib/python3.11/site-packages/e3sm_to_cmip/cmor_handlers/utils.py", line 176, in derive_handlers
    handlers_by_var: Dict[str, List[Dict[str, Any]]] = _get_handlers_by_var()
                                                       ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ac.forsyth2/miniconda3/envs/e3sm_to_cmip_20240222/lib/python3.11/site-packages/e3sm_to_cmip/cmor_handlers/utils.py", line 269, in _get_handlers_by_var
    handlers_from_modules = _get_handlers_from_modules(LEGACY_HANDLER_DIR_PATH)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ac.forsyth2/miniconda3/envs/e3sm_to_cmip_20240222/lib/python3.11/site-packages/e3sm_to_cmip/cmor_handlers/utils.py", line 365, in _get_handlers_from_modules
    module = _get_handler_module(var, filepath)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ac.forsyth2/miniconda3/envs/e3sm_to_cmip_20240222/lib/python3.11/site-packages/e3sm_to_cmip/cmor_handlers/utils.py", line 404, in _get_handler_module
    module = SourceFileLoader(module_name, module_path).load_module()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap_external>", line 605, in _check_name_wrapper
  File "<frozen importlib._bootstrap_external>", line 1120, in load_module
  File "<frozen importlib._bootstrap_external>", line 945, in load_module
  File "<frozen importlib._bootstrap>", line 290, in _load_module_shim
  File "<frozen importlib._bootstrap>", line 721, in _load
  File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 940, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/home/ac.forsyth2/miniconda3/envs/e3sm_to_cmip_20240222/lib/python3.11/site-packages/e3sm_to_cmip/cmor_handlers/vars/pfull.py", line 20, in <module>
    raise e
  File "/home/ac.forsyth2/miniconda3/envs/e3sm_to_cmip_20240222/lib/python3.11/site-packages/e3sm_to_cmip/cmor_handlers/vars/pfull.py", line 16, in <module>
    import cdms2
ModuleNotFoundError: No module named 'cdms2'
srun: error: chr-0497: task 0: Exited with exit code 1

@chengzhuzhang
Copy link
Collaborator

chengzhuzhang commented Feb 23, 2024

@forsyth2 I believe cdms2 is no longer a e3sm_to_cmip dependency. By looking at the e3sm_to_cmip source code, i think somehow you are using an old version. Could you maybe try install from fresh?

I found some relevant issues: E3SM-Project/e3sm_to_cmip#115

@forsyth2
Copy link
Collaborator Author

forsyth2 commented Feb 23, 2024

@chengzhuzhang Yes, that's what I thought as well, but I just created a new dev environment off e3sm_to_cmip's latest master. I'll check again, maybe something went wrong with the environment creation.

@forsyth2
Copy link
Collaborator Author

@chengzhuzhang, No I'm still definitely getting the The pfull variable handler requires both cdms2 and cdutil error.

$ git checkout master
$ git fetch upstream
$ git reset --hard upstream/master
$ git log # Check Tom's latest commit is present
$ mamba clean --all
$ conda env create -f conda-env/dev.yml -n e3sm_to_cmip_20240223 # Did this with `mamba` yesterday
$ conda activate e3sm_to_cmip_20240223 
$ python -m pip install . # Did this with `pip install .` yesterday

That seems to match the directions at https://github.com/E3SM-Project/e3sm_to_cmip?tab=readme-ov-file#2-conda-development-environment-and-source-code. I was reading the thread on #115, but shouldn't that all be resolved now?

@chengzhuzhang
Copy link
Collaborator

@forsyth2 I also tried to create a new env and install e3sm_to_cmip. By testing a standalone run, can't reproduce the error:

e3sm_to_cmip --output-path ~/test -v 'lai' --realm lnd --input-path /lcrc/group/e3sm/ac.zhang40/zppy_test_complete_run_output/test-main2-20240216/v2.LR.historical_0201_try7/post/lnd/180x360_aave/ts/monthly/2yr --user-metadata ~/zppy/zppy/templates/e3sm_to_cmip/default_metadata.json --tables-path /lcrc/group/e3sm/diagnostics/cmip6-cmor-tables/Tables --num-proc 12

@forsyth2
Copy link
Collaborator Author

@chengzhuzhang Interesting, that command also causes the cdms2 error for me.

@forsyth2
Copy link
Collaborator Author

I wonder if it's possibly related to #490. That's the only thing I can think of re: things working for other people but not me.

@chengzhuzhang
Copy link
Collaborator

@tomvothecoder added a Makefile earlier to help make sure a clean installation that make install will clean up cache files and build files before running python -m pip install . – ensures you are installing the latest version of the branch into your dev env. Can you try make install before running python -m pip install .?

@forsyth2
Copy link
Collaborator Author

forsyth2 commented Feb 23, 2024

I wonder if it's possibly related to #490.

I don't think it's that, since that seems to be related to the zppy templateDir, and the problem exists for me using e3sm_to_cmip stand-alone.

Can you try make install before running python -m pip install .?

Yes, let me try that.

@forsyth2
Copy link
Collaborator Author

forsyth2 commented Feb 23, 2024

@chengzhuzhang thanks, the make install was indeed the missing step. That stand-alone command and the ts task both now work for me.

@forsyth2
Copy link
Collaborator Author

Resolved by E3SM-Project/e3sm_to_cmip#244 (and by #556 which allowed for testing of the previous PR)

@tomvothecoder
Copy link
Collaborator

Resolved by E3SM-Project/e3sm_to_cmip#244 (and by #556 which allowed for testing of the previous PR)

I'm happy to hear that my PR actually fixed the issue. I'll release a new RC for e3sm_to_cmip to get that fix into E3SM-Unified.

@forsyth2
Copy link
Collaborator Author

Awesome, thanks @tomvothecoder!

@forsyth2
Copy link
Collaborator Author

@chengzhuzhang @tomvothecoder I ran into the e3sm_to_cmip concurrent/futures/_base.py problem again using E3SM Unified 1.10.0rc1 for a different cfg (not the test suite).

cfg:

[default]
input = /lcrc/group/e3sm2/ac.golaz/E3SMv3/v3.LR.piControl
output = /lcrc/group/e3sm/ac.forsyth2/zppy_p1_output/v3.LR.piControl
case = v3.LR.piControl
www = /lcrc/group/e3sm/public_html/diagnostic_output/ac.forsyth2/zppy_p1_www
partition = compute
environment_commands = "source /lcrc/soft/climate/e3sm-unified/test_e3sm_unified_1.10.0rc1_chrysalis.sh"

[ts]
active = True
walltime = "00:50:00"

  [[ atm_monthly ]]
  frequency = "monthly"
  input_files = "eam.h0"
  input_subdir = "archive/atm/hist"
  ts_fmt = "cmip"
  years = "0001:0020:5",

  [[ land_monthly ]]
  extra_vars = "landfrac"
  frequency = "monthly"
  input_files = "elm.h0"
  input_subdir = "archive/lnd/hist"
  ts_fmt = "cmip"
  vars = "FSH,RH2M,LAISHA,LAISUN"
  years = "0001:0020:5",

  [[ atm_monthly_glb ]]
  input_subdir = "archive/atm/hist"
  input_files = "eam.h0"
  frequency = "monthly"
  mapping_file = "glb"
  years = "0001:0020:10",

  [[ lnd_monthly_glb ]]
  input_subdir = "archive/lnd/hist"
  input_files = "elm.h0"
  frequency = "monthly"
  mapping_file = "glb"
  vars = "FSH,RH2M,LAISHA,LAISUN"
  years = "0001:0020:10",

Error from /lcrc/group/e3sm/ac.forsyth2/zppy_p1_output/v3.LR.piControl/post/scripts/ts_atm_monthly_0001-0005-0005.o505229 :

2024-04-19 19:12:28,039_039:INFO:cmorize:pr: creating CMOR variable with CMOR axis objects.
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.10.0rc1_chrysalis/lib/python3.10/site-packages/e3sm_to_cmip/__main__.py", line 931, in _run_parallel
    out = res.result()
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.10.0rc1_chrysalis/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.10.0rc1_chrysalis/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
^M 50%|█████     | 1/2 [00:00<00:00,  5.38it/s]  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.10.0rc1_chrysalis/lib/python3.10/site-packages/e3sm_to_cmip/__main__.py", line 931, in _run_parallel
    out = res.result()
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.10.0rc1_chrysalis/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.10.0rc1_chrysalis/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
^M100%|██████████| 2/2 [00:00<00:00, 10.74it/s]

@forsyth2
Copy link
Collaborator Author

forsyth2 commented Apr 19, 2024

@chengzhuzhang was able to fix the error by setting mapping_file =map_ne30pg2_to_cmip6_180x360_aave.20200201.nc for ts atmosphere task. Important: e3sm_to_cmip needs regular lat-lon data as input (for atm, land variables)

@forsyth2
Copy link
Collaborator Author

I created E3SM-Project/e3sm_to_cmip#257

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Testing Files in `tests` modified
Projects
None yet
Development

No branches or pull requests

3 participants