-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Address "flaky" ts
subtask in complete_run
test
#549
Comments
@tomvothecoder Do you happen to have any thoughts on this? Unfortunately, integration tests are more state-dependent than unit tests, and on top of that, |
Re: remaining question (2), seeing |
@forsyth2 I'm trying to study what is a flaky test. It is for repeating a same test many times, but get different results. In your cases, among the 4 runs, which are the same tests? I might be a little confused. |
I suppose I'm using the term somewhat loosely. My point is that the |
@forsyth2 I think I found a bug when testing the configuration file copied from #549 (comment).
|
@forsyth2 I think I figured what caused the flaky test documented here and potentially the bundle run test documented in This is a bug in setting up
Whether the |
@chengzhuzhang Thank you for identifying the problem. We'll have to set up separate subdirectories to avoid this.
I think we would add a directory after Note: another option is to force Lines 88 to 91 in a902549
and Lines 244 to 250 in a902549
|
I tried out the solution to save data separately, unfortunately the problem persist... |
And I can provide a minimum example that runs e3sm_to_cmip standalone, but won't be able to reproduce the errors.
|
I tried as well and indeed it persists. Unfortunately, the concurrency error message from |
Actually, I'm not even sure running sequentially will help. I tried rerunning Is it possible the error is in |
@tomvothecoder could you please help look at this issue. At this point, we think the zppy issue is caused by how e3sm_to_cmip handles concurrency. The error can't be reproduced with single e3sm_to_cmip runs. |
Sure I can take a look. What is the priority for this issue and did we want to get it fixed before E3SM-Unified 1.9.3? If I understand correctly, E3SM-Unified 1.9.3 will probably be delayed for further |
@tomvothecoder thanks. Yes, addressing this issue would be a high priority for the E3SM-Unified 1.9.3 release. Ryan and I have minimal examples to test e3sm_to_cmip standalone or through zppy. We can chat off line for speeding up testing. |
Thanks @tomvothecoder!! |
Got it, I'll focus on it tomorrow and let you know how it goes. I do have AWS on Friday, so hopefully I make some good progress by then. |
@tomvothecoder If it's useful, I tried doing some more debugging in my testing-updates PR (#554), but it's mostly just giving the same errors I've already described in this issue. |
@chengzhuzhang What is the absolute directory for EDIT: Nevermind that's just the file in the |
@tomvothecoder yes! It's my home directory where I git cloned zppy. And the copy was originally taken from e3sm_to_cmip. |
@chengzhuzhang just like you, I can't reproduce this issue running My AnalysisI noticed that The errors mentioned below are related to I/O. Specifically, one of them is raised by Xarray. RE:
|
I just merged E3SM-Project/e3sm_to_cmip#244. Is there any way to test |
Yes, it's possible but annoying. (Because In any case, I will try to test it out. Thanks for working on this. |
Actually looking at c939071, I suppose the easier-to-implement alternative is to just keep |
@tomvothecoder I'm testing your changes over on #556, but I'm getting the following
|
@forsyth2 I believe I found some relevant issues: E3SM-Project/e3sm_to_cmip#115 |
@chengzhuzhang Yes, that's what I thought as well, but I just created a new dev environment off |
@chengzhuzhang, No I'm still definitely getting the
That seems to match the directions at https://github.com/E3SM-Project/e3sm_to_cmip?tab=readme-ov-file#2-conda-development-environment-and-source-code. I was reading the thread on #115, but shouldn't that all be resolved now? |
@forsyth2 I also tried to create a new env and install e3sm_to_cmip. By testing a standalone run, can't reproduce the error:
|
@chengzhuzhang Interesting, that command also causes the |
I wonder if it's possibly related to #490. That's the only thing I can think of re: things working for other people but not me. |
@tomvothecoder added a |
I don't think it's that, since that seems to be related to the
Yes, let me try that. |
@chengzhuzhang thanks, the |
Resolved by E3SM-Project/e3sm_to_cmip#244 (and by #556 which allowed for testing of the previous PR) |
I'm happy to hear that my PR actually fixed the issue. I'll release a new RC for |
Awesome, thanks @tomvothecoder! |
@chengzhuzhang @tomvothecoder I ran into the cfg:
Error from
|
@chengzhuzhang was able to fix the error by setting |
I created E3SM-Project/e3sm_to_cmip#257 |
Request criteria
Issue description
The
complete_run
integration test is "flaky" (e.g., as in https://docs.gitlab.com/ee/development/testing_guide/flaky_tests.html). Specifically, the[[ ts ]] > [[ land_monthly ]]
subtask.I was able to get the test passing while testing #424. However, it appears I never ran the test straight-through and instead did several re-runs, which actually had the effect of creating a different final output than if it had been run straight through.
I came across this while testing #548. The subtask was suddenly failing despite no relevant changes being made. I then tested on the
main
branch (specifically at a902549) and encountered the same errors. I found the following.Run 1
Relevant sections from
tests/integration/generated/test_complete_run_chrysalis.cfg
:Then:
Run 2
I changed
vars = "FSH,RH2M"
tovars = "LAISHA,LAISUN"
and re-ran (without deleting the output directory, meaning only failing jobs were re-ran).A similar
cannot stat
withCMIP6/CMIP
line appears in issues #523 (comment), #543 and in PRs #424 (comment), #533, #534 (comment).Run 3
I re-ran (without deleting the output directory, meaning only failing jobs were re-ran), without changing anything.
Run 4
I deleted the ilamb status files and re-ran (without deleting the output directory, meaning only the status-file-missing ILAMB jobs were re-ran)
Conclusions
The
ts_land_monthly
subtask is flaky. Looking at https://docs.gitlab.com/ee/development/testing_guide/flaky_tests.html, it is flaky in the sense that there is a state leak.=> While it is useful to have zppy only re-run the failed jobs, this does not in fact guarantee that all jobs would have succeeded had the latest
cfg
been used from the start!=> All pre-PR-merging
zppy
runs should use a clean output directory, in order to produce valid integration test results.Remaining questions
So, what went wrong with testing Native EAMxx support #424? Setting
vars = "FSH,LAISHA,LAISUN,RH2M"
caused an error (Native EAMxx support #424 (comment)) which was actually equivalent to the error in Run 2 above. Rather than re-running, I removed the erroring variablesLAISHA,LAISUN
(Native EAMxx support #424 (comment)) and got the job to work. But that doesn't make sense because that means Native EAMxx support #424 testing worked withvars = "FSH,RH2M"
, which clearly causes errors in Run 1 above.Why did Run 3 work simply by re-running without changing anything? I have a vague memory of there potentially being some sort of race or lock condition in accessing the files in
post/...
. Could that be what's going on? Could the twots_land_monthly
jobs possibly be interacting with each other in any way on Run 1?The text was updated successfully, but these errors were encountered: