The GFS_phys_time_vary_init routine is parallelized using OpenMP sections, but it does not handle errmsg or errflg correctly. All threads update the same errmsg and errflg, so a failure reported by one section can be overwritten by a success reported by a section that finishes later.
To visualize this, suppose two threads are running at once. For simplicity's sake, let's say there are only two initialization calls: init_that_fails() and init_that_succeeds().
Failure happens first
Events happen in this order:
Thread 1: Completes init_that_fails() and sets errflg=1
Thread 2: Completes init_that_succeeds() and sets errflg=0
The final errflg is 0, so the model runs even though one of the initialization steps failed.
Failure happens second
Events happen in this order:
Thread 2: Completes init_that_succeeds() and sets errflg=0
Thread 1: Completes init_that_fails() and sets errflg=1
The final errflg is 1, so the model aborts as expected.
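The two orderings above can be replayed deterministically with a small model. This is only a sketch in Python, not the Fortran in GFS_phys_time_vary_init: the ErrorState class and the run_shared and run_private helpers are hypothetical names invented for illustration, and the steps are applied in a fixed completion order rather than by real threads. The run_private variant shows one possible remedy, assuming each section could be given its own error variables that are combined afterwards; it is not necessarily how the routine should be fixed.

```python
from dataclasses import dataclass


@dataclass
class ErrorState:
    """Hypothetical stand-in for the shared error pair (errflg, errmsg)."""
    errflg: int = 0
    errmsg: str = ""


def init_that_fails(err):
    # Reports a failure through the error state it was handed.
    err.errflg = 1
    err.errmsg = "init_that_fails: initialization failed"


def init_that_succeeds(err):
    # Reports success, clearing whatever was in the error state before.
    err.errflg = 0
    err.errmsg = ""


def run_shared(completion_order):
    """Apply the steps, in completion order, to ONE shared error state."""
    shared = ErrorState()
    for step in completion_order:
        step(shared)
    return shared  # the last writer wins


def run_private(completion_order):
    """Give each step its own error state, then keep the first failure."""
    results = []
    for step in completion_order:
        local = ErrorState()
        step(local)
        results.append(local)
    failures = [r for r in results if r.errflg != 0]
    return failures[0] if failures else ErrorState()


fails_first = [init_that_fails, init_that_succeeds]
fails_second = [init_that_succeeds, init_that_fails]

print(run_shared(fails_first).errflg)    # 0: the failure was overwritten
print(run_shared(fails_second).errflg)   # 1: the failure survived by luck
print(run_private(fails_first).errflg)   # 1
print(run_private(fails_second).errflg)  # 1
```

With a shared error state the outcome depends on which step finishes last; with per-step error states the failure is reported regardless of ordering.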
Steps to Reproduce
Delete noahmptable.tbl
Use a scheme that does not require that file.
Run the model a few times with at least two threads.
Notice that it fails sporadically instead of 100% of the time.
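Because the failure is sporadic, it can help to repeat the run and count how often it aborts. The helper below is a minimal sketch: count_failures is a hypothetical name, the launch command is left for you to supply, and it assumes the model exits with a nonzero status when initialization aborts.

```python
import subprocess


def count_failures(command, runs):
    """Run `command` `runs` times and count the nonzero exit statuses."""
    failures = 0
    for _ in range(runs):
        # capture_output keeps the model's log output off the terminal.
        completed = subprocess.run(command, capture_output=True)
        if completed.returncode != 0:
            failures += 1
    return failures
```

Pass the launch command for your own setup as a list of arguments. A count strictly between zero and the number of runs matches the sporadic behavior described above.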
Additional Context
This was discovered in an RRFS parallel. The machine, compiler, etc. don't matter. However, the easiest way to see it is to run a non-NOAHMP suite without noahmptable.tbl.
Description
Normally I don't cross-post bugs between forks, but this is a pretty big one. I want to make sure everyone is aware.
I reported it in the UFS fork already: ufs-community#105