GFS_phys_time_vary_init does not report errmsg/errflg correctly due to thread race condition #1031

SamuelTrahanNOAA · 2023-09-21T16:53:13Z

Description

Normally I don't cross-post bugs between forks, but this is a pretty big one. I want to make sure everyone is aware.

I reported it in the UFS fork already: ufs-community#105

GFS_phys_time_vary_init is parallelized using OpenMP sections, but it does not handle errmsg or errflg correctly. All threads update the same errmsg and errflg, so a failure status set by one section can be overwritten by a success status from a section that finishes later.

To visualize this, suppose there are two threads running at once. For simplicity's sake, let's say there are only two initialization calls: init_that_fails() and init_that_succeeds().

Failure happens first

Events happened in this order:

Thread 1: Completes init_that_fails() and sets errflg=1
Thread 2: Completes init_that_succeeds() and sets errflg=0

The errflg is 0, so the model will run even though one of the initialization steps failed.

Failure happens second

Events happened in this order:

Thread 2: Completes init_that_succeeds() and sets errflg=0
Thread 1: Completes init_that_fails() and sets errflg=1

The errflg is 1, so the model will abort as expected.
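
The two orderings above can be replayed deterministically. The following is only an illustrative sketch in Python, not the actual Fortran/OpenMP code: the return values, the error message text, and the helpers run_shared and run_first_error_kept are hypothetical names made up for this example. It shows the last-writer-wins behavior described here and, for contrast, one possible pattern in which each step reports into its own local status and only a failure is copied to the shared pair.

```python
# Illustrative model only: the real routine is Fortran with OpenMP sections.
# All names and messages below are hypothetical stand-ins.

def init_that_fails():
    # Returns (errflg, errmsg) for a step that fails.
    return 1, "init_that_fails: required file not found"

def init_that_succeeds():
    # Returns (errflg, errmsg) for a step that succeeds.
    return 0, ""

def run_shared(order):
    """Replay step completions in the given order, with every step
    writing the same errflg/errmsg pair (the pattern in this issue)."""
    errflg, errmsg = 0, ""
    for step in order:
        errflg, errmsg = step()  # last writer wins
    return errflg, errmsg

def run_first_error_kept(order):
    """Each step writes a local status; the shared pair is updated only
    when a step fails, so a later success cannot clear the failure."""
    errflg, errmsg = 0, ""
    for step in order:
        my_errflg, my_errmsg = step()
        if my_errflg != 0 and errflg == 0:
            errflg, errmsg = my_errflg, my_errmsg
    return errflg, errmsg

failure_first = [init_that_fails, init_that_succeeds]
failure_second = [init_that_succeeds, init_that_fails]

print(run_shared(failure_first))             # failure is lost
print(run_shared(failure_second))            # failure is reported
print(run_first_error_kept(failure_first))   # failure is reported
print(run_first_error_kept(failure_second))  # failure is reported
```

In a real threaded version, the copy from the local status into the shared errflg/errmsg would itself need to be protected (for example, by an OpenMP critical region); that part is not modeled here.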

Steps to Reproduce

Delete noahmptable.tbl
Use a scheme that does not require that file.
Run the model a few times with at least two threads.
Notice that it fails sporadically instead of 100% of the time.

Additional Context

This was discovered in an RRFS parallel. The machine, compiler, etc. don't matter. However, the easiest way to see it is to run a non-NOAHMP suite without noahmptable.tbl.

SamuelTrahanNOAA added the bug label Sep 21, 2023
