You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
When the test file already exists, setting the "remove and reuse test file" flag (-r) would easily trigger the bug. mpirun -n 2 write-static -n 1 -r
Issue
At the end of the test_remove_file(), we do a barrier() to sync processes.
The problem is, there is a potential execution path where one or more processes may exit the function earlier, causing wrong barrier matching.
This part of code assumes all ranks check the existence of file, get rc = 0, and continue.
Problematic execution sequence:
Rank 0 calls stat(), get rc = 0, then go ahead delete the file. Then Rank 1 came, its stat() call will return -1 since Rank 0 has already deleted the file. Then Rank 1 exits the function without calling the barrier() at the end.
Fix
Add a barrier immediately after stat() to make sure everyone sees the same result.
Overall, we should be careful when using "return" if we will do a barrier later. We need to make sure all processes will always follow the same path: either all return, or all call barrier.
The text was updated successfully, but these errors were encountered:
wangvsa
added a commit
to wangvsa/UnifyFS
that referenced
this issue
Jul 1, 2024
Describe the problem you're observing
I was doing some testing using the write-static code and found a subtle bug in
UnifyFS/examples/src/testutil.h
Line 1251 in 58ece44
Describe how to reproduce the problem
When the test file already exists, setting the "remove and reuse test file" flag (-r) would easily trigger the bug.
mpirun -n 2 write-static -n 1 -r
Issue
At the end of the
test_remove_file()
, we do abarrier()
to sync processes.The problem is, there is a potential execution path where one or more processes may exit the function earlier, causing wrong barrier matching.
UnifyFS/examples/src/testutil.h
Lines 1262 to 1268 in 58ece44
This part of code assumes all ranks check the existence of file, get rc = 0, and continue.
Problematic execution sequence:
Rank 0 calls stat(), get rc = 0, then go ahead delete the file. Then Rank 1 came, its stat() call will return -1 since Rank 0 has already deleted the file. Then Rank 1 exits the function without calling the barrier() at the end.
Fix
Add a barrier immediately after stat() to make sure everyone sees the same result.
Overall, we should be careful when using "return" if we will do a barrier later. We need to make sure all processes will always follow the same path: either all return, or all call barrier.
The text was updated successfully, but these errors were encountered: