The test/nonblocking/mcoll_perf.c test reports incorrect data when it compares two files that were written in two different ways and should have identical content.
    cd test/nonblocking
    srun -n2 ./mcoll_perf /unifyfs/testfile.nc
    <snip>
    P0: diff at line 282 variable[2] var1_2: NC_INT buf1 != buf2 at position 32762
Tracing the pwrite and pread calls under a debugger shows that both ranks write to the same byte offsets without any synchronization in between. In this case, rank 1 writes a fill value and rank 0 later writes actual data, so which value ends up in the file depends on which write lands last.
In the collective write that follows the fill, rank 0 writes to (offset=640, length=16) and (offset=672, length=16), which overlaps with the region that rank 1 wrote to during the fill operation.
The test case can be fixed by adding a call to ncmpi_sync(ncid) between the fill loop and the write loop:
    for (i=2; i<nvars; i++){
        /* fill record variables to silence valgrind complaining about uninitialised bytes */
        for (j=0; j<array_of_gsizes[0]; j++) {
            err = ncmpi_fill_var_rec(ncid, varid[i], j);
            CHECK_ERR
        }
    }
    ncmpi_sync(ncid); // <--- add sync here to fix the test case
    for (i=0; i<nvars; i++){
        err = ncmpi_put_vara_all(ncid, varid[i], starts[i], counts[i], buf[i], bufcounts[i], MPI_INT);
        CHECK_ERR
    }
For reference, here is the sequence of (offset, length) values for writes from different ranks when k==0. There are multiple overlapping writes, one of which is shown below:
The fill call is here:
https://github.com/Parallel-NetCDF/PnetCDF/blob/bb59553ca3542bc09ead12c6ce4e65b913ef51fa/test/nonblocking/mcoll_perf.c#L521
When filling variable 2, rank 1 writes to (offset=648, length=8) and (offset=680, length=8).
The write call is here:
https://github.com/Parallel-NetCDF/PnetCDF/blob/bb59553ca3542bc09ead12c6ce4e65b913ef51fa/test/nonblocking/mcoll_perf.c#L526
The corresponding rank 0 writes, (offset=640, length=16) and (offset=672, length=16), cover both of those regions.