When running some tests with higher rank counts, random ranks fail with a "bad file descriptor" error. For example, test/testcases/test_erange.c fails when running 6 ranks on one node with the following errors:
+ srun --overlap -n 6 -N 1 ./test_erange /unifyfs/testfile.nc
MPI error (MPI_File_write_at_all) : Other I/O error , error stack:
ADIOI_GEN_WRITECONTIG(78): Other I/O error Bad file descriptor
Error at line 107 in test_erange.c: (NC_EWRITE)
MPI error (MPI_File_read_at_all) : Other I/O error , error stack:
ADIOI_GEN_READCONTIG(75): Other I/O error Bad file descriptor
Error at line 109 in test_erange.c: (NC_EREAD)
Error at line 111: unexpected read value 3 (expecting 255)
MPI error (MPI_File_write_at_all) : Other I/O error , error stack:
ADIOI_GEN_WRITECONTIG(78): Other I/O error Bad file descriptor
Error at line 117 in test_erange.c: (NC_EWRITE)
MPI error (MPI_File_read_at_all) : Other I/O error , error stack:
ADIOI_GEN_READCONTIG(75): Other I/O error Bad file descriptor
Error at line 119 in test_erange.c: (NC_EREAD)
Error at line 121: unexpected read value 0 (expecting -128)
MPI error (MPI_File_write_at_all) : Other I/O error , error stack:
ADIOI_GEN_WRITECONTIG(78): Other I/O error Bad file descriptor
MPI error (MPI_File_write_at_all) : Other I/O error , error stack:
ADIOI_GEN_WRITECONTIG(78): Other I/O error Bad file descriptor
MPI error (MPI_File_write_at_all) : Other I/O error , error stack:
ADIOI_GEN_WRITECONTIG(78): Other I/O error Bad file descriptor
Error at line 155 in test_erange.c: (NC_EWRITE)
MPI error (MPI_File_read_at_all) : Other I/O error , error stack:
ADIOI_GEN_READCONTIG(75): Other I/O error Bad file descriptor
Error at line 157 in test_erange.c: (NC_EREAD)
Error at line 159: unexpected read value -48 (expecting -128)
MPI error (MPI_File_close) : Other I/O error , error stack:
ADIOI_GEN_CLOSE(120): Other I/O error Bad file descriptor
Error at line 191 in test_erange.c: (NC_EFILE)
*** TESTING C test_erange for checking for NC_ERANGE ------ fail with 15 mismatches
MPI error (MPI_File_write_at_all) : Other I/O error , error stack:
ADIOI_GEN_WRITECONTIG(78): Other I/O error Bad file descriptor
Error at line 226 in test_erange.c: (NC_EWRITE)
MPI error (MPI_File_read_at_all) : Other I/O error , error stack:
ADIOI_GEN_READCONTIG(75): Other I/O error Bad file descriptor
Error at line 229 in test_erange.c: expecting NC_ERANGE but got NC_EREAD
MPI error (MPI_File_write_at_all) : Other I/O error , error stack:
ADIOI_GEN_WRITECONTIG(78): Other I/O error Bad file descriptor
MPI error (MPI_File_write_at_all) : Other I/O error , error stack:
ADIOI_GEN_WRITECONTIG(78): Other I/O error Bad file descriptor
MPI error (MPI_File_write_at_all) : Other I/O error , error stack:
ADIOI_GEN_WRITECONTIG(78): Other I/O error Bad file descriptor
Error at line 248 in test_erange.c: (NC_EWRITE)
MPI error (MPI_File_read_at_all) : Other I/O error , error stack:
ADIOI_GEN_READCONTIG(75): Other I/O error Bad file descriptor
Error at line 250 in test_erange.c: expecting NC_ERANGE but got NC_EREAD
MPI error (MPI_File_close) : Other I/O error , error stack:
ADIOI_GEN_CLOSE(120): Other I/O error Bad file descriptor
Error at line 252 in test_erange.c: (NC_EFILE)
srun: error: quartz5: tasks 0-5: Exited with exit code 1
When creating a file in PnetCDF, one can use the NC_CLOBBER flag, which indicates that any existing file should first be unlinked or truncated to 0 bytes.
For regular files, the implementation calls unlink or MPI_File_delete on rank 0:
https://github.com/Parallel-NetCDF/PnetCDF/blob/e47596438326bfa7b9ed0b3857800d3a0d09ff1a/src/drivers/ncmpio/ncmpio_create.c#L132-L153
while all other ranks wait in a call to MPI_Bcast to be signaled by rank 0 that it has deleted the file.
https://github.com/Parallel-NetCDF/PnetCDF/blob/e47596438326bfa7b9ed0b3857800d3a0d09ff1a/src/drivers/ncmpio/ncmpio_create.c#L202-L205
The test program, test/testcases/test_erange.c, contains a number of consecutive test cases:
https://github.com/Parallel-NetCDF/PnetCDF/blob/e47596438326bfa7b9ed0b3857800d3a0d09ff1a/test/testcases/test_erange.c#L287-L288
that each open the same file with NC_CLOBBER:
https://github.com/Parallel-NetCDF/PnetCDF/blob/e47596438326bfa7b9ed0b3857800d3a0d09ff1a/test/testcases/test_erange.c#L49-L50
I believe our delayed unlink may be the cause. I think the file (and its descriptor) gets deleted in the background on some ranks after the file has been opened and while it is in use. A future write call then fails when it detects that the file descriptor is no longer valid.
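For contrast, here is a minimal sketch of the behavior an application gets from a local POSIX file system, where a descriptor opened before an unlink stays usable until it is closed. The helper name write_after_unlink is hypothetical, and the sketch says nothing about how UnifyFS implements its delayed unlink. It only illustrates the semantics that the failing write calls seem to be missing.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create `path`, unlink it while the descriptor is
 * still open, then write through that descriptor.
 * Returns the number of bytes written after the unlink, or -1 on error. */
ssize_t write_after_unlink(const char *path)
{
    const char msg[] = "still writable";
    ssize_t n;
    int fd;

    fd = open(path, O_CREAT | O_TRUNC | O_RDWR, 0600);
    if (fd < 0)
        return -1;

    /* Remove the directory entry while fd is still open. */
    if (unlink(path) != 0) {
        close(fd);
        return -1;
    }

    /* On a local POSIX file system the open descriptor remains valid,
     * so this write is expected to succeed rather than fail with EBADF. */
    n = write(fd, msg, strlen(msg));

    close(fd);
    return n;
}
```

If the hypothesis above is right, the equivalent sequence under the delayed unlink would instead surface as the "Bad file descriptor" errors in the log.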
Setting UNIFYFS_CLIENT_UNLINK_USECS does not help in this case, or at least the values of 1 sec and 10 secs do not help. I'm not yet sure why.
PnetCDF happens to have a code path in which it truncates the file rather than deletes it. It uses this for symlinks, but it deletes regular files. Hacking this so that PnetCDF truncates regular files (rather than deleting them) allows the test case to pass. That amounts to commenting out this line:
https://github.com/Parallel-NetCDF/PnetCDF/blob/e47596438326bfa7b9ed0b3857800d3a0d09ff1a/src/drivers/ncmpio/ncmpio_create.c#L103
Short term, PnetCDF users who run into this bug may need the above one-line patch.
Medium term, perhaps the PnetCDF team would be open to defining a new hint to enable users to select the truncate path rather than the delete path.
Long term, we should fix our unlink implementation. #744