Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PnetCDF unlink race condition in nc_test #744

Open
adammoody opened this issue Dec 1, 2022 · 0 comments
Open

PnetCDF unlink race condition in nc_test #744

adammoody opened this issue Dec 1, 2022 · 0 comments
Labels

Comments

@adammoody
Copy link
Collaborator

adammoody commented Dec 1, 2022

In PnetCDF, the nc_test has a number of quick running tests that each create a scratch file, execute I/O operations on that file, and then delete the scratch file. The consecutive sequence of tests shown here complete quickly and all use the same filename for the scratch file.

https://github.com/Parallel-NetCDF/PnetCDF/blob/6c71a30cd95f575c01025c0c926fc06dc9157774/test/nc_test/nc_test.c#L414-L420

The MPI_File_open() call of one of these tests fails when it attempts to sync extents with the server during an internal call to close(). ROMIO's MPI_File_open() calls both open() and close() on the file. The close() then tries to sync extents with the server because the file had been opened for writing.

https://github.com/Parallel-NetCDF/PnetCDF/blob/6c71a30cd95f575c01025c0c926fc06dc9157774/test/nc_test/test_write.m4#L394

The extent sync fails because the file has been deleted so that meta->fid (-1) != fid (2) at this check:

int unifyfs_fid_sync_extents(unifyfs_client* client,
int fid)
{
unifyfs_filemeta_t* meta = unifyfs_get_meta_from_fid(client, fid);
if (NULL == meta) {
LOGDBG("no filemeta for fid=%d", fid);
return UNIFYFS_SUCCESS;
} else if(meta->fid != fid) {
/* bail out with an error if we fail to find it */
LOGERR("missing filemeta for fid=%d", fid);
return UNIFYFS_FAILURE;

This situation happens because the prior test deleted the scratch file via MPI_File_delete() -> unlink().

https://github.com/Parallel-NetCDF/PnetCDF/blob/6c71a30cd95f575c01025c0c926fc06dc9157774/test/nc_test/test_write.m4#L437

That unlink() invokes an unlink rpc from client-to-server, which induces a later server-to-client unlink callback rpc that comes back to the client from the server at some future point in time. In this case, the unlink callback fires in the middle of MPI_File_open's open() and close() calls. The open() successfully recreates the file, then it is deleted due to the unlink callback, and then the close() fails to sync extents, which causes MPI_File_open to return an error:

/* if file was opened for writing, sync it */
if (fdesc->write) {
LOGDBG("syncing fid=%d", fid);
int sync_rc = unifyfs_fid_sync_extents(posix_client, fid);
if (sync_rc != UNIFYFS_SUCCESS) {
errno = unifyfs_rc_errno(sync_rc);
return -1;
}
}

@adammoody adammoody changed the title unlink race condition in PnetCDF nc_test PnetCDF unlink race condition in nc_test Jan 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Development

No branches or pull requests

1 participant