PnetCDF test leads to margo error, which leads to hang in ROMIO #783

Open
adammoody opened this issue Jul 7, 2023 · 0 comments

adammoody commented Jul 7, 2023

While running a particular PnetCDF test

https://github.com/Parallel-NetCDF/PnetCDF/blob/master/test/largefile/high_dim_var.c

with 4 ranks on 2 nodes, a read from rank 2 triggers a failure on the server, which generates the following logs:

2023-07-06T16:00:56 tid=872735 @ signal_new_requests() [unifyfs_request_manager.c:269] signaling new requests
2023-07-06T16:00:56 tid=873012 @ request_manager_thread() [unifyfs_request_manager.c:1802] RM[1511587981:1] got work
2023-07-06T16:00:56 tid=873012 @ rm_process_client_requests() [unifyfs_request_manager.c:1631] processing 1 client requests
2023-07-06T16:00:56 tid=873012 @ process_read_rpc() [unifyfs_request_manager.c:1324] processing mread[0] with 1 requests
2023-07-06T16:00:56 tid=873012 @ submit_read_request() [unifyfs_fops_rpc.c:252] handling read request (1 extents)
2023-07-06T16:00:56 tid=873012 @ pull_margo_bulk_buffer() [../../common/src/unifyfs_rpc_util.c:179] margo_bulk_transfer(buf_offset=0, len=1572864) failed
2023-07-06T16:00:56 tid=873012 @ pull_margo_bulk_buffer() [../../common/src/unifyfs_rpc_util.c:197] failed bulk transfer - transferred 0 of 1572864 bytes
2023-07-06T16:00:56 tid=873012 @ unifyfs_invoke_find_extents_rpc() [unifyfs_p2p_rpc.c:665] failed to get bulk chunk locations
2023-07-06T16:00:56 tid=873012 @ submit_read_request() [unifyfs_fops_rpc.c:279] failed to find extent locations
2023-07-06T16:00:56 tid=873012 @ process_read_rpc() [unifyfs_request_manager.c:1333] unifyfs_fops_read() failed
2023-07-06T16:00:56 tid=873012 @ rm_process_client_requests() [unifyfs_request_manager.c:1690] client rpc request 0 failed ("Mercury/Argobots operation error")
2023-07-06T16:00:56 tid=873012 @ request_manager_thread() [unifyfs_request_manager.c:1768] failed to process client rpc requests
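
To pull just the failure chain out of a longer server log, something like the following may help. It is only a sketch: the regular expression is inferred from the line layout shown above (timestamp, `tid=`, function, `[file:line]`, message), and `failure_chain` and `LOG_LINE` are hypothetical names, not part of UnifyFS.

```python
import re

# Layout assumed from the log excerpt above:
#   <timestamp> tid=<id> @ <function>() [<file>:<line>] <message>
LOG_LINE = re.compile(
    r"^(?P<time>\S+) tid=(?P<tid>\d+) @ (?P<func>\w+)\(\) "
    r"\[(?P<file>[^:\]]+):(?P<line>\d+)\] (?P<msg>.*)$"
)


def failure_chain(log_text):
    """Return (function, file, line, message) for each line whose message mentions a failure."""
    chain = []
    for raw in log_text.splitlines():
        match = LOG_LINE.match(raw.strip())
        if match and "fail" in match.group("msg"):
            chain.append(
                (
                    match.group("func"),
                    match.group("file"),
                    int(match.group("line")),
                    match.group("msg"),
                )
            )
    return chain


# Three lines copied from the log above; only the last two report a failure.
sample = """\
2023-07-06T16:00:56 tid=873012 @ submit_read_request() [unifyfs_fops_rpc.c:252] handling read request (1 extents)
2023-07-06T16:00:56 tid=873012 @ pull_margo_bulk_buffer() [../../common/src/unifyfs_rpc_util.c:179] margo_bulk_transfer(buf_offset=0, len=1572864) failed
2023-07-06T16:00:56 tid=873012 @ unifyfs_invoke_find_extents_rpc() [unifyfs_p2p_rpc.c:665] failed to get bulk chunk locations
"""

for func, path, line, msg in failure_chain(sample):
    print(f"{func} ({path}:{line}): {msg}")
```

The first entry in the result is the innermost failure, which here is the `margo_bulk_transfer` call in `pull_margo_bulk_buffer()`.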

The error code returned to the client for the read is 1004. That probably corresponds to one of these:

https://github.com/mercury-hpc/mercury/blob/55b95f72714bb0e4e0deeedf4fd78d116ea9476a/src/mercury_core_types.h#L102-L108

The read error happens during MPI_File_read_at_all, which then leads to a deadlock in ROMIO:
pmodels/mpich#6585
