When using shared memory communication, ucp_am_send_nbx hangs and callback not invoked #10370
Comments
Hi, in order to fully support ctrl-c, the connection with UCX should be established using an IP address; for example, see https://github.com/openucx/ucx/blob/master/examples/ucp_client_server.c and ucx/test/apps/iodemo/ucx_wrapper.cc, lines 813 to 840 in 79de87a.
Is this what is being done in the RPC framework? |
@yosefe Thank you for your attention to my question. Strictly speaking, I use the same method as ucp_hello_world and ucx_perftest. First, the client uses an extra OOB TCP connection to ask the server for the addresses of all its workers, and then uses those addresses to create all the ucp_ep objects:
_params.field_mask |= UCP_EP_PARAM_FIELD_REMOTE_ADDRESS;
_params.address = worker_address;
The client establishes a full connection with all the server workers so that it can hash each request to a specific worker thread for processing. Does this method not support using |
No, it doesn't. |
Hi @yosefe, thanks for answering. But if I do that, I don't know how to hash a request to a specific worker: I cannot know which server worker corresponds to the ep created on the client. Do you have any good ideas? |
@ivanallen Two possible options are to extract the client address or the client_id from the connection request; see https://github.com/openucx/ucx/blob/fa01cca77754edb7dd510190640c5feb9fb2b366/src/ucp/api/ucp.h#L1520C3-L1520C26 |
Hi @yosefe, yes, I know this API. Consider the case in the following picture: work0 in the client would like to send a request to work0 in the server. |
I see. So, to have precise control, you can have multiple listeners on different ports. |
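One way to picture the multiple-listener suggestion is a fixed mapping from server worker index to listener port, so that the client knows which worker it reaches by the port it connects to. The sketch below is only an illustration under stated assumptions: the base port, the worker count, and every `demo_` name are hypothetical and are not part of UCX or of the RPC framework discussed here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical values: the base port is arbitrary, and the worker count
 * is only an example figure. */
#define DEMO_BASE_PORT   13337u
#define DEMO_NUM_WORKERS 8u

/* Server side: worker i would create its listener on this port. */
static uint16_t demo_listener_port(unsigned worker_index)
{
    return (uint16_t)(DEMO_BASE_PORT + worker_index);
}

/* Client side: hash a request key to a server worker index. */
static unsigned demo_pick_worker(uint64_t request_key)
{
    return (unsigned)(request_key % DEMO_NUM_WORKERS);
}
```

With this mapping, a client that wants a request handled by server worker `demo_pick_worker(key)` would send it on the endpoint it created by connecting to `demo_listener_port(demo_pick_worker(key))`.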
@yosefe Thank you very much. Finally, I would like to ask why |
In order to respond to ctrl-c, there must be a kernel-based "channel" that sends a FIN message and lets the other side detect that the connection is closed. In the current design, that kind of channel is created in UCX only with IP/port-based connection establishment. |
Hi @yosefe, I have replaced the code and upgraded to 1.18.0-rc2, but the problem still exists, with one difference. Now the ucp_am_send_nbx callback on the client side is executed, but the am handler on the server side is not triggered. Of the server's eight workers, there are always a few that cannot receive requests. This only occurs at extremely high QPS. The following is the log (for the UCX log, see the attachment). You can see that request_id:1 of the client is sent successfully, but request_id:2 gets no response. request_id:1 goes to worker0 in the server, and this channel is normal. But request_id:2 goes to worker1 in the server, and that channel is not responding. (We use round robin to send requests to different server workers.) client log:
server log:
client ucx log: |
Hi @yosefe, I've got some new clues. I printed the server's FIFO data through gdb and found that the function uct_mm_iface_fifo_has_new_data evaluates to false, so new requests that subsequently enter the queue cannot be processed.
static UCS_F_ALWAYS_INLINE int
uct_mm_iface_fifo_has_new_data(uct_mm_iface_t *iface)
{
    // ((iface->read_index >> iface->fifo_shift) & 1) => (239204 >> 8) & 1 => 0
    // (iface->read_index_elem->flags & 1) => 1 & 1 => 1
    // return false
    return (((iface->read_index >> iface->fifo_shift) & 1) ==
            (iface->read_index_elem->flags & 1));
} |
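The arithmetic in the comments above can be checked in isolation. The following is a minimal sketch that copies only the comparison from the quoted function; `demo_fifo_t` and the `demo_` functions are hypothetical stand-ins, not UCX types, and the values are the ones reported from gdb.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the receive FIFO state. */
typedef struct {
    uint64_t read_index; /* index of the next element the consumer reads */
    unsigned fifo_shift; /* log2 of the FIFO size (8 => 256 elements) */
} demo_fifo_t;

/* Owner bit the consumer expects at read_index. It flips every time
 * read_index completes another lap around the FIFO. */
static int demo_expected_owner_bit(const demo_fifo_t *fifo)
{
    return (int)((fifo->read_index >> fifo->fifo_shift) & 1);
}

/* Same comparison as the quoted function: the element counts as new
 * data only if its owner bit equals the expected bit. */
static int demo_has_new_data(const demo_fifo_t *fifo, uint8_t elem_flags)
{
    return demo_expected_owner_bit(fifo) == (elem_flags & 1);
}
```

With `read_index = 239204` and `fifo_shift = 8`, the expected bit is `(239204 >> 8) & 1 = 934 & 1 = 0`, while the element's flag bit is 1, so the check fails and the consumer does not advance past this element.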
Hi @yosefe, I may have found the root cause of the problem: ucx/src/uct/sm/mm/base/mm_ep.c, lines 375 to 378 in 9ce35d0.
When the client crashes before line 378, the statement Should ucx add a check to mark that the process to which the elem belongs no longer exists, so that processing of the elem can be skipped? Or some other solution? I can reproduce the problem perfectly with fault injection.
Hang! |
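The failure mode described above can be reproduced in a small single-threaded model. This is a sketch under assumptions, not the UCX implementation: it assumes a producer first reserves a slot by advancing a shared head index and only later publishes the slot by setting its owner bit, and it models a crash as a reservation that is never published. All `demo_` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_FIFO_SHIFT 2u
#define DEMO_FIFO_SIZE  (1u << DEMO_FIFO_SHIFT)

typedef struct {
    uint8_t  flags[DEMO_FIFO_SIZE]; /* bit 0 is the owner bit */
    uint64_t head;                  /* next slot a producer reserves */
    uint64_t read_index;            /* next slot the consumer reads */
} demo_mm_fifo_t;

static void demo_fifo_init(demo_mm_fifo_t *fifo)
{
    /* Owner bit starts at 1, so slots look empty during the first lap,
     * where the expected bit is 0. */
    memset(fifo->flags, 1, sizeof(fifo->flags));
    fifo->head       = 0;
    fifo->read_index = 0;
}

/* Step 1 of a send: reserve a slot. */
static uint64_t demo_reserve(demo_mm_fifo_t *fifo)
{
    return fifo->head++;
}

/* Step 2 of a send: publish the slot by writing its owner bit. */
static void demo_publish(demo_mm_fifo_t *fifo, uint64_t index)
{
    fifo->flags[index & (DEMO_FIFO_SIZE - 1)] =
        (uint8_t)((index >> DEMO_FIFO_SHIFT) & 1);
}

/* Consumer: advance while the current slot carries the expected bit. */
static unsigned demo_poll(demo_mm_fifo_t *fifo)
{
    unsigned consumed = 0;
    while (((fifo->read_index >> DEMO_FIFO_SHIFT) & 1) ==
           (uint64_t)(fifo->flags[fifo->read_index & (DEMO_FIFO_SIZE - 1)] & 1)) {
        fifo->read_index++;
        consumed++;
    }
    return consumed;
}

/* Three producers send one message each; the second one may "crash"
 * between reserving and publishing. Returns how many the consumer saw. */
static unsigned demo_run(int second_producer_crashes)
{
    demo_mm_fifo_t fifo;
    demo_fifo_init(&fifo);

    demo_publish(&fifo, demo_reserve(&fifo));     /* producer A */
    uint64_t slot_b = demo_reserve(&fifo);        /* producer B reserves */
    if (!second_producer_crashes) {
        demo_publish(&fifo, slot_b);
    }
    demo_publish(&fifo, demo_reserve(&fifo));     /* producer C */

    return demo_poll(&fifo);
}
```

In this model `demo_run(0)` delivers all three messages, while `demo_run(1)` delivers only the first: the consumer waits forever on the reserved but unpublished slot, and the message published behind it is never seen, which matches the hang reported in this thread.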
@ivanallen Currently, the shared memory transport does not support detecting a remote process crash. Setting UCX_MM_ERROR_HANDLING=y disregards this limitation and enables selecting the shared memory transport. |
Hi @yosefe, thank you for your reply. If that's the case, then it seems that the shared memory transport cannot be used in a production environment: there appears to be no way for the server to heal itself other than restarting. I still hope that UCX can resolve this issue, although it seems quite challenging at the moment. We have related experience in production, where we maintain a separate FIFO queue for each producer (an SPSC FIFO). If a producer fails, the server removes the FIFO associated with that producer. The drawback is that you may need to maintain a large number of FIFO queues, but this is not much of a problem in our environment at the moment. |
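The per-producer SPSC approach described above could look roughly like the sketch below. It is a single-threaded illustration with hypothetical `demo_` names. How the server learns that a producer has failed is assumed to happen elsewhere, and a real shared-memory version would need atomics and memory barriers on `head` and `tail`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_MAX_PRODUCERS 4
#define DEMO_QUEUE_DEPTH   8

/* One queue per producer: only that producer writes head,
 * only the server writes tail. */
typedef struct {
    int      in_use;
    unsigned head;
    unsigned tail;
    int      items[DEMO_QUEUE_DEPTH];
} demo_spsc_t;

typedef struct {
    demo_spsc_t queues[DEMO_MAX_PRODUCERS];
} demo_server_t;

static void demo_attach(demo_server_t *s, int id)
{
    memset(&s->queues[id], 0, sizeof(s->queues[id]));
    s->queues[id].in_use = 1;
}

/* Called once the producer is known to have failed: the whole queue,
 * including any half-written entry, is discarded. */
static void demo_detach(demo_server_t *s, int id)
{
    memset(&s->queues[id], 0, sizeof(s->queues[id]));
}

static int demo_push(demo_server_t *s, int id, int value)
{
    demo_spsc_t *q = &s->queues[id];
    if (!q->in_use || (q->head - q->tail) == DEMO_QUEUE_DEPTH) {
        return -1; /* detached or full */
    }
    q->items[q->head % DEMO_QUEUE_DEPTH] = value;
    q->head++;
    return 0;
}

/* Server: drain every attached queue; a dead producer's queue is
 * simply absent, so it cannot block the others. */
static size_t demo_poll(demo_server_t *s, int *out, size_t max)
{
    size_t n = 0;
    for (int id = 0; id < DEMO_MAX_PRODUCERS; ++id) {
        demo_spsc_t *q = &s->queues[id];
        while (q->in_use && q->tail != q->head && n < max) {
            out[n++] = q->items[q->tail % DEMO_QUEUE_DEPTH];
            q->tail++;
        }
    }
    return n;
}
```

The trade-off is the one noted in the comment above: memory and polling cost grow with the number of producers, in exchange for a failed producer affecting only its own queue.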
Describe the bug
We wrote an RPC framework based on UCP active messages (am). The client of the RPC framework uses ucp_am_send_nbx to send the request message. After the server processes the request, it sends the response message back to the client, also through ucp_am_send_nbx.
When I was testing UCX_TLS=shm,tcp with rpc_press (a tool for performance-testing the RPC framework), the press tool could hang if I interrupted it with ctrl-c and then ran rpc_press again. At that point, I used another simple echo client to send a request to the server; it also hung and got no response.
After analysis, we found that the client's ucp_am_send_nbx callback was also not executed.
If I use UCX_TLS=tcp, it works. However, UCX_TLS=shm,tcp does not work.
To recover, you must restart the server.
Steps to Reproduce
xrpc_server is a simple echo server based on ucx.
rpc_press starts 32 threads that connect to xrpc_server and send requests.
Start rpc_press again, and it hangs!
Use xrpc_client to send a request, and it hangs too!
Setup and versions
ucx log:
ucx_log.txt