Hangs with get_zcopy and cuda_ipc on dual-GPU workstations #888
Comments
Can you get a backtrace of the hung process to give us some hints? A Python one can be obtained with py-spy.
Yeah, unfortunately there's nothing useful there, as usual with Python async, see below:
$ py-spy dump -p 7886
Process 7886: python listener.py
Python v3.9.13 (/home/pentschev/miniconda3/envs/rn-221011/bin/python3.9)
Thread 7886 (idle): "MainThread"
select (selectors.py:469)
_run_once (asyncio/base_events.py:1869)
run_forever (asyncio/base_events.py:601)
run_until_complete (asyncio/base_events.py:634)
run (asyncio/runners.py:44)
<module> (listener.py:33)
$ py-spy dump -p 7885
Process 7885: python sender.py
Python v3.9.13 (/home/pentschev/miniconda3/envs/rn-221011/bin/python3.9)
Thread 7885 (idle): "MainThread"
select (selectors.py:469)
_run_once (asyncio/base_events.py:1869)
run_forever (asyncio/base_events.py:601)
run_until_complete (asyncio/base_events.py:634)
run (asyncio/runners.py:44)
<module> (sender.py:24)
We can get a little more information with gdb.
Unfortunately we have nothing useful there either. I took a look at those before as well, but we only get the regular Python stack, AFAICT. See the logs attached. One more thing that occurred to me now was to enable non-blocking UCX-Py mode. In any case, I won't spend much more time on this now; I need to test this with the new C++ implementation of UCX-Py, as perhaps the problem is already resolved there and we won't need to worry about this for much longer.
Forgot to attach the files to the previous comment; they are below.
listener-gdb.log
Thanks, this is a useful hint. I do indeed suspect that UCX-Py's blocking mode has an issue. In particular, the code @madsbk referenced from Stack Overflow to check for an idle event loop does not work! This was not obvious at the time it was adopted, but a comment from September 2020 (#415 was merged in April 2020) notes that the proposed solution is insufficient. A possible solution might be here: https://stackoverflow.com/questions/63732618/ I can try it out...
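For context, the kind of idle-loop check being discussed looks roughly like the sketch below. This is illustrative only: the helper name is made up, and it peeks at private attributes of the default asyncio event loop, which is part of why the approach is fragile.

import asyncio

def event_loop_looks_idle(loop: asyncio.AbstractEventLoop) -> bool:
    # Illustrative sketch, not the actual UCX-Py code: _ready holds callbacks
    # queued for the next loop iteration and _scheduled holds timer handles.
    # Both being empty suggests the loop has nothing to do, but the check is
    # unreliable: tasks blocked on I/O or on futures are invisible here, and
    # new work can be queued immediately after the check returns.
    return not getattr(loop, "_ready", ()) and not getattr(loop, "_scheduled", ())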
Thanks @wence-, that is a good finding! If you have a bit of time and want to give it a go, please go ahead. I would just suggest not spending too much time on it right away, as we may get better performance with the separate progress thread implemented in C++ UCX-Py, and then this may become obsolete anyway.
I've found another case where the blocking progress mode seems to be the cause of failure. This one involves managed memory: using buffers large enough that device memory must be evicted eventually leads to a UCXCanceled error, as in the traceback below.
$ UCX_RNDV_FRAG_MEM_TYPE=cuda python -m ucp.benchmarks.send_recv -l ucp-async -d 0 -e 3 -o rmm --rmm-managed-memory -n 10GiB --n-iter 3 --n-warmup-iter 0
Server Running at 10.33.225.163:60512
Client connecting to server at 10.33.225.163:60512
Process SpawnProcess-2:
Traceback (most recent call last):
File "/datasets/pentschev/miniconda3/envs/rn-221007-bounce/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/datasets/pentschev/miniconda3/envs/rn-221007-bounce/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/datasets/pentschev/src/ucx-py/ucp/benchmarks/send_recv.py", line 95, in client
loop.run_until_complete(client.run())
File "/datasets/pentschev/miniconda3/envs/rn-221007-bounce/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
return future.result()
File "/datasets/pentschev/src/ucx-py/ucp/benchmarks/backends/ucp_async.py", line 142, in run
await ep.recv(recv_msg)
File "/datasets/pentschev/src/ucx-py/ucp/core.py", line 728, in recv
ret = await comm.tag_recv(self._ep, buffer, nbytes, tag, name=log)
ucp._libs.exceptions.UCXCanceled: <[Recv #001] ep: 0x7f3af9198108, tag: 0x8d190517b568099f, nbytes: 10737418240, type: <class 'cupy.ndarray'>>:
Traceback (most recent call last):
File "/datasets/pentschev/miniconda3/envs/rn-221007-bounce/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/datasets/pentschev/miniconda3/envs/rn-221007-bounce/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/datasets/pentschev/src/ucx-py/ucp/benchmarks/send_recv.py", line 376, in <module>
main()
File "/datasets/pentschev/src/ucx-py/ucp/benchmarks/send_recv.py", line 368, in main
assert not p2.exitcode
AssertionError
If we set ...
Forgot to mention that the sample above can be reproduced on a DGX-1.
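For reference, the managed-memory setup that the benchmark command above exercises looks roughly like the following sketch. The 10 GiB size comes from the -n 10GiB flag; routing CuPy through an RMM managed-memory pool is an assumption about the benchmark's internals, and the exact calls may differ from what it actually does.

import cupy as cp
import rmm

# Back allocations with CUDA managed (unified) memory, so buffers larger than
# device memory can be paged out to host instead of failing to allocate.
rmm.reinitialize(managed_memory=True)
cp.cuda.set_allocator(rmm.rmm_cupy_allocator)

# A 10 GiB message; with several endpoints holding buffers this size, device
# memory must be evicted, which is the situation that triggers the failure.
msg = cp.zeros(10 * 1024**3, dtype="u1")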
It seems that this issue actually stems from a deadlock in UCX. What appears to occur is the ... Theoretically speaking, UCX-Py is not thread-safe, and thus switching to single-thread mode would not be a problem. However, RAFT also uses UCX-Py directly, and it's still not clear to me whether its transfers share the same thread as the Dask communication thread or occur on a separate thread. If communication only happens on the Dask comms thread, then we should be OK to just switch to single-thread mode; otherwise we will need to wait for a fix in UCX.
In UCX-Py, we used to add the UCX_RNDV_SCHEME=get_zcopy override due to potential performance issues; this was recently removed in #836 to allow bounce buffers to function. In some cases, using UCX_RNDV_SCHEME=get_zcopy (which may still be triggered with the new UCX_RNDV_SCHEME=auto default) causes UCX-Py to hang non-deterministically. This seems to be reproducible only on workstations with at least two GPUs connected via NVLink; I'm unable to reproduce it on a DGX-1, for example.

The MRE involves 3 files: a listener.py working as server, a sender.py that reproduces the hang mentioned above, and a sender2.py that does NOT reproduce the issue. The only difference between sender2.py and the original hang-reproducing sender.py is how the message to be sent is handled: it is either passed into the Python async function (hangs) or created directly within the Python async function (does NOT hang); a sketch of the difference follows the file list. The files to reproduce are attached:
listener.py
sender.py
sender2.py
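A rough sketch of the difference between the two senders is below. It is illustrative only: the address, port, message size, and function names are assumptions, not the contents of the attached files.

import asyncio
import cupy as cp
import ucp

ADDRESS, PORT, N_BYTES = "localhost", 13337, 2**30  # assumed values

async def send_preallocated(msg):
    # sender.py style: the message is created outside and passed into the
    # coroutine; this is the variant that hangs.
    ep = await ucp.create_endpoint(ADDRESS, PORT)
    await ep.send(msg)
    await ep.close()

async def send_inline():
    # sender2.py style: the message is created inside the coroutine; this
    # variant does not hang.
    ep = await ucp.create_endpoint(ADDRESS, PORT)
    await ep.send(cp.arange(N_BYTES, dtype="u1"))
    await ep.close()

if __name__ == "__main__":
    asyncio.run(send_preallocated(cp.arange(N_BYTES, dtype="u1")))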
Since the hang does not reproduce deterministically, it may be necessary to run listener/client in a loop, e.g.:
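A driver along these lines can rerun the sender repeatedly against an already-running listener (an illustrative sketch, not the exact commands from the original report; the timeout value is an assumption):

import subprocess

N_ITER = 100
TIMEOUT_S = 60  # assumed upper bound for one successful run

# Assumes listener.py is already running in another terminal.
for i in range(N_ITER):
    try:
        subprocess.run(["python", "sender.py"], check=True, timeout=TIMEOUT_S)
    except subprocess.TimeoutExpired:
        print(f"sender.py appears to be hung on iteration {i}")
        break
    print(f"iteration {i} completed")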
By replacing sender.py in the client loop with sender2.py, all 100 iterations should complete without any hangs, whereas with sender.py the hang normally occurs in under 10 iterations. Running with UCX_RNDV_SCHEME=put_zcopy instead should not reproduce the hang either.

Given that the error depends on how the message is passed via the async Python interface, I suspect it has something to do with how the event loop is executing, but I do not have any solid evidence for that at the moment.
cc @rlratzel