[core][compiled graphs] Failing test: test_torch_tensor_dag.py::test_torch_tensor_exceptions[static_shape=True, direct_return=True, overlap_gpu_communication=True]
#48747
Comments
test_torch_tensor_dag.py::test_torch_tensor_exceptions[static_shape=True, direct_return=True, overlap_gpu_communication=True]
Hi @AndyUB, thanks for reporting! This does sound like a bug in the overlap functionality. However, I was not able to reproduce it when I reran the test. In your experience, how often does this fail? Is there any way to make it more reproducible?
btw, I think the issue is likely due to how the NCCL recv is overlapped in the compiled-graph code path, and the receiver ends up reading the buffer before the recv has finished. To fix this, we will probably need to make the CPU synchronize on the recv event.
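A minimal sketch of that kind of CPU-side synchronization, assuming a CUDA device and using illustrative names (`recv_stream`, `recv_done`) rather than Ray's actual internals:

```python
import torch

# Illustrative only: record an event on the recv stream right after the
# NCCL recv is enqueued, then block the CPU on that event before handing
# the received tensor to the reader.
recv_stream = torch.cuda.Stream()

with torch.cuda.stream(recv_stream):
    # ... the NCCL recv into the pre-allocated buffer would be enqueued here ...
    recv_done = torch.cuda.Event()
    recv_done.record(recv_stream)

# CPU waits until the recv has actually completed on the GPU; only after
# this is it safe to read the buffer without a GPU-side stream dependency.
recv_done.synchronize()
```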
Re: test repro, you could try to insert a sleep on the recv stream before queuing the recv.
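One way to approximate that sleep without touching NCCL internals is to enqueue throwaway work on the recv stream so the actual recv lands later, widening the race window. A rough sketch, assuming a CUDA device (the stream name and work sizes are made up; PyTorch's internal `torch.cuda._sleep` helper could serve a similar purpose):

```python
import torch

recv_stream = torch.cuda.Stream()

with torch.cuda.stream(recv_stream):
    # Hypothetical artificial delay: keep the recv stream busy with useless
    # matmuls so that a reader which does not wait on the recv observes the
    # stale (all-zeros) buffer more reliably.
    junk = torch.randn(4096, 4096, device="cuda")
    for _ in range(50):
        junk = junk @ junk
    # ... the NCCL recv would be enqueued here, after the delay ...
```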
Not sure that's the whole story: the read of the item requires GPU->CPU movement and is supposed to be queued on the compute stream after syncing on the recv stream. It would be good to check that the read of the item is happening on the expected stream.
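A sketch of the intended ordering being described here, where the GPU->CPU read is queued on the compute stream only after that stream has waited on the recv stream (names are illustrative and assume a CUDA device):

```python
import torch

recv_stream = torch.cuda.Stream()
compute_stream = torch.cuda.Stream()
buf = torch.zeros(1024, device="cuda")  # receiver-side buffer

with torch.cuda.stream(recv_stream):
    # ... the NCCL recv into `buf` would be enqueued here ...
    pass

# The compute stream must not touch `buf` until the recv stream is done.
compute_stream.wait_stream(recv_stream)

with torch.cuda.stream(compute_stream):
    # GPU->CPU movement of the item, ordered after the recv by wait_stream().
    host_copy = buf.to("cpu", non_blocking=True)

# The CPU still needs to wait for the copy before using host_copy.
compute_stream.synchronize()
```

If the read were instead issued on a stream with no dependency on `recv_stream`, it could observe the freshly allocated zeros rather than the received data.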
I found the bug. The test failed because I ran with …
What happened + What you expected to happen
The test
`ray/python/ray/dag/tests/experimental/test_torch_tensor_dag.py::test_torch_tensor_exceptions[static_shape=True-direct_return=True-overlap_gpu_communication=True]`
fails locally.

The bug is probably that the buffer allocated on the receiver side of the NCCL P2P send/recv is read before the actual data is sent. Currently the receiver's buffer is all zeros, so the output is all zeros. When I changed the allocation function to allocate `torch.ones(...) * 100` instead, the actual output became `[100, ..., 100]` (see the sketch below).

An interesting finding is that when the code executes faster, this test always fails; but when I added a ton of print statements for debugging, it runs more slowly and the test sometimes passes.
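For anyone trying to confirm the same failure mode, here is a hypothetical version of the sentinel-fill trick described above (the function name `allocate_recv_buffer` is made up for illustration):

```python
import torch

SENTINEL = 100.0

def allocate_recv_buffer(shape, dtype=torch.float32, device="cuda"):
    # Fill with a sentinel instead of zeros so that a premature read is
    # distinguishable from a correctly received tensor that happens to be 0.
    return torch.ones(shape, dtype=dtype, device=device) * SENTINEL

buf = allocate_recv_buffer((8,))
# ... enqueue the NCCL recv into `buf` on the recv stream ...
# If the downstream actor observes [100., ..., 100.], it read the buffer
# before the recv actually completed.
```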
Since this test has `overlap_gpu_communication=True`, it is likely related to overlapping GPU communication with computation. My guess is that the actor reading the tensor did not properly wait for the recv event to finish.

I checked out the commit that most recently modified the test (#47586) as well as the current HEAD of the `ray-project:master` branch, and the test failed in both cases.

Below is an example error message:
Versions / Dependencies
The newest version of Ray (master). Python 3.9.
Reproduction script
https://github.com/ray-project/ray/blob/master/python/ray/dag/tests/experimental/test_torch_tensor_dag.py#L813
Issue Severity
High: It blocks me from completing my task.