
UCT/CUDA/CUDA_COPY: Relaxed CUDA context dependency in cuda_copy transport. #10564

Merged (1 commit) Mar 26, 2025

Conversation

rakhmets
Contributor

What?

Relaxed CUDA context dependency in cuda_copy transport.
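
The PR description is brief, so here is a hedged sketch of what relaxing the context dependency can look like in a CUDA transport (this is illustrative, not the actual PR code; only the CUDA driver API calls are real, and `uct_cuda_copy_do_memcpy` is a hypothetical name): instead of requiring the calling thread to already have a current CUDA context, the copy path can derive a context from the buffer itself via `cuPointerGetAttribute(CU_POINTER_ATTRIBUTE_CONTEXT)` and push/pop it around the operation.

```c
/* Hedged sketch (assumed approach, not the PR's actual code):
 * derive a usable CUDA context from the memory pointer instead of
 * relying on the caller having a current context. Requires CUDA. */
#include <cuda.h>

static CUresult uct_cuda_copy_do_memcpy(void *dst, const void *src, size_t len)
{
    CUcontext ctx;
    CUresult status;

    /* Query the context that owns the source buffer */
    status = cuPointerGetAttribute(&ctx, CU_POINTER_ATTRIBUTE_CONTEXT,
                                   (CUdeviceptr)src);
    if (status != CUDA_SUCCESS) {
        return status;
    }

    /* Make that context current only for the duration of the copy */
    status = cuCtxPushCurrent(ctx);
    if (status != CUDA_SUCCESS) {
        return status;
    }

    status = cuMemcpy((CUdeviceptr)dst, (CUdeviceptr)src, len);
    cuCtxPopCurrent(NULL);
    return status;
}
```

The design point is that the transport no longer depends on ambient per-thread context state, which is exactly the kind of assumption the new `test_p2p_no_current_cuda_ctx` tests below exercise.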

@rakhmets rakhmets force-pushed the topic/cuda-cpy-relax-ctx branch 6 times, most recently from d17709b to a287910 Compare March 21, 2025 17:50
@rakhmets rakhmets marked this pull request as ready for review March 21, 2025 17:52
@brminich
Contributor

The test failure is relevant:

2025-03-21T19:54:57.0727290Z [ RUN      ] cuda_copy/test_p2p_no_current_cuda_ctx.get_short/1 <cuda_copy/cuda/loopback>
2025-03-21T19:54:57.0731212Z /__w/1/s/contrib/../test/gtest/common/mem_buffer.cc:419: Failure
2025-03-21T19:54:57.0732107Z cudaMemset(buffer, c, length) failed: invalid argument: ptr=0x7f521e600003 value=0 count=1
2025-03-21T19:54:57.0738553Z [  FAILED  ] cuda_copy/test_p2p_no_current_cuda_ctx.get_short/1, where GetParam() = cuda_copy/cuda/loopback (2 ms)
2025-03-21T19:54:57.0739540Z [ RUN      ] cuda_copy/test_p2p_no_current_cuda_ctx.get_zcopy/1 <cuda_copy/cuda/loopback>
2025-03-21T19:54:57.0741762Z /__w/1/s/contrib/../test/gtest/common/mem_buffer.cc:419: Failure
2025-03-21T19:54:57.0742498Z cudaMemset(buffer, c, length) failed: invalid argument: ptr=0x7f521e600003 value=0 count=1
2025-03-21T19:54:57.0876641Z [  FAILED  ] cuda_copy/test_p2p_no_current_cuda_ctx.get_zcopy/1, where GetParam() = cuda_copy/cuda/loopback (1 ms)
2025-03-21T19:54:57.0878180Z [----------] 2 tests from cuda_copy/test_p2p_no_current_cuda_ctx (3 ms total)

@yosefe
Contributor

yosefe commented Mar 22, 2025

test failures should be fixed by #10571

@yosefe
Contributor

yosefe commented Mar 23, 2025

/azp run UCX PR


Azure Pipelines successfully started running 1 pipeline(s).

@brminich
Contributor

Can the failure be related?

dcx/test_ucp_tag_mem_type.reuse_buffers_mrail/27 <dc_x,cuda_copy,rocm_copy/cuda-managed:cuda,nogdr,offload>
[     INFO ] 0 1 16 128 1048512 1048580 4194324 [swx-rdmz-ucx-gpu-02:70030:0:70030] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xec451d48)
==== backtrace (tid:  70030) ====
 0  /__w/1/s/build-test/src/ucs/.libs/libucs.so.0(ucs_handle_error+0x12c) [0x7f242df2e3fc]
 1  /__w/1/s/build-test/src/ucs/.libs/libucs.so.0(+0x3c70c) [0x7f242df2e70c]
 2  /__w/1/s/build-test/src/ucs/.libs/libucs.so.0(+0x3c98b) [0x7f242df2e98b]

yosefe
yosefe previously approved these changes Mar 24, 2025
@rakhmets rakhmets force-pushed the topic/cuda-cpy-relax-ctx branch from 2e6f6a0 to edb4429 Compare March 24, 2025 14:55
@rakhmets rakhmets force-pushed the topic/cuda-cpy-relax-ctx branch 2 times, most recently from c8d2110 to 6506bb4 Compare March 24, 2025 15:34
@yosefe yosefe enabled auto-merge March 24, 2025 18:56
brminich
brminich previously approved these changes Mar 24, 2025
@rakhmets
Contributor Author

This is a relevant failure. The issue is in the test, not in the UCT changes.

[ RUN      ] cuda_copy/test_p2p_no_current_cuda_ctx.get_zcopy/1 <cuda_copy/cuda/loopback>
/__w/1/s/contrib/../test/gtest/common/mem_buffer.cc:419: Failure
cudaMemset(buffer, c, length) failed: invalid argument: ptr=0x7f0f42600000 value=0 count=1
[  FAILED  ] cuda_copy/test_p2p_no_current_cuda_ctx.get_zcopy/1, where GetParam() = cuda_copy/cuda/loopback (2 ms)

This is another failure in uct_dc_mlx5_ep_fence.

[ RUN      ] dcx/test_ucp_fence32.atomic_add_fadd/1 <dc_x/ep_based>
[swx-rdmz-ucx-gpu-02:90739:1:91105] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x11add508)
==== backtrace (tid:  91105) ====
0  /__w/1/s/build-test/src/ucs/.libs/libucs.so.0(ucs_handle_error+0x12c) [0x7f7cd239848c]
1  /__w/1/s/build-test/src/ucs/.libs/libucs.so.0(+0x3c79c) [0x7f7cd239879c]
2  /__w/1/s/build-test/src/ucs/.libs/libucs.so.0(+0x3ca1b) [0x7f7cd2398a1b]
3  /usr/lib64/libpthread.so.0(+0xf630) [0x7f7ccdda8630]
4  /__w/1/s/build-test/src/uct/ib/mlx5/.libs/libuct_ib_mlx5.so.0(uct_dc_mlx5_ep_fence+0x19) [0x7f7cd16ae919]
5  /__w/1/s/build-test/src/ucp/.libs/libucp.so.0(+0x91508) [0x7f7cd1be5508]
6  /__w/1/s/build-test/src/ucp/.libs/libucp.so.0(ucp_atomic_op_nbx+0x21db) [0x7f7cd1beea7b]
7  /__w/1/s/build-test/test/gtest/gtest() [0xafd043]
8  /__w/1/s/build-test/test/gtest/gtest() [0xafc98c]
9  /usr/lib64/libpthread.so.0(+0x7ea5) [0x7f7ccdda0ea5]
10  /usr/lib64/libc.so.6(clone+0x6d) [0x7f7ccce7bb0d]

@brminich
Contributor

I already saw this failure in #10559:

[ RUN      ] dcx/test_ucp_fence32.atomic_add_fadd/1 <dc_x/ep_based>
[swx-rdmz-ucx-gpu-02:90739:1:91105] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x11add508)
==== backtrace (tid:  91105) ====
 0  /__w/1/s/build-test/src/ucs/.libs/libucs.so.0(ucs_handle_error+0x12c) [0x7f7cd239848c]
 1  /__w/1/s/build-test/src/ucs/.libs/libucs.so.0(+0x3c79c) [0x7f7cd239879c]
 2  /__w/1/s/build-test/src/ucs/.libs/libucs.so.0(+0x3ca1b) [0x7f7cd2398a1b]
 3  /usr/lib64/libpthread.so.0(+0xf630) [0x7f7ccdda8630]
 4  /__w/1/s/build-test/src/uct/ib/mlx5/.libs/libuct_ib_mlx5.so.0(uct_dc_mlx5_ep_fence+0x19) [0x7f7cd16ae919]

@brminich
Contributor

But these two look related:

[----------] 2 tests from cuda_copy/test_p2p_no_current_cuda_ctx
[ RUN      ] cuda_copy/test_p2p_no_current_cuda_ctx.get_zcopy/1 <cuda_copy/cuda/loopback>
/__w/1/s/contrib/../test/gtest/common/mem_buffer.cc:419: Failure
cudaMemset(buffer, c, length) failed: invalid argument: ptr=0x7f0f42600000 value=0 count=1
[  FAILED  ] cuda_copy/test_p2p_no_current_cuda_ctx.get_zcopy/1, where GetParam() = cuda_copy/cuda/loopback (2 ms)
[ RUN      ] cuda_copy/test_p2p_no_current_cuda_ctx.get_short/1 <cuda_copy/cuda/loopback>
/__w/1/s/contrib/../test/gtest/common/mem_buffer.cc:419: Failure
cudaMemset(buffer, c, length) failed: invalid argument: ptr=0x7f0f42600000 value=0 count=1
[  FAILED  ] cuda_copy/test_p2p_no_current_cuda_ctx.get_short/1, where GetParam() = cuda_copy/cuda/loopback (1 ms)

@rakhmets
Contributor Author

> but these two look related
>
> [----------] 2 tests from cuda_copy/test_p2p_no_current_cuda_ctx
> [ RUN      ] cuda_copy/test_p2p_no_current_cuda_ctx.get_zcopy/1 <cuda_copy/cuda/loopback>
> /__w/1/s/contrib/../test/gtest/common/mem_buffer.cc:419: Failure
> cudaMemset(buffer, c, length) failed: invalid argument: ptr=0x7f0f42600000 value=0 count=1
> [  FAILED  ] cuda_copy/test_p2p_no_current_cuda_ctx.get_zcopy/1, where GetParam() = cuda_copy/cuda/loopback (2 ms)
> [ RUN      ] cuda_copy/test_p2p_no_current_cuda_ctx.get_short/1 <cuda_copy/cuda/loopback>
> /__w/1/s/contrib/../test/gtest/common/mem_buffer.cc:419: Failure
> cudaMemset(buffer, c, length) failed: invalid argument: ptr=0x7f0f42600000 value=0 count=1
> [  FAILED  ] cuda_copy/test_p2p_no_current_cuda_ctx.get_short/1, where GetParam() = cuda_copy/cuda/loopback (1 ms)

As I mentioned above, this is a test issue: the test executes cudaMemset in a separate thread without setting the CUDA device. I will move the result check outside the function.
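
For readers unfamiliar with the failure mode: the CUDA runtime's current device/context is per-thread state, so a worker thread spawned by a test must select the device itself before touching device memory. A hypothetical reduction of the described bug (illustrative only, not the actual mem_buffer.cc code) could look like this:

```cpp
// Hedged reduction of the described test bug (hypothetical code, not
// the actual mem_buffer.cc): the spawned thread must select the CUDA
// device itself, because the runtime's current device is per-thread.
// Requires a CUDA-capable machine to run.
#include <cuda_runtime.h>
#include <thread>

void memset_in_thread(void *buf, int value, size_t len, int device)
{
    std::thread t([=]() {
        // Without this call the new thread starts with default device
        // state; if buf belongs to another device's context, cudaMemset
        // can fail with "invalid argument", as seen in the CI log above.
        cudaSetDevice(device);
        cudaMemset(buf, value, len);
    });
    t.join();
}
```

The author's stated fix takes a different route: instead of making each worker thread set up device state, the result check is moved out of the threaded helper.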

@rakhmets rakhmets dismissed stale reviews from brminich and yosefe via f044fd1 March 25, 2025 10:15
@rakhmets rakhmets force-pushed the topic/cuda-cpy-relax-ctx branch from 6506bb4 to f044fd1 Compare March 25, 2025 10:15
@rakhmets
Copy link
Contributor Author

Disabled auto-merge to avoid conflicts with merging #10538.

@rakhmets rakhmets disabled auto-merge March 25, 2025 13:47
@rakhmets rakhmets force-pushed the topic/cuda-cpy-relax-ctx branch from ef36eb9 to 2938b57 Compare March 26, 2025 12:14
@yosefe yosefe enabled auto-merge March 26, 2025 13:33
@yosefe yosefe merged commit 97d0131 into openucx:master Mar 26, 2025
151 checks passed