[bug] nccl allgather hung #2560

Open
akhoroshev opened this issue Dec 11, 2024 · 4 comments


akhoroshev commented Dec 11, 2024

My version

A custom DeepSeek-like model, FP8, TP=4, served through the Executor API (launched with mpirun -n 4).

NCCL version: 2.22.3
I use a batched logits post-processor to control generation:

Main rank batched logits post-processor config (this one modifies the logits):

tensorrt_llm::executor::LogitsPostProcessorConfig logitsPostProcessorConfig(
    std::nullopt, LogitsProcessorBatchedAdaptor{mMainRankState->logitsProcessor, *this}, false);
executorConfig.setLogitsPostProcessorConfig(logitsPostProcessorConfig);

Config for the other ranks (the callback does nothing):

tensorrt_llm::executor::LogitsPostProcessorConfig logitsPostProcessorConfig(
    std::nullopt, [](auto&&...) {}, false);
executorConfig.setLogitsPostProcessorConfig(logitsPostProcessorConfig);

When running the stability test everything works fine for the first 20 hours, but after that time (or about 2 million requests) the server hangs.

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:04:00.0 Off |                    0 |
| N/A   38C    P0            152W /  700W |   75370MiB /  81559MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  |   00000000:23:00.0 Off |                    0 |
| N/A   34C    P0            129W /  700W |   75368MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  |   00000000:43:00.0 Off |                    0 |
| N/A   37C    P0            137W /  700W |   75368MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  |   00000000:64:00.0 Off |                    0 |
| N/A   36C    P0            135W /  700W |   75368MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A   2211609      C   /opt/wmcore/executor_server                 75358MiB |
|    1   N/A  N/A   2211610      C   /opt/wmcore/executor_server                 75358MiB |
|    2   N/A  N/A   2211611      C   /opt/wmcore/executor_server                 75358MiB |
|    3   N/A  N/A   2211612      C   /opt/wmcore/executor_server                 75358MiB |

I attached cuda-gdb to the main rank's PID to find the active kernel (the other ranks have no active kernels):

(cuda-gdb) info stack
#0  0x00007f64a9c1f480 in ncclDevFunc_AllGather_RING_SIMPLE() ()
#1  0x00007f601be4e600 in ncclDevKernel_AllGather_RING_LL(ncclDevKernelArgsStorage<4096ul>)<<<(24,1,1),(544,1,1)>>> ()
(cuda-gdb) info cuda kernels
  Kernel Parent Dev Grid Status                             SMs Mask  GridDim  BlockDim Invocation                        
*      0      -   0 5059392810 Active 0x000000000000000000000000000fff0fff (24,1,1) (544,1,1) ncclDevKernel_AllGather_RING_LL() 
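A quick way to read the SMs Mask column, assuming each set bit marks one SM on which the kernel is resident (an interpretation of the cuda-gdb output, not something confirmed in this thread). The helper name `countActiveSms` is hypothetical, and only the low 64 bits of the mask are passed in because the upper digits in the report are all zero:

```cpp
#include <bitset>
#include <cstdint>

// Hypothetical helper: counts the set bits in the low 64 bits of a
// cuda-gdb "SMs Mask" value. Assumes one bit per SM occupied by the kernel.
inline int countActiveSms(std::uint64_t smMaskLow64)
{
    return static_cast<int>(std::bitset<64>(smMaskLow64).count());
}
```

For the mask above, `countActiveSms(0xfff0fffULL)` gives 24, which matches GridDim (24,1,1): under that assumption every block of the AllGather kernel is still resident on its own SM.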

some meaningful backtraces

Thread 16 (Thread 0x7f5ff8ffd000 (LWP 2791236) "executor_server"):
#0  0x00007f6b84ad47aa in pthread_cond_timedwait@@GLIBC_2.3.2 () from /home/askhoroshev/wmcore/lib/libpthread.so.0
#1  0x00007f6bf41a13ee in tensorrt_llm::executor::Executor::Impl::awaitResponses(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > > const&) () from /home/askhoroshev/wmcore/lib/libtensorrt_llm.so
#2  0x00007f6bf4193ead in tensorrt_llm::executor::Executor::awaitResponses(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > > const&) () from /home/askhoroshev/wmcore/lib/libtensorrt_llm.so
#3  0x000000000045d85b in modules::executor_server::Executor::ExecutorImpl::awaitRoutine (this=0x24c3180, interruptToken=...) at /sources/contrib/tensorrt-llm/modules/executor_server/src/serverImpl.cpp:330
#4  0x00000000004639c0 in operator() (__closure=<optimized out>, __closure=<optimized out>, stopToken=...) at /sources/contrib/tensorrt-llm/modules/executor_server/src/serverImpl.cpp:252
#5  __invoke_impl<void, modules::executor_server::Executor::ExecutorImpl::ExecutorImpl(modules::executor_server::MainRank, const std::filesystem::__cxx11::path&, const modules::executor_server::ExecutorConfig&, const modules::executor_server::LogitsProcessorStaticConfig&)::<lambda(std::stop_token)>, std::stop_token> (__f=...) at /opt/rh/gcc-toolset-10/root/usr/include/c++/10/bits/invoke.h:60
#6  __invoke<modules::executor_server::Executor::ExecutorImpl::ExecutorImpl(modules::executor_server::MainRank, const std::filesystem::__cxx11::path&, const modules::executor_server::ExecutorConfig&, const modules::executor_server::LogitsProcessorStaticConfig&)::<lambda(std::stop_token)>, std::stop_token> (__fn=...) at /opt/rh/gcc-toolset-10/root/usr/include/c++/10/bits/invoke.h:95
#7  _M_invoke<0, 1> (this=<optimized out>) at /opt/rh/gcc-toolset-10/root/usr/include/c++/10/thread:264
#8  operator() (this=<optimized out>) at /opt/rh/gcc-toolset-10/root/usr/include/c++/10/thread:271
#9  _M_run (this=<optimized out>) at /opt/rh/gcc-toolset-10/root/usr/include/c++/10/thread:215
#10 0x00007f6bcdb2ca80 in execute_native_thread_routine () from /home/askhoroshev/wmcore/lib/libtensorrt_llm_nvrtc_wrapper.so
#11 0x00007f6b84ace1ca in start_thread () from /home/askhoroshev/wmcore/lib/libpthread.so.0
#12 0x00007f6b83e0be73 in clone () from /home/askhoroshev/wmcore/lib/libc.so.6

Thread 15 (Thread 0x7f5ff97fe000 (LWP 2791234) "executionLoop"):
#0  0x00007f6b86894e88 in ?? () from /home/askhoroshev/wmcore/lib/libcuda.so.1
#1  0x00007f6b86631833 in ?? () from /home/askhoroshev/wmcore/lib/libcuda.so.1
#2  0x00007f6b8699fb3f in ?? () from /home/askhoroshev/wmcore/lib/libcuda.so.1
#3  0x00007f6b8699fed5 in ?? () from /home/askhoroshev/wmcore/lib/libcuda.so.1
#4  0x00007f6b866382cc in ?? () from /home/askhoroshev/wmcore/lib/libcuda.so.1
#5  0x00007f6b8670117a in ?? () from /home/askhoroshev/wmcore/lib/libcuda.so.1
#6  0x00007f6b86970459 in ?? () from /home/askhoroshev/wmcore/lib/libcuda.so.1
#7  0x00007f6b867a58fd in ?? () from /home/askhoroshev/wmcore/lib/libcuda.so.1
#8  0x000000000049f575 in libcudart_static_141dba5462e92d2cffd1abc474df476c510a3a8c ()
#9  0x0000000000504248 in cudaStreamSynchronize ()
#10 0x00000000004606c7 in tensorrt_llm::runtime::CudaStream::synchronize (this=<optimized out>) at /sources/contrib/tensorrt-llm/cpp/include/tensorrt_llm/runtime/cudaStream.h:84
#11 modules::executor_server::FusedLogitsProcessor::process (bufferManager=..., stream=..., beamTokens=..., tensorPtrs=<synthetic pointer>..., logitsRequestStates=..., this=0x24c31c0) at /sources/contrib/tensorrt-llm/modules/executor_server/src/logitsProcessor.cpp:658
#12 modules::executor_server::Executor::ExecutorImpl::LogitsProcessorBatchedAdaptor::operator() (this=0x13cac3b0, internalIds=..., logitTensors=..., beamTokens=..., stream=..., userIds=...) at /sources/contrib/tensorrt-llm/modules/executor_server/src/serverImpl.cpp:238
#13 0x00007f6bf419b315 in std::_Function_handler<void (std::vector<unsigned long, std::allocator<unsigned long> > const&, std::vector<std::shared_ptr<tensorrt_llm::runtime::ITensor>, std::allocator<std::shared_ptr<tensorrt_llm::runtime::ITensor> > >&, std::vector<std::reference_wrapper<std::vector<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > > const>, std::allocator<std::reference_wrapper<std::vector<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > > const> > > const&, std::shared_ptr<tensorrt_llm::runtime::CudaStream> const&, std::vector<std::optional<unsigned long>, std::allocator<std::optional<unsigned long> > > const&), tensorrt_llm::executor::Executor::Impl::initializeLogitsPostProcessorBatched(tensorrt_llm::executor::LogitsPostProcessorConfig const&)::{lambda(std::vector<unsigned long, std::allocator<unsigned long> > const&, std::vector<std::shared_ptr<tensorrt_llm::runtime::ITensor>, std::allocator<std::shared_ptr<tensorrt_llm::runtime::ITensor> > >&, std::vector<std::reference_wrapper<std::vector<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > > const>, std::allocator<std::reference_wrapper<std::vector<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > > const> > > const&, std::shared_ptr<tensorrt_llm::runtime::CudaStream> const&, std::vector<std::optional<unsigned long>, std::allocator<std::optional<unsigned long> > > const&)#1}>::_M_invoke(std::_Any_data const&, std::vector<unsigned long, std::allocator<unsigned long> > const&, std::vector<std::shared_ptr<tensorrt_llm::runtime::ITensor>, std::allocator<std::shared_ptr<tensorrt_llm::runtime::ITensor> > >&, std::vector<std::reference_wrapper<std::vector<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > > const>, std::allocator<std::reference_wrapper<std::vector<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > > const> > > const&, std::shared_ptr<tensorrt_llm::runtime::CudaStream> const&, std::vector<std::optional<unsigned long>, std::allocator<std::optional<unsigned long> > > const&) () from /home/askhoroshev/wmcore/lib/libtensorrt_llm.so
#14 0x00007f6bf41793c4 in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::decoderStepAsync(tensorrt_llm::batch_manager::ScheduledRequests const&) () from /home/askhoroshev/wmcore/lib/libtensorrt_llm.so
#15 0x00007f6bf417cde5 in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) () from /home/askhoroshev/wmcore/lib/libtensorrt_llm.so
#16 0x00007f6bf41a6a71 in tensorrt_llm::executor::Executor::Impl::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) () from /home/askhoroshev/wmcore/lib/libtensorrt_llm.so
#17 0x00007f6bf41ab97f in tensorrt_llm::executor::Executor::Impl::executionLoop() () from /home/askhoroshev/wmcore/lib/libtensorrt_llm.so
#18 0x00007f6bcdb2ca80 in execute_native_thread_routine () from /home/askhoroshev/wmcore/lib/libtensorrt_llm_nvrtc_wrapper.so
#19 0x00007f6b84ace1ca in start_thread () from /home/askhoroshev/wmcore/lib/libpthread.so.0
#20 0x00007f6b83e0be73 in clone () from /home/askhoroshev/wmcore/lib/libc.so.6

Thread 14 (Thread 0x7f5ff9fff000 (LWP 2791231) "dataTransResp"):
#0  0x00007f6b84ad445c in pthread_cond_wait@@GLIBC_2.3.2 () from /home/askhoroshev/wmcore/lib/libpthread.so.0
#1  0x00007f6b847ed870 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /home/askhoroshev/wmcore/lib/libstdc++.so.6
#2  0x00007f6bf4110d33 in tensorrt_llm::batch_manager::DataResponder::Impl::response() () from /home/askhoroshev/wmcore/lib/libtensorrt_llm.so
#3  0x00007f6bf410ecdd in std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)(), tensorrt_llm::batch_manager::DataResponder::Impl*> >, void> >::_M_invoke(std::_Any_data const&) () from /home/askhoroshev/wmcore/lib/libtensorrt_llm.so
#4  0x00007f6bf410f2fb in std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*) () from /home/askhoroshev/wmcore/lib/libtensorrt_llm.so
#5  0x00007f6b84ad5e67 in __pthread_once_slow () from /home/askhoroshev/wmcore/lib/libpthread.so.0
#6  0x00007f6bf410facd in std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)(), tensorrt_llm::batch_manager::DataResponder::Impl*> >, void>::_Async_state_impl(std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)(), tensorrt_llm::batch_manager::DataResponder::Impl*> >&&)::{lambda()#1}> > >::_M_run() () from /home/askhoroshev/wmcore/lib/libtensorrt_llm.so
#7  0x00007f6bcdb2ca80 in execute_native_thread_routine () from /home/askhoroshev/wmcore/lib/libtensorrt_llm_nvrtc_wrapper.so
#8  0x00007f6b84ace1ca in start_thread () from /home/askhoroshev/wmcore/lib/libpthread.so.0
#9  0x00007f6b83e0be73 in clone () from /home/askhoroshev/wmcore/lib/libc.so.6

Thread 13 (Thread 0x7f601d51d000 (LWP 2791219) "executor_server"):
#0  0x00007f6b83ef5f41 in poll () from /home/askhoroshev/wmcore/lib/libc.so.6
#1  0x00007f6b880ed3ea in ncclProxyServiceUDS(void*) () from /home/askhoroshev/wmcore/lib/libnccl.so.2
#2  0x00007f6b84ace1ca in start_thread () from /home/askhoroshev/wmcore/lib/libpthread.so.0
#3  0x00007f6b83e0be73 in clone () from /home/askhoroshev/wmcore/lib/libc.so.6

Thread 12 (Thread 0x7f694cee8000 (LWP 2791218) "executor_server"):
#0  0x00007f6b83ef5f41 in poll () from /home/askhoroshev/wmcore/lib/libc.so.6
#1  0x00007f6b880eed32 in ncclProxyService(void*) () from /home/askhoroshev/wmcore/lib/libnccl.so.2
#2  0x00007f6b84ace1ca in start_thread () from /home/askhoroshev/wmcore/lib/libpthread.so.0
#3  0x00007f6b83e0be73 in clone () from /home/askhoroshev/wmcore/lib/libc.so.6

Thread 11 (Thread 0x7f602cffd000 (LWP 2791205) "executor_server"):
#0  0x00007f6b84ad7ab4 in read () from /home/askhoroshev/wmcore/lib/libpthread.so.0
#1  0x00007f6b72cabfa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007f6b8811f042 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /home/askhoroshev/wmcore/lib/libnccl.so.2
#3  0x00007f6b88141b54 in ncclIbAsyncThreadMain(void*) () from /home/askhoroshev/wmcore/lib/libnccl.so.2
#4  0x00007f6b84ace1ca in start_thread () from /home/askhoroshev/wmcore/lib/libpthread.so.0
#5  0x00007f6b83e0be73 in clone () from /home/askhoroshev/wmcore/lib/libc.so.6

Thread 10 (Thread 0x7f602d7fe000 (LWP 2791202) "executor_server"):
#0  0x00007f6b84ad7ab4 in read () from /home/askhoroshev/wmcore/lib/libpthread.so.0
#1  0x00007f6b72cabfa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007f6b8811f042 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /home/askhoroshev/wmcore/lib/libnccl.so.2
#3  0x00007f6b88141b54 in ncclIbAsyncThreadMain(void*) () from /home/askhoroshev/wmcore/lib/libnccl.so.2
#4  0x00007f6b84ace1ca in start_thread () from /home/askhoroshev/wmcore/lib/libpthread.so.0
#5  0x00007f6b83e0be73 in clone () from /home/askhoroshev/wmcore/lib/libc.so.6

Thread 9 (Thread 0x7f602dfff000 (LWP 2791199) "executor_server"):
#0  0x00007f6b84ad7ab4 in read () from /home/askhoroshev/wmcore/lib/libpthread.so.0
#1  0x00007f6b72cabfa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007f6b8811f042 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /home/askhoroshev/wmcore/lib/libnccl.so.2
#3  0x00007f6b88141b54 in ncclIbAsyncThreadMain(void*) () from /home/askhoroshev/wmcore/lib/libnccl.so.2
#4  0x00007f6b84ace1ca in start_thread () from /home/askhoroshev/wmcore/lib/libpthread.so.0
#5  0x00007f6b83e0be73 in clone () from /home/askhoroshev/wmcore/lib/libc.so.6

Thread 8 (Thread 0x7f64aaedb000 (LWP 2791197) "executor_server"):
#0  0x00007f6b84ad7ab4 in read () from /home/askhoroshev/wmcore/lib/libpthread.so.0
#1  0x00007f6b72cabfa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007f6b8811f042 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /home/askhoroshev/wmcore/lib/libnccl.so.2
#3  0x00007f6b88141b54 in ncclIbAsyncThreadMain(void*) () from /home/askhoroshev/wmcore/lib/libnccl.so.2
#4  0x00007f6b84ace1ca in start_thread () from /home/askhoroshev/wmcore/lib/libpthread.so.0
#5  0x00007f6b83e0be73 in clone () from /home/askhoroshev/wmcore/lib/libc.so.6

Thread 7 (Thread 0x7f64ab6dc000 (LWP 2791195) "executor_server"):
#0  0x00007f6b84ad7ab4 in read () from /home/askhoroshev/wmcore/lib/libpthread.so.0
#1  0x00007f6b72cabfa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007f6b8811f042 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /home/askhoroshev/wmcore/lib/libnccl.so.2
#3  0x00007f6b88141b54 in ncclIbAsyncThreadMain(void*) () from /home/askhoroshev/wmcore/lib/libnccl.so.2
#4  0x00007f6b84ace1ca in start_thread () from /home/askhoroshev/wmcore/lib/libpthread.so.0
#5  0x00007f6b83e0be73 in clone () from /home/askhoroshev/wmcore/lib/libc.so.6

Thread 6 (Thread 0x7f64abedd000 (LWP 2791192) "executor_server"):
#0  0x00007f6b84ad7ab4 in read () from /home/askhoroshev/wmcore/lib/libpthread.so.0
#1  0x00007f6b72cabfa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007f6b8811f042 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /home/askhoroshev/wmcore/lib/libnccl.so.2
#3  0x00007f6b88141b54 in ncclIbAsyncThreadMain(void*) () from /home/askhoroshev/wmcore/lib/libnccl.so.2
#4  0x00007f6b84ace1ca in start_thread () from /home/askhoroshev/wmcore/lib/libpthread.so.0
#5  0x00007f6b83e0be73 in clone () from /home/askhoroshev/wmcore/lib/libc.so.6

Thread 5 (Thread 0x7f694efde000 (LWP 2790932) "cuda-EvtHandlr"):
#0  0x00007f6b84ad445c in pthread_cond_wait@@GLIBC_2.3.2 () from /home/askhoroshev/wmcore/lib/libpthread.so.0
#1  0x00007f52129f9d6a in ?? () from /lib64/libcudadebugger.so.1
#2  0x00007f52129f7e60 in ?? () from /lib64/libcudadebugger.so.1
#3  0x00007f52125cece5 in ?? () from /lib64/libcudadebugger.so.1
#4  0x00007f5212623601 in ?? () from /lib64/libcudadebugger.so.1
#5  0x00007f52125b4e2c in ?? () from /lib64/libcudadebugger.so.1
#6  0x00007f5212711526 in ?? () from /lib64/libcudadebugger.so.1
#7  0x00007f6b8696cffb in ?? () from /home/askhoroshev/wmcore/lib/libcuda.so.1
#8  0x00007f6b867ee6a4 in ?? () from /home/askhoroshev/wmcore/lib/libcuda.so.1
#9  0x00007f6b86721ee3 in ?? () from /home/askhoroshev/wmcore/lib/libcuda.so.1
#10 0x00007f6b84ace1ca in start_thread () from /home/askhoroshev/wmcore/lib/libpthread.so.0
#11 0x00007f6b83e0be73 in clone () from /home/askhoroshev/wmcore/lib/libc.so.6

Thread 4 (Thread 0x7f6b6eb7f000 (LWP 2790911) "cuda00006000019"):
#0  0x00007f6b83ef5f41 in poll () from /home/askhoroshev/wmcore/lib/libc.so.6
#1  0x00007f6b8672a1ef in ?? () from /home/askhoroshev/wmcore/lib/libcuda.so.1
#2  0x00007f6b867ee64f in ?? () from /home/askhoroshev/wmcore/lib/libcuda.so.1
#3  0x00007f6b86721ee3 in ?? () from /home/askhoroshev/wmcore/lib/libcuda.so.1
#4  0x00007f6b84ace1ca in start_thread () from /home/askhoroshev/wmcore/lib/libpthread.so.0
#5  0x00007f6b83e0be73 in clone () from /home/askhoroshev/wmcore/lib/libc.so.6

Is this related to NVIDIA/nccl#311?

akhoroshev changed the title from "nccl hung" to "[bug] nccl hung" on Dec 15, 2024

akhoroshev commented

Are there any places inside Executor that call NCCL AllGather, other than the calls TensorRT makes into AllgatherPlugin?

@byshiue

akhoroshev changed the title from "[bug] nccl hung" to "[bug] nccl allgather hung" on Dec 23, 2024

akhoroshev commented

I inserted a stream synchronization right after the NCCL AllGather call in AllgatherPlugin:

    void ncclStreamSynchronize(cudaStream_t stream, ncclComm_t comm) {
        cudaError_t cudaErr;
        ncclResult_t ncclErr, ncclAsyncErr;
        while (1) {
            cudaErr = cudaStreamQuery(stream);
            if (cudaErr == cudaSuccess)
                return;

            if (cudaErr != cudaErrorNotReady) {
                TLLM_CUDA_CHECK(cudaErr);
            }

            ncclErr = ncclCommGetAsyncError(comm, &ncclAsyncErr);
            if (ncclErr != ncclSuccess) {
                NCCLCHECK(ncclErr);
            }

            if (ncclAsyncErr != ncclSuccess) {
                NCCLCHECK(ncclAsyncErr);
            }

            // We might want to let other threads (including NCCL threads) use the CPU.
            std::this_thread::yield();
        }
    }
     TLLM_CHECK_WITH_INFO(mNcclComm.get() != nullptr, "mNcclComm should be initialized before used");
     NCCLCHECK(ncclAllGather(inputs[0], outputs[0], size, (*getDtypeMap())[inputDesc[0].type], *mNcclComm, stream));
+    ncclStreamSynchronize(stream, *mNcclComm);
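The polling loop above spins forever if the kernel never completes, which is exactly the hang case. A minimal sketch of the same loop with a deadline follows, so that a stuck collective becomes a reportable condition instead of a silent freeze. It is deliberately library-free: `pollWithDeadline`, `WaitResult`, and the two callbacks are hypothetical names, where `isDone` stands in for `cudaStreamQuery(stream) == cudaSuccess` and `hasError` for the `cudaStreamQuery` / `ncclCommGetAsyncError` failure checks.

```cpp
#include <chrono>
#include <functional>
#include <thread>

enum class WaitResult { Done, Failed, TimedOut };

// Hypothetical helper, not part of TensorRT-LLM or NCCL.
// Polls until the work finishes, an error is reported, or the deadline passes.
inline WaitResult pollWithDeadline(std::function<bool()> const& isDone,
    std::function<bool()> const& hasError, std::chrono::milliseconds timeout)
{
    auto const deadline = std::chrono::steady_clock::now() + timeout;
    while (true)
    {
        if (isDone())
        {
            return WaitResult::Done;
        }
        if (hasError())
        {
            return WaitResult::Failed;
        }
        if (std::chrono::steady_clock::now() >= deadline)
        {
            return WaitResult::TimedOut;
        }
        // Let other threads (including NCCL proxy threads) use the CPU.
        std::this_thread::yield();
    }
}
```

On `WaitResult::TimedOut` a caller could log per-rank state before deciding what to do next, for example aborting the communicator with `ncclCommAbort`. Whether that is appropriate here is an assumption and depends on how the Executor handles communicator failures.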

It froze again, this time after about 3 million requests.

backtraces

rank0

#0  0x00007f11879d3b60 in ncclDevFunc_AllGather_RING_SIMPLE() ()
#1  0x00007f0867fcd000 in ncclDevKernel_AllGather_RING_LL(ncclDevKernelArgsStorage<4096ul>)<<<(24,1,1),(544,1,1)>>> ()
Thread 16 (Thread 0x7f083d7fe000 (LWP 1705707) "executionLoop"):
#0  0x00007f13b2a0541d in syscall () from /lib64/libc.so.6
#1  0x00007f07ca5bf730 in ?? () from /lib64/libcudadebugger.so.1
#2  0x00007f07ca158fd5 in ?? () from /lib64/libcudadebugger.so.1
#3  0x00007f07ca274bd6 in ?? () from /lib64/libcudadebugger.so.1
#4  0x00007f13b537fd76 in ?? () from /lib64/libcuda.so.1
#5  0x00007f1431c66745 in cudaStreamQuery () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so
#6  0x00007f1431bdb43b in tensorrt_llm::plugins::AllgatherPlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so
#7  0x00007f13c73d6fec in ?? () from /usr/local/tensorrt/targets/x86_64-linux-gnu/lib/libnvinfer.so.10
#8  0x00007f13c7347a55 in ?? () from /usr/local/tensorrt/targets/x86_64-linux-gnu/lib/libnvinfer.so.10
#9  0x00007f13c7349609 in ?? () from /usr/local/tensorrt/targets/x86_64-linux-gnu/lib/libnvinfer.so.10
#10 0x00007f14162d93a4 in tensorrt_llm::runtime::TllmRuntime::executeContext(int) const () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#11 0x00007f14166c508b in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeContext(int, int) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#12 0x00007f14166d40b6 in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeStep(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, int) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#13 0x00007f14166d48de in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(tensorrt_llm::batch_manager::ScheduledRequests const&) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#14 0x00007f14166d501b in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#15 0x00007f141676a4a1 in tensorrt_llm::executor::Executor::Impl::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#16 0x00007f14167715bc in tensorrt_llm::executor::Executor::Impl::executionLoop() () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#17 0x00007f1400042930 in execute_native_thread_routine () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/x86_64-linux-gnu/libtensorrt_llm_nvrtc_wrapper.so
#18 0x00007f13b36d91ca in start_thread () from /lib64/libpthread.so.0
#19 0x00007f13b2a058d3 in clone () from /lib64/libc.so.6

rank1

Thread 16 (Thread 0x7f997d9fa000 (LWP 1705706)):
#0  0x00007fa4e83c7b8b in sched_yield () from /lib64/libc.so.6
#1  0x00007fa56759e433 in tensorrt_llm::plugins::AllgatherPlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so
#2  0x00007fa4fcd99fec in ?? () from /usr/local/tensorrt/targets/x86_64-linux-gnu/lib/libnvinfer.so.10
#3  0x00007fa4fcd0aa55 in ?? () from /usr/local/tensorrt/targets/x86_64-linux-gnu/lib/libnvinfer.so.10
#4  0x00007fa4fcd0c609 in ?? () from /usr/local/tensorrt/targets/x86_64-linux-gnu/lib/libnvinfer.so.10
#5  0x00007fa54bc9c3a4 in tensorrt_llm::runtime::TllmRuntime::executeContext(int) const () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#6  0x00007fa54c08808b in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeContext(int, int) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#7  0x00007fa54c0970b6 in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeStep(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, int) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#8  0x00007fa54c0978de in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(tensorrt_llm::batch_manager::ScheduledRequests const&) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#9  0x00007fa54c09801b in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#10 0x00007fa54c12d4a1 in tensorrt_llm::executor::Executor::Impl::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#11 0x00007fa54c1345bc in tensorrt_llm::executor::Executor::Impl::executionLoop() () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#12 0x00007fa535a05930 in execute_native_thread_routine () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/x86_64-linux-gnu/libtensorrt_llm_nvrtc_wrapper.so
#13 0x00007fa4e909c1ca in start_thread () from /lib64/libpthread.so.0
#14 0x00007fa4e83c88d3 in clone () from /lib64/libc.so.6

rank2

Thread 16 (Thread 0x7ef99effd000 (LWP 1705705) "executionLoop"):
#0  0x00007f050ddfb41d in syscall () from /lib64/libc.so.6
#1  0x00007ef90905c730 in ?? () from /lib64/libcudadebugger.so.1
#2  0x00007ef908bf5f8c in ?? () from /lib64/libcudadebugger.so.1
#3  0x00007ef908d11bd6 in ?? () from /lib64/libcudadebugger.so.1
#4  0x00007f0510775d76 in ?? () from /lib64/libcuda.so.1
#5  0x00007f058d05c785 in cudaStreamQuery () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so
#6  0x00007f058cfd143b in tensorrt_llm::plugins::AllgatherPlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so
#7  0x00007f05227ccfec in ?? () from /usr/local/tensorrt/targets/x86_64-linux-gnu/lib/libnvinfer.so.10
#8  0x00007f052273da55 in ?? () from /usr/local/tensorrt/targets/x86_64-linux-gnu/lib/libnvinfer.so.10
#9  0x00007f052273f609 in ?? () from /usr/local/tensorrt/targets/x86_64-linux-gnu/lib/libnvinfer.so.10
#10 0x00007f05716cf3a4 in tensorrt_llm::runtime::TllmRuntime::executeContext(int) const () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#11 0x00007f0571abb08b in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeContext(int, int) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#12 0x00007f0571aca0b6 in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeStep(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, int) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#13 0x00007f0571aca8de in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(tensorrt_llm::batch_manager::ScheduledRequests const&) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#14 0x00007f0571acb01b in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#15 0x00007f0571b604a1 in tensorrt_llm::executor::Executor::Impl::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#16 0x00007f0571b675bc in tensorrt_llm::executor::Executor::Impl::executionLoop() () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#17 0x00007f055b438930 in execute_native_thread_routine () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/x86_64-linux-gnu/libtensorrt_llm_nvrtc_wrapper.so
#18 0x00007f050eacf1ca in start_thread () from /lib64/libpthread.so.0
#19 0x00007f050ddfb8d3 in clone () from /lib64/libc.so.6

rank3

Thread 16 (Thread 0x7f083d7fe000 (LWP 1705707) "executionLoop"):
#0  0x00007f13b2a0541d in syscall () from /lib64/libc.so.6
#1  0x00007f07ca5bf730 in ?? () from /lib64/libcudadebugger.so.1
#2  0x00007f07ca1585ec in ?? () from /lib64/libcudadebugger.so.1
#3  0x00007f07ca274bd6 in ?? () from /lib64/libcudadebugger.so.1
#4  0x00007f13b524121f in ?? () from /lib64/libcuda.so.1
#5  0x00007f1431c02db5 in libcudart_static_7304beb15f17907bccdc6b7603b23d2a2951bc68 () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so
#6  0x00007f1431c6674d in cudaStreamQuery () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so
#7  0x00007f1431bdb43b in tensorrt_llm::plugins::AllgatherPlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so
#8  0x00007f13c73d6fec in ?? () from /usr/local/tensorrt/targets/x86_64-linux-gnu/lib/libnvinfer.so.10
#9  0x00007f13c7347a55 in ?? () from /usr/local/tensorrt/targets/x86_64-linux-gnu/lib/libnvinfer.so.10
#10 0x00007f13c7349609 in ?? () from /usr/local/tensorrt/targets/x86_64-linux-gnu/lib/libnvinfer.so.10
#11 0x00007f14162d93a4 in tensorrt_llm::runtime::TllmRuntime::executeContext(int) const () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#12 0x00007f14166c508b in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeContext(int, int) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#13 0x00007f14166d40b6 in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeStep(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, int) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#14 0x00007f14166d48de in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(tensorrt_llm::batch_manager::ScheduledRequests const&) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#15 0x00007f14166d501b in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#16 0x00007f141676a4a1 in tensorrt_llm::executor::Executor::Impl::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#17 0x00007f14167715bc in tensorrt_llm::executor::Executor::Impl::executionLoop() () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#18 0x00007f1400042930 in execute_native_thread_routine () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/x86_64-linux-gnu/libtensorrt_llm_nvrtc_wrapper.so
#19 0x00007f13b36d91ca in start_thread () from /lib64/libpthread.so.0
#20 0x00007f13b2a058d3 in clone () from /lib64/libc.so.6

trtllm version: https://github.com/NVIDIA/TensorRT-LLM/tree/340a1b62fc7a4124d6efef10df695b263734ab2f
Driver Version: 560.35.05
CUDA Version: 12.6
nccl version: 2.22.3

@akhoroshev
Contributor Author

akhoroshev commented Dec 26, 2024

--- a/cpp/tensorrt_llm/plugins/ncclPlugin/allgatherPlugin.cpp
+++ b/cpp/tensorrt_llm/plugins/ncclPlugin/allgatherPlugin.cpp
@@ -23,6 +23,34 @@ using namespace nvinfer1;
 using tensorrt_llm::plugins::AllgatherPluginCreator;
 using tensorrt_llm::plugins::AllgatherPlugin;
 
+namespace {
+    void ncclStreamSynchronize(cudaStream_t stream, ncclComm_t comm) {
+        cudaError_t cudaErr;
+        ncclResult_t ncclErr, ncclAsyncErr;
+        while (1) {
+            cudaErr = cudaStreamQuery(stream);
+            if (cudaErr == cudaSuccess)
+                return;
+
+            if (cudaErr != cudaErrorNotReady) {
+                TLLM_CUDA_CHECK(cudaErr);
+            }
+
+            ncclErr = ncclCommGetAsyncError(comm, &ncclAsyncErr);
+            if (ncclErr != ncclSuccess) {
+                NCCLCHECK(ncclErr);
+            }
+
+            if (ncclAsyncErr != ncclSuccess) {
+                NCCLCHECK(ncclAsyncErr);
+            }
+
+            // We might want to let other threads (including NCCL threads) use the CPU.
+            std::this_thread::yield();
+        }
+    }
+}
+
 static char const* ALLGATHER_PLUGIN_VERSION{"1"};
 static char const* ALLGATHER_PLUGIN_NAME{"AllGather"};
 PluginFieldCollection AllgatherPluginCreator::mFC{};
@@ -101,8 +129,14 @@ int AllgatherPlugin::enqueue(nvinfer1::PluginTensorDesc const* inputDesc, nvinfe
         size *= inputDesc[0].dims.d[i];
     }
 
+    TLLM_LOG_WARNING("rank: %d start enqueue", COMM_SESSION.getRank());
+    ncclStreamSynchronize(stream, *mNcclComm);
+    TLLM_LOG_WARNING("rank: %d start ncclAllGather, size: %zu, dtype: %d", COMM_SESSION.getRank(), size, static_cast<int>(inputDesc[0].type));
     TLLM_CHECK_WITH_INFO(mNcclComm.get() != nullptr, "mNcclComm should be initialized before used");
     NCCLCHECK(ncclAllGather(inputs[0], outputs[0], size, (*getDtypeMap())[inputDesc[0].type], *mNcclComm, stream));
+    TLLM_LOG_WARNING("rank: %d end ncclAllGather", COMM_SESSION.getRank());
+    ncclStreamSynchronize(stream, *mNcclComm);
+    TLLM_LOG_WARNING("rank: %d end enqueue", COMM_SESSION.getRank());
 
     return 0;
 }
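
The `ncclStreamSynchronize` helper in the patch above loops until the stream completes or an error is reported, so when the kernel never finishes it never returns. Below is a minimal sketch of the same polling pattern with a deadline added, so a stuck collective could be reported instead of blocking forever. `PollStatus`, `WaitResult` and `waitWithDeadline` are hypothetical names used only for illustration. The CUDA and NCCL calls are hidden behind a callable so the snippet builds without either library; in the plugin that callable would presumably wrap `cudaStreamQuery` and `ncclCommGetAsyncError`.

```cpp
#include <chrono>
#include <thread>

// Outcome of a single poll of the underlying work (hypothetical enum for this sketch).
enum class PollStatus { Ready, NotReady, Error };

// Outcome of the bounded wait (hypothetical enum for this sketch).
enum class WaitResult { Completed, Failed, TimedOut };

// Polls `query` until it reports Ready or Error, or until `timeout` has elapsed.
// `query` is any callable returning PollStatus; it stands in for the
// cudaStreamQuery + ncclCommGetAsyncError pair used in the patch.
template <typename Query>
WaitResult waitWithDeadline(Query&& query, std::chrono::milliseconds timeout)
{
    auto const deadline = std::chrono::steady_clock::now() + timeout;
    while (true)
    {
        switch (query())
        {
        case PollStatus::Ready: return WaitResult::Completed;
        case PollStatus::Error: return WaitResult::Failed;
        case PollStatus::NotReady: break;
        }
        if (std::chrono::steady_clock::now() >= deadline)
        {
            return WaitResult::TimedOut;
        }
        // Let other threads (including NCCL proxy threads) use the CPU.
        std::this_thread::yield();
    }
}
```

On `TimedOut` the caller could log the rank and the collective's arguments before aborting, which would at least make a hang like this one visible without attaching a debugger.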

Hang logs

...
[TensorRT-LLM][WARNING] rank: 1 start enqueue
[TensorRT-LLM][WARNING] rank: 2 start enqueue
[TensorRT-LLM][WARNING] rank: 3 start enqueue
[TensorRT-LLM][WARNING] rank: 0 start enqueue
[TensorRT-LLM][WARNING] rank: 3 start ncclAllGather, size: 1955904, dtype: 1
[TensorRT-LLM][WARNING] rank: 3 end ncclAllGather
[TensorRT-LLM][WARNING] rank: 2 start ncclAllGather, size: 1955904, dtype: 1
[TensorRT-LLM][WARNING] rank: 2 end ncclAllGather
[TensorRT-LLM][WARNING] rank: 1 start ncclAllGather, size: 1955904, dtype: 1
[TensorRT-LLM][WARNING] rank: 1 end ncclAllGather
[TensorRT-LLM][WARNING] rank: 0 start ncclAllGather, size: 1955904, dtype: 1
[TensorRT-LLM][WARNING] rank: 0 end ncclAllGather
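
In the excerpt above every rank logs `end ncclAllGather` and none logs `end enqueue`, which appears consistent with the backtraces below where `executionLoop` sits in `ncclStreamSynchronize`. With longer logs this is tedious to check by eye, so here is a small sketch that reports the last logged stage per rank. `lastStagePerRank` is a hypothetical helper, and it assumes the exact `rank: N <stage>` format produced by the `TLLM_LOG_WARNING` calls in the patch.

```cpp
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>

// Returns the last stage logged by each rank, e.g. {0: "end ncclAllGather"}.
// Assumes lines of the form "[TensorRT-LLM][WARNING] rank: N <stage>[, details]".
std::map<int, std::string> lastStagePerRank(std::string const& log)
{
    std::map<int, std::string> last;
    std::string const marker = "rank: ";
    std::istringstream in(log);
    std::string line;
    while (std::getline(in, line))
    {
        auto const pos = line.find(marker);
        if (pos == std::string::npos)
        {
            continue; // not one of the instrumented log lines
        }
        std::size_t used = 0;
        int rank = 0;
        try
        {
            rank = std::stoi(line.substr(pos + marker.size()), &used);
        }
        catch (std::exception const&)
        {
            continue; // no parsable rank number after the marker
        }
        // The stage text follows the rank number; drop ", size: ..." details.
        std::string stage = line.substr(pos + marker.size() + used);
        stage = stage.substr(0, stage.find(','));
        auto const first = stage.find_first_not_of(' ');
        if (first == std::string::npos)
        {
            continue; // nothing after the rank number
        }
        auto const lastChar = stage.find_last_not_of(" \r");
        last[rank] = stage.substr(first, lastChar - first + 1);
    }
    return last;
}
```

Any rank whose last stage differs from the others is the first place to look; if all ranks stop at the same stage, as here, the stall is more likely inside the collective itself.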

cuda-gdb rank0

(cuda-gdb) info cuda kernels  
  Kernel Parent Dev Grid Status                             SMs Mask  GridDim  BlockDim Invocation                        
*      0      -   0 4347595797 Active 0x000000000000000000000000000fff0fff (24,1,1) (544,1,1) ncclDevKernel_AllGather_RING_LL() 
(cuda-gdb) bt                 
#0  0x00007fce839d3b40 in ncclDevFunc_AllGather_RING_SIMPLE() ()
#1  0x00007fc563fcd000 in ncclDevKernel_AllGather_RING_LL(ncclDevKernelArgsStorage<4096ul>)<<<(24,1,1),(544,1,1)>>> ()
(cuda-gdb) thread apply all bt

Thread 20 (Thread 0x7fc4affff000 (LWP 3318470) "executor_server"):
#0  0x00007fd0b01b8307 in epoll_wait () from /lib64/libc.so.6
#1  0x00007fc4f8578984 in ?? () from /lib64/libcudadebugger.so.1
#2  0x00007fc4f8579462 in ?? () from /lib64/libcudadebugger.so.1
#3  0x00007fc4f8577610 in ?? () from /lib64/libcudadebugger.so.1
#4  0x00007fc4f8580aba in ?? () from /lib64/libcudadebugger.so.1
#5  0x00007fc4f8580fbc in ?? () from /lib64/libcudadebugger.so.1
#6  0x00007fc4f8466020 in ?? () from /lib64/libcudadebugger.so.1
#7  0x00007fc4f85c0997 in ?? () from /lib64/libcudadebugger.so.1
#8  0x00007fd0b0d861ca in start_thread () from /lib64/libpthread.so.0
#9  0x00007fd0b00b28d3 in clone () from /lib64/libc.so.6

Thread 18 (Thread 0x7fc539fff000 (LWP 3913243) "executor_server"):
#0  0x00007fd0b0d8c48c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fd0b0aa5870 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
#2  0x00000000004370b6 in writerWorker(modules::util::FileDescriptor, std::shared_ptr<modules::util::MTQueue<modules::executor_server::Response> >) ()
#3  0x0000000000440836 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (*)(modules::util::FileDescriptor, std::shared_ptr<modules::util::MTQueue<modules::executor_server::Response> >), modules::util::FileDescriptor, 
std::shared_ptr<modules::util::MTQueue<modules::executor_server::Response> > > > >::_M_run() ()
#4  0x00007fd0fd6ef930 in execute_native_thread_routine () from 
/home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/x86_64-linux-gnu/libtensorrt_llm_nvrtc_wrapper.so
#5  0x00007fd0b0d861ca in start_thread () from /lib64/libpthread.so.0
#6  0x00007fd0b00b28d3 in clone () from /lib64/libc.so.6

Thread 17 (Thread 0x7fc5408fc000 (LWP 3913242) "executor_server"):
#0  0x00007fd0b0d8c7da in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fd113e0fbff in tensorrt_llm::executor::Executor::Impl::awaitResponses(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > > const&) () from 
/home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#2  0x00007fd113dfffdd in tensorrt_llm::executor::Executor::awaitResponses(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > > const&) () from 
/home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#3  0x0000000000466f18 in modules::executor_server::Executor::ExecutorImpl::awaitRoutine(std::stop_token) ()
#4  0x0000000000468100 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<modules::executor_server::Executor::ExecutorImpl::ExecutorImpl(modules::executor_server::MainRank, std::filesystem::__cxx11::path const&, 
modules::executor_server::ExecutorConfig const&, modules::executor_server::LogitsProcessorStaticConfig const&)::{lambda(std::stop_token)#1}, std::stop_token> > >::_M_run() ()
#5  0x00007fd0fd6ef930 in execute_native_thread_routine () from 
/home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/x86_64-linux-gnu/libtensorrt_llm_nvrtc_wrapper.so
#6  0x00007fd0b0d861ca in start_thread () from /lib64/libpthread.so.0
#7  0x00007fd0b00b28d3 in clone () from /lib64/libc.so.6

Thread 16 (Thread 0x7fc5410fd000 (LWP 3913240) "executionLoop"):
#0  0x00007fd0b00b241d in syscall () from /lib64/libc.so.6
#1  0x00007fc4f85bf730 in ?? () from /lib64/libcudadebugger.so.1
#2  0x00007fc4f81585ec in ?? () from /lib64/libcudadebugger.so.1
#3  0x00007fc4f8274bd6 in ?? () from /lib64/libcudadebugger.so.1
#4  0x00007fd0b28ee21f in ?? () from /lib64/libcuda.so.1
#5  0x00007fd12f2b0135 in libcudart_static_7304beb15f17907bccdc6b7603b23d2a2951bc68 () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so
#6  0x00007fd12f313acd in cudaStreamQuery () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so
#7  0x00007fd12f28752b in (anonymous namespace)::ncclStreamSynchronize(CUstream_st*, ncclComm*) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so
#8  0x00007fd12f28853c in tensorrt_llm::plugins::AllgatherPlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) () from 
/home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so
#9  0x00007fd0c4a83fec in ?? () from /usr/local/tensorrt/targets/x86_64-linux-gnu/lib/libnvinfer.so.10
#10 0x00007fd0c49f4a55 in ?? () from /usr/local/tensorrt/targets/x86_64-linux-gnu/lib/libnvinfer.so.10
#11 0x00007fd0c49f6609 in ?? () from /usr/local/tensorrt/targets/x86_64-linux-gnu/lib/libnvinfer.so.10
#12 0x00007fd1139863a4 in tensorrt_llm::runtime::TllmRuntime::executeContext(int) const () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#13 0x00007fd113d7208b in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeContext(int, int) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#14 0x00007fd113d810b6 in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeStep(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > 
> const&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, int) () from 
/home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#15 0x00007fd113d818de in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(tensorrt_llm::batch_manager::ScheduledRequests const&) () from 
/home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#16 0x00007fd113d8201b in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, 
std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#17 0x00007fd113e174a1 in tensorrt_llm::executor::Executor::Impl::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) () 
from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#18 0x00007fd113e1e5bc in tensorrt_llm::executor::Executor::Impl::executionLoop() () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#19 0x00007fd0fd6ef930 in execute_native_thread_routine () from 
/home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/x86_64-linux-gnu/libtensorrt_llm_nvrtc_wrapper.so
#20 0x00007fd0b0d861ca in start_thread () from /lib64/libpthread.so.0
#21 0x00007fd0b00b28d3 in clone () from /lib64/libc.so.6

Thread 15 (Thread 0x7fc5418fe000 (LWP 3913237) "dataTransResp"):
#0  0x00007fd0b0d8c48c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fd0b0aa5870 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
#2  0x00007fd113cfba6b in tensorrt_llm::batch_manager::DataResponder::Impl::response() () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#3  0x00007fd113cf9c3d in std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, 
std::__future_base::_Result_base::_Deleter>, std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)(), tensorrt_llm::batch_manager::DataResponder::Impl*> >, void> >::_M_invoke(std::_Any_data const&) () 
from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#4  0x000000000043821b in std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*) ()
#5  0x00007fd0b0d8de97 in __pthread_once_slow () from /lib64/libpthread.so.0
#6  0x00007fd113cfaa8d in std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)(), 
tensorrt_llm::batch_manager::DataResponder::Impl*> >, void>::_Async_state_impl(std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)(), tensorrt_llm::batch_manager::DataResponder::Impl*> 
>&&)::{lambda()#1}> > >::_M_run() () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#7  0x00007fd0fd6ef930 in execute_native_thread_routine () from 
/home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/x86_64-linux-gnu/libtensorrt_llm_nvrtc_wrapper.so
#8  0x00007fd0b0d861ca in start_thread () from /lib64/libpthread.so.0
#9  0x00007fd0b00b28d3 in clone () from /lib64/libc.so.6

Thread 14 (Thread 0x7fc564ba8000 (LWP 3913229) "executor_server"):
#0  0x00007fd0b0d8c48c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fd0b4847727 in ncclProxyProgress(void*) () from /lib64/libnccl.so.2
#2  0x00007fd0b0d861ca in start_thread () from /lib64/libpthread.so.0
#3  0x00007fd0b00b28d3 in clone () from /lib64/libc.so.6

Thread 13 (Thread 0x7fc5653a9000 (LWP 3913226) "executor_server"):
#0  0x00007fd0b01abac1 in poll () from /lib64/libc.so.6
#1  0x00007fd0b484647a in ncclProxyServiceUDS(void*) () from /lib64/libnccl.so.2
#2  0x00007fd0b0d861ca in start_thread () from /lib64/libpthread.so.0
#3  0x00007fd0b00b28d3 in clone () from /lib64/libc.so.6

Thread 12 (Thread 0x7fce88e67000 (LWP 3913225) "executor_server"):
#0  0x00007fd0b01abac1 in poll () from /lib64/libc.so.6
#1  0x00007fd0b4847dc2 in ncclProxyService(void*) () from /lib64/libnccl.so.2
#2  0x00007fd0b0d861ca in start_thread () from /lib64/libpthread.so.0
#3  0x00007fd0b00b28d3 in clone () from /lib64/libc.so.6

Thread 11 (Thread 0x7fc574ffd000 (LWP 3913202) "executor_server"):
#0  0x00007fd0b0d8fae4 in read () from /lib64/libpthread.so.0
#1  0x00007fd09e885fa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007fd0b48780d2 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /lib64/libnccl.so.2
#3  0x00007fd0b489ac04 in ncclIbAsyncThreadMain(void*) () from /lib64/libnccl.so.2
#4  0x00007fd0b0d861ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007fd0b00b28d3 in clone () from /lib64/libc.so.6

Thread 10 (Thread 0x7fc5757fe000 (LWP 3913199) "executor_server"):
#0  0x00007fd0b0d8fae4 in read () from /lib64/libpthread.so.0
#1  0x00007fd09e885fa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007fd0b48780d2 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /lib64/libnccl.so.2
#3  0x00007fd0b489ac04 in ncclIbAsyncThreadMain(void*) () from /lib64/libnccl.so.2
#4  0x00007fd0b0d861ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007fd0b00b28d3 in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x7fc575fff000 (LWP 3913197) "executor_server"):
#0  0x00007fd0b0d8fae4 in read () from /lib64/libpthread.so.0
#1  0x00007fd09e885fa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007fd0b48780d2 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /lib64/libnccl.so.2
#3  0x00007fd0b489ac04 in ncclIbAsyncThreadMain(void*) () from /lib64/libnccl.so.2
#4  0x00007fd0b0d861ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007fd0b00b28d3 in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x7fc9e6e05000 (LWP 3913195) "executor_server"):
#0  0x00007fd0b0d8fae4 in read () from /lib64/libpthread.so.0
#1  0x00007fd09e885fa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007fd0b48780d2 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /lib64/libnccl.so.2
#3  0x00007fd0b489ac04 in ncclIbAsyncThreadMain(void*) () from /lib64/libnccl.so.2
#4  0x00007fd0b0d861ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007fd0b00b28d3 in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7fc9e7606000 (LWP 3913189) "executor_server"):
#0  0x00007fd0b0d8fae4 in read () from /lib64/libpthread.so.0
#1  0x00007fd09e885fa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007fd0b48780d2 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /lib64/libnccl.so.2
#3  0x00007fd0b489ac04 in ncclIbAsyncThreadMain(void*) () from /lib64/libnccl.so.2
#4  0x00007fd0b0d861ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007fd0b00b28d3 in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7fc9e7e07000 (LWP 3913187) "executor_server"):
#0  0x00007fd0b0d8fae4 in read () from /lib64/libpthread.so.0
#1  0x00007fd09e885fa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007fd0b48780d2 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /lib64/libnccl.so.2
#3  0x00007fd0b489ac04 in ncclIbAsyncThreadMain(void*) () from /lib64/libnccl.so.2
#4  0x00007fd0b0d861ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007fd0b00b28d3 in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7fce8b55e000 (LWP 3913173) "cuda-EvtHandlr"):
#0  0x00007fd0b0d8c48c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fc4f85794ba in ?? () from /lib64/libcudadebugger.so.1
#2  0x00007fc4f8577610 in ?? () from /lib64/libcudadebugger.so.1
#3  0x00007fc4f818e225 in ?? () from /lib64/libcudadebugger.so.1
#4  0x00007fc4f81e2961 in ?? () from /lib64/libcudadebugger.so.1
#5  0x00007fc4f8173a9f in ?? () from /lib64/libcudadebugger.so.1
#6  0x00007fc4f8274bd6 in ?? () from /lib64/libcudadebugger.so.1
#7  0x00007fd0b2a9e42b in ?? () from /lib64/libcuda.so.1
#8  0x00007fd0b293b184 in ?? () from /lib64/libcuda.so.1
#9  0x00007fd0b28640f3 in ?? () from /lib64/libcuda.so.1
#10 0x00007fd0b0d861ca in start_thread () from /lib64/libpthread.so.0
#11 0x00007fd0b00b28d3 in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7fd09a759000 (LWP 3913167) "cuda00006000019"):
#0  0x00007fd0b01abac1 in poll () from /lib64/libc.so.6
#1  0x00007fd0b2867e4f in ?? () from /lib64/libcuda.so.1
#2  0x00007fd0b293b12f in ?? () from /lib64/libcuda.so.1
#3  0x00007fd0b28640f3 in ?? () from /lib64/libcuda.so.1
#4  0x00007fd0b0d861ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007fd0b00b28d3 in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7fd0a658a000 (LWP 3913159) "executor_server"):
#0  0x00007fd0b01b8307 in epoll_wait () from /lib64/libc.so.6
#1  0x00007fd0aebcf509 in ?? () from /lib64/libevent_core-2.1.so.6
#2  0x00007fd0aebc51e8 in event_base_loop () from /lib64/libevent_core-2.1.so.6
#3  0x00007fd0ac7f9276 in ?? () from /lib64/libpmix.so.2
#4  0x00007fd0b0d861ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007fd0b00b28d3 in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7fd0ad491000 (LWP 3913155) "executor_server"):
#0  0x00007fd0b01abac1 in poll () from /lib64/libc.so.6
#1  0x00007fd0aebce8e9 in ?? () from /lib64/libevent_core-2.1.so.6
#2  0x00007fd0aebc51e8 in event_base_loop () from /lib64/libevent_core-2.1.so.6
#3  0x00007fd0af265d96 in ?? () from /usr/lib64/openmpi/lib/libopen-pal.so.40
#4  0x00007fd0b0d861ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007fd0b00b28d3 in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7fd134703000 (LWP 3913149) "executor_server"):
#0  0x00007fd0b0d8fae4 in read () from /lib64/libpthread.so.0
#1  0x0000000000543b1a in google::protobuf::io::FileInputStream::CopyingFileInputStream::Read(void*, int) ()
#2  0x0000000000548812 in google::protobuf::io::CopyingInputStreamAdaptor::Next(void const**, int*) ()
#3  0x00000000005418a7 in google::protobuf::io::CodedInputStream::Refresh() [clone .part.0] ()
#4  0x000000000043674e in readerWorker(modules::util::FileDescriptor, modules::executor_server::Executor&, std::shared_ptr<modules::util::LoraLoader>, unsigned long, std::function<void (modules::executor_server::Response&&)> const&) ()
#5  0x000000000042d3bb in main ()

cuda-gdb rank1

(cuda-gdb) info cuda kernels
No CUDA kernels.
(cuda-gdb) thread apply all bt

Thread 19 (Thread 0x7f75a7fff000 (LWP 3273909) "executor_server"):
#0  0x00007f81cb05c307 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f75c2578984 in ?? () from /lib64/libcudadebugger.so.1
#2  0x00007f75c2579462 in ?? () from /lib64/libcudadebugger.so.1
#3  0x00007f75c2577610 in ?? () from /lib64/libcudadebugger.so.1
#4  0x00007f75c2580aba in ?? () from /lib64/libcudadebugger.so.1
#5  0x00007f75c2580fbc in ?? () from /lib64/libcudadebugger.so.1
#6  0x00007f75c2466020 in ?? () from /lib64/libcudadebugger.so.1
#7  0x00007f75c25c0997 in ?? () from /lib64/libcudadebugger.so.1
#8  0x00007f81cbc2a1ca in start_thread () from /lib64/libpthread.so.0
#9  0x00007f81caf568d3 in clone () from /lib64/libc.so.6

Thread 17 (Thread 0x7f76417fe000 (LWP 3913348) "cuda-EvtHandlr"):
#0  0x00007f81cbc3048c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f75c25794ba in ?? () from /lib64/libcudadebugger.so.1
#2  0x00007f75c2577610 in ?? () from /lib64/libcudadebugger.so.1
#3  0x00007f75c218e225 in ?? () from /lib64/libcudadebugger.so.1
#4  0x00007f75c21e2961 in ?? () from /lib64/libcudadebugger.so.1
#5  0x00007f75c2173a9f in ?? () from /lib64/libcudadebugger.so.1
#6  0x00007f75c2274bd6 in ?? () from /lib64/libcudadebugger.so.1
#7  0x00007f81cd94242b in ?? () from /lib64/libcuda.so.1
#8  0x00007f81cd7df184 in ?? () from /lib64/libcuda.so.1
#9  0x00007f81cd7080f3 in ?? () from /lib64/libcuda.so.1
#10 0x00007f81cbc2a1ca in start_thread () from /lib64/libpthread.so.0
#11 0x00007f81caf568d3 in clone () from /lib64/libc.so.6

Thread 16 (Thread 0x7f7641fff000 (LWP 3913241) "executionLoop"):
#0  0x00007f81caf5641d in syscall () from /lib64/libc.so.6
#1  0x00007f75c25bf730 in ?? () from /lib64/libcudadebugger.so.1
#2  0x00007f75c21585ec in ?? () from /lib64/libcudadebugger.so.1
#3  0x00007f75c2274bd6 in ?? () from /lib64/libcudadebugger.so.1
#4  0x00007f81cd79221f in ?? () from /lib64/libcuda.so.1
#5  0x00007f824a154135 in libcudart_static_7304beb15f17907bccdc6b7603b23d2a2951bc68 () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so
#6  0x00007f824a1b7acd in cudaStreamQuery () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so
#7  0x00007f824a12b52b in (anonymous namespace)::ncclStreamSynchronize(CUstream_st*, ncclComm*) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so
#8  0x00007f824a12c53c in tensorrt_llm::plugins::AllgatherPlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) () from 
/home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so
#9  0x00007f81df927fec in ?? () from /usr/local/tensorrt/targets/x86_64-linux-gnu/lib/libnvinfer.so.10
#10 0x00007f81df898a55 in ?? () from /usr/local/tensorrt/targets/x86_64-linux-gnu/lib/libnvinfer.so.10
#11 0x00007f81df89a609 in ?? () from /usr/local/tensorrt/targets/x86_64-linux-gnu/lib/libnvinfer.so.10
#12 0x00007f822e82a3a4 in tensorrt_llm::runtime::TllmRuntime::executeContext(int) const () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#13 0x00007f822ec1608b in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeContext(int, int) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#14 0x00007f822ec250b6 in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeStep(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > 
> const&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, int) () from 
/home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#15 0x00007f822ec258de in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(tensorrt_llm::batch_manager::ScheduledRequests const&) () from 
/home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#16 0x00007f822ec2601b in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, 
std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#17 0x00007f822ecbb4a1 in tensorrt_llm::executor::Executor::Impl::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) () 
from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#18 0x00007f822ecc25bc in tensorrt_llm::executor::Executor::Impl::executionLoop() () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#19 0x00007f8218593930 in execute_native_thread_routine () from 
/home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/x86_64-linux-gnu/libtensorrt_llm_nvrtc_wrapper.so
#20 0x00007f81cbc2a1ca in start_thread () from /lib64/libpthread.so.0
#21 0x00007f81caf568d3 in clone () from /lib64/libc.so.6

Thread 15 (Thread 0x7f76609f8000 (LWP 3913235) "dataTransResp"):
#0  0x00007f81cbc3048c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f81cb949870 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
#2  0x00007f822eb9fa6b in tensorrt_llm::batch_manager::DataResponder::Impl::response() () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#3  0x00007f822eb9dc3d in std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, 
std::__future_base::_Result_base::_Deleter>, std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)(), tensorrt_llm::batch_manager::DataResponder::Impl*> >, void> >::_M_invoke(std::_Any_data const&) () 
from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#4  0x000000000043821b in std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*) ()
#5  0x00007f81cbc31e97 in __pthread_once_slow () from /lib64/libpthread.so.0
#6  0x00007f822eb9ea8d in std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)(), 
tensorrt_llm::batch_manager::DataResponder::Impl*> >, void>::_Async_state_impl(std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)(), tensorrt_llm::batch_manager::DataResponder::Impl*> 
>&&)::{lambda()#1}> > >::_M_run() () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#7  0x00007f8218593930 in execute_native_thread_routine () from 
/home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/x86_64-linux-gnu/libtensorrt_llm_nvrtc_wrapper.so
#8  0x00007f81cbc2a1ca in start_thread () from /lib64/libpthread.so.0
#9  0x00007f81caf568d3 in clone () from /lib64/libc.so.6

Thread 14 (Thread 0x7f76618fa000 (LWP 3913228) "executor_server"):
#0  0x00007f81cbc3048c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f81cf6eb727 in ncclProxyProgress(void*) () from /lib64/libnccl.so.2
#2  0x00007f81cbc2a1ca in start_thread () from /lib64/libpthread.so.0
#3  0x00007f81caf568d3 in clone () from /lib64/libc.so.6

Thread 13 (Thread 0x7f76627fc000 (LWP 3913222) "executor_server"):
#0  0x00007f81cb04fac1 in poll () from /lib64/libc.so.6
#1  0x00007f81cf6ea47a in ncclProxyServiceUDS(void*) () from /lib64/libnccl.so.2
#2  0x00007f81cbc2a1ca in start_thread () from /lib64/libpthread.so.0
#3  0x00007f81caf568d3 in clone () from /lib64/libc.so.6

Thread 12 (Thread 0x7f7662ffd000 (LWP 3913220) "executor_server"):
#0  0x00007f81cb04fac1 in poll () from /lib64/libc.so.6
#1  0x00007f81cf6ebdc2 in ncclProxyService(void*) () from /lib64/libnccl.so.2
#2  0x00007f81cbc2a1ca in start_thread () from /lib64/libpthread.so.0
#3  0x00007f81caf568d3 in clone () from /lib64/libc.so.6

Thread 11 (Thread 0x7f76637fe000 (LWP 3913204) "executor_server"):
#0  0x00007f81cbc33ae4 in read () from /lib64/libpthread.so.0
#1  0x00007f81bd827fa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007f81cf71c0d2 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /lib64/libnccl.so.2
#3  0x00007f81cf73ec04 in ncclIbAsyncThreadMain(void*) () from /lib64/libnccl.so.2
#4  0x00007f81cbc2a1ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f81caf568d3 in clone () from /lib64/libc.so.6

Thread 10 (Thread 0x7f7663fff000 (LWP 3913200) "executor_server"):
#0  0x00007f81cbc33ae4 in read () from /lib64/libpthread.so.0
#1  0x00007f81bd827fa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007f81cf71c0d2 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /lib64/libnccl.so.2
#3  0x00007f81cf73ec04 in ncclIbAsyncThreadMain(void*) () from /lib64/libnccl.so.2
#4  0x00007f81cbc2a1ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f81caf568d3 in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x7f7686ba8000 (LWP 3913193) "executor_server"):
#0  0x00007f81cbc33ae4 in read () from /lib64/libpthread.so.0
#1  0x00007f81bd827fa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007f81cf71c0d2 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /lib64/libnccl.so.2
#3  0x00007f81cf73ec04 in ncclIbAsyncThreadMain(void*) () from /lib64/libnccl.so.2
#4  0x00007f81cbc2a1ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f81caf568d3 in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x7f76873a9000 (LWP 3913191) "executor_server"):
#0  0x00007f81cbc33ae4 in read () from /lib64/libpthread.so.0
#1  0x00007f81bd827fa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007f81cf71c0d2 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /lib64/libnccl.so.2
#3  0x00007f81cf73ec04 in ncclIbAsyncThreadMain(void*) () from /lib64/libnccl.so.2
#4  0x00007f81cbc2a1ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f81caf568d3 in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7f7b02d31000 (LWP 3913190) "executor_server"):
#0  0x00007f81cbc33ae4 in read () from /lib64/libpthread.so.0
#1  0x00007f81bd827fa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007f81cf71c0d2 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /lib64/libnccl.so.2
#3  0x00007f81cf73ec04 in ncclIbAsyncThreadMain(void*) () from /lib64/libnccl.so.2
#4  0x00007f81cbc2a1ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f81caf568d3 in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7f7b03532000 (LWP 3913188) "executor_server"):
#0  0x00007f81cbc33ae4 in read () from /lib64/libpthread.so.0
#1  0x00007f81bd827fa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007f81cf71c0d2 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /lib64/libnccl.so.2
#3  0x00007f81cf73ec04 in ncclIbAsyncThreadMain(void*) () from /lib64/libnccl.so.2
#4  0x00007f81cbc2a1ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f81caf568d3 in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7f7fa62ce000 (LWP 3913177) "cuda-EvtHandlr"):
#0  0x00007f81cb04fac1 in poll () from /lib64/libc.so.6
#1  0x00007f81cd70be4f in ?? () from /lib64/libcuda.so.1
#2  0x00007f81cd7df12f in ?? () from /lib64/libcuda.so.1
#3  0x00007f81cd7080f3 in ?? () from /lib64/libcuda.so.1
#4  0x00007f81cbc2a1ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f81caf568d3 in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f81b555b000 (LWP 3913170) "cuda00006000019"):
#0  0x00007f81cb04fac1 in poll () from /lib64/libc.so.6
#1  0x00007f81cd70be4f in ?? () from /lib64/libcuda.so.1
#2  0x00007f81cd7df12f in ?? () from /lib64/libcuda.so.1
#3  0x00007f81cd7080f3 in ?? () from /lib64/libcuda.so.1
#4  0x00007f81cbc2a1ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f81caf568d3 in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f81c5520000 (LWP 3913160) "executor_server"):
#0  0x00007f81cb05c307 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f81c9a73509 in ?? () from /lib64/libevent_core-2.1.so.6
#2  0x00007f81c9a691e8 in event_base_loop () from /lib64/libevent_core-2.1.so.6
#3  0x00007f81c769d276 in ?? () from /lib64/libpmix.so.2
#4  0x00007f81cbc2a1ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f81caf568d3 in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f81c8335000 (LWP 3913156) "executor_server"):
#0  0x00007f81cb04fac1 in poll () from /lib64/libc.so.6
#1  0x00007f81c9a728e9 in ?? () from /lib64/libevent_core-2.1.so.6
#2  0x00007f81c9a691e8 in event_base_loop () from /lib64/libevent_core-2.1.so.6
#3  0x00007f81ca109d96 in ?? () from /usr/lib64/openmpi/lib/libopen-pal.so.40
#4  0x00007f81cbc2a1ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f81caf568d3 in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f824f5a7000 (LWP 3913150) "executor_server"):
#0  0x00007f81cbc2b6cd in __pthread_timedjoin_ex () from /lib64/libpthread.so.0
#1  0x00007f81cb94fda7 in std::thread::join() () from /lib64/libstdc++.so.6
#2  0x00007f822ecbf009 in tensorrt_llm::executor::Executor::Impl::shutdown() () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#3  0x00007f822ecbf0f6 in tensorrt_llm::executor::Executor::Impl::~Impl() () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#4  0x00007f822eca3f81 in tensorrt_llm::executor::Executor::~Executor() () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#5  0x000000000046247a in modules::executor_server::Executor::ExecutorImpl::~ExecutorImpl() ()
#6  0x000000000046e701 in modules::executor_server::Executor::~Executor() ()
#7  0x0000000000440381 in std::unique_ptr<modules::executor_server::Executor, std::default_delete<modules::executor_server::Executor> >::~unique_ptr() ()
#8  0x000000000042d4fc in main ()

cuda-gdb rank2

(cuda-gdb) info cuda kernels 
No CUDA kernels.
(cuda-gdb) thread apply all bt

Thread 19 (Thread 0x7fb9a902b000 (LWP 3272138) "executor_server"):
#0  0x00007fc5b12a9307 in epoll_wait () from /lib64/libc.so.6
#1  0x00007fb9a955d984 in ?? () from /lib64/libcudadebugger.so.1
#2  0x00007fb9a955e462 in ?? () from /lib64/libcudadebugger.so.1
#3  0x00007fb9a955c610 in ?? () from /lib64/libcudadebugger.so.1
#4  0x00007fb9a9565aba in ?? () from /lib64/libcudadebugger.so.1
#5  0x00007fb9a9565fbc in ?? () from /lib64/libcudadebugger.so.1
#6  0x00007fb9a944b020 in ?? () from /lib64/libcudadebugger.so.1
#7  0x00007fb9a95a5997 in ?? () from /lib64/libcudadebugger.so.1
#8  0x00007fc5b1e771ca in start_thread () from /lib64/libpthread.so.0
#9  0x00007fc5b11a38d3 in clone () from /lib64/libc.so.6

Thread 17 (Thread 0x7fba3d9fa000 (LWP 3913346) "cuda-EvtHandlr"):
#0  0x00007fc5b1e7d48c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fb9a955e4ba in ?? () from /lib64/libcudadebugger.so.1
#2  0x00007fb9a955c610 in ?? () from /lib64/libcudadebugger.so.1
#3  0x00007fb9a9173225 in ?? () from /lib64/libcudadebugger.so.1
#4  0x00007fb9a91c7961 in ?? () from /lib64/libcudadebugger.so.1
#5  0x00007fb9a9158a9f in ?? () from /lib64/libcudadebugger.so.1
#6  0x00007fb9a9259bd6 in ?? () from /lib64/libcudadebugger.so.1
#7  0x00007fc5b3b8f42b in ?? () from /lib64/libcuda.so.1
#8  0x00007fc5b3a2c184 in ?? () from /lib64/libcuda.so.1
#9  0x00007fc5b39550f3 in ?? () from /lib64/libcuda.so.1
#10 0x00007fc5b1e771ca in start_thread () from /lib64/libpthread.so.0
#11 0x00007fc5b11a38d3 in clone () from /lib64/libc.so.6

Thread 16 (Thread 0x7fba3e1fb000 (LWP 3913238) "executionLoop"):
#0  0x00007fc5b11a2b8b in sched_yield () from /lib64/libc.so.6
#1  0x00007fc630378523 in (anonymous namespace)::ncclStreamSynchronize(CUstream_st*, ncclComm*) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so
#2  0x00007fc63037953c in tensorrt_llm::plugins::AllgatherPlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) () from 
/home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so
#3  0x00007fc5c5b74fec in ?? () from /usr/local/tensorrt/targets/x86_64-linux-gnu/lib/libnvinfer.so.10
#4  0x00007fc5c5ae5a55 in ?? () from /usr/local/tensorrt/targets/x86_64-linux-gnu/lib/libnvinfer.so.10
#5  0x00007fc5c5ae7609 in ?? () from /usr/local/tensorrt/targets/x86_64-linux-gnu/lib/libnvinfer.so.10
#6  0x00007fc614a773a4 in tensorrt_llm::runtime::TllmRuntime::executeContext(int) const () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#7  0x00007fc614e6308b in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeContext(int, int) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#8  0x00007fc614e720b6 in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeStep(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > 
> const&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, int) () from 
/home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#9  0x00007fc614e728de in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(tensorrt_llm::batch_manager::ScheduledRequests const&) () from 
/home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#10 0x00007fc614e7301b in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, 
std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#11 0x00007fc614f084a1 in tensorrt_llm::executor::Executor::Impl::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) () 
from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#12 0x00007fc614f0f5bc in tensorrt_llm::executor::Executor::Impl::executionLoop() () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#13 0x00007fc5fe7e0930 in execute_native_thread_routine () from 
/home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/x86_64-linux-gnu/libtensorrt_llm_nvrtc_wrapper.so
#14 0x00007fc5b1e771ca in start_thread () from /lib64/libpthread.so.0
#15 0x00007fc5b11a38d3 in clone () from /lib64/libc.so.6

Thread 15 (Thread 0x7fba3e9fc000 (LWP 3913234) "dataTransResp"):
#0  0x00007fc5b1e7d48c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fc5b1b96870 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
#2  0x00007fc614deca6b in tensorrt_llm::batch_manager::DataResponder::Impl::response() () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#3  0x00007fc614deac3d in std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, 
std::__future_base::_Result_base::_Deleter>, std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)(), tensorrt_llm::batch_manager::DataResponder::Impl*> >, void> >::_M_invoke(std::_Any_data const&) () 
from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#4  0x000000000043821b in std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*) ()
#5  0x00007fc5b1e7ee97 in __pthread_once_slow () from /lib64/libpthread.so.0
#6  0x00007fc614deba8d in std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)(), 
tensorrt_llm::batch_manager::DataResponder::Impl*> >, void>::_Async_state_impl(std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)(), tensorrt_llm::batch_manager::DataResponder::Impl*> 
>&&)::{lambda()#1}> > >::_M_run() () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#7  0x00007fc5fe7e0930 in execute_native_thread_routine () from 
/home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/x86_64-linux-gnu/libtensorrt_llm_nvrtc_wrapper.so
#8  0x00007fc5b1e771ca in start_thread () from /lib64/libpthread.so.0
#9  0x00007fc5b11a38d3 in clone () from /lib64/libc.so.6

Thread 14 (Thread 0x7fba3f8fe000 (LWP 3913230) "executor_server"):
#0  0x00007fc5b1e7d48c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fc5b5938727 in ncclProxyProgress(void*) () from /lib64/libnccl.so.2
#2  0x00007fc5b1e771ca in start_thread () from /lib64/libpthread.so.0
#3  0x00007fc5b11a38d3 in clone () from /lib64/libc.so.6

Thread 13 (Thread 0x7fba6aba8000 (LWP 3913221) "executor_server"):
#0  0x00007fc5b129cac1 in poll () from /lib64/libc.so.6
#1  0x00007fc5b593747a in ncclProxyServiceUDS(void*) () from /lib64/libnccl.so.2
#2  0x00007fc5b1e771ca in start_thread () from /lib64/libpthread.so.0
#3  0x00007fc5b11a38d3 in clone () from /lib64/libc.so.6

Thread 12 (Thread 0x7fba6b3a9000 (LWP 3913219) "executor_server"):
#0  0x00007fc5b129cac1 in poll () from /lib64/libc.so.6
#1  0x00007fc5b5938dc2 in ncclProxyService(void*) () from /lib64/libnccl.so.2
#2  0x00007fc5b1e771ca in start_thread () from /lib64/libpthread.so.0
#3  0x00007fc5b11a38d3 in clone () from /lib64/libc.so.6

Thread 11 (Thread 0x7fbee6e04000 (LWP 3913215) "executor_server"):
#0  0x00007fc5b1e80ae4 in read () from /lib64/libpthread.so.0
#1  0x00007fc59f8f2fa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007fc5b59690d2 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /lib64/libnccl.so.2
#3  0x00007fc5b598bc04 in ncclIbAsyncThreadMain(void*) () from /lib64/libnccl.so.2
#4  0x00007fc5b1e771ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc5b11a38d3 in clone () from /lib64/libc.so.6

Thread 10 (Thread 0x7fbee7605000 (LWP 3913214) "executor_server"):
#0  0x00007fc5b1e80ae4 in read () from /lib64/libpthread.so.0
#1  0x00007fc59f8f2fa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007fc5b59690d2 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /lib64/libnccl.so.2
#3  0x00007fc5b598bc04 in ncclIbAsyncThreadMain(void*) () from /lib64/libnccl.so.2
#4  0x00007fc5b1e771ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc5b11a38d3 in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x7fbee7e06000 (LWP 3913213) "executor_server"):
#0  0x00007fc5b1e80ae4 in read () from /lib64/libpthread.so.0
#1  0x00007fc59f8f2fa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007fc5b59690d2 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /lib64/libnccl.so.2
#3  0x00007fc5b598bc04 in ncclIbAsyncThreadMain(void*) () from /lib64/libnccl.so.2
#4  0x00007fc5b1e771ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc5b11a38d3 in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x7fc388e83000 (LWP 3913212) "executor_server"):
#0  0x00007fc5b1e80ae4 in read () from /lib64/libpthread.so.0
#1  0x00007fc59f8f2fa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007fc5b59690d2 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /lib64/libnccl.so.2
#3  0x00007fc5b598bc04 in ncclIbAsyncThreadMain(void*) () from /lib64/libnccl.so.2
#4  0x00007fc5b1e771ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc5b11a38d3 in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7fc389684000 (LWP 3913211) "executor_server"):
#0  0x00007fc5b1e80ae4 in read () from /lib64/libpthread.so.0
#1  0x00007fc59f8f2fa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007fc5b59690d2 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /lib64/libnccl.so.2
#3  0x00007fc5b598bc04 in ncclIbAsyncThreadMain(void*) () from /lib64/libnccl.so.2
#4  0x00007fc5b1e771ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc5b11a38d3 in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7fc389e85000 (LWP 3913210) "executor_server"):
#0  0x00007fc5b1e80ae4 in read () from /lib64/libpthread.so.0
#1  0x00007fc59f8f2fa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007fc5b59690d2 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /lib64/libnccl.so.2
#3  0x00007fc5b598bc04 in ncclIbAsyncThreadMain(void*) () from /lib64/libnccl.so.2
#4  0x00007fc5b1e771ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc5b11a38d3 in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7fc38bf7b000 (LWP 3913175) "cuda-EvtHandlr"):
#0  0x00007fc5b129cac1 in poll () from /lib64/libc.so.6
#1  0x00007fc5b3958e4f in ?? () from /lib64/libcuda.so.1
#2  0x00007fc5b3a2c12f in ?? () from /lib64/libcuda.so.1
#3  0x00007fc5b39550f3 in ?? () from /lib64/libcuda.so.1
#4  0x00007fc5b1e771ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc5b11a38d3 in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7fc59b7c6000 (LWP 3913168) "cuda00006000019"):
#0  0x00007fc5b129cac1 in poll () from /lib64/libc.so.6
#1  0x00007fc5b3958e4f in ?? () from /lib64/libcuda.so.1
#2  0x00007fc5b3a2c12f in ?? () from /lib64/libcuda.so.1
#3  0x00007fc5b39550f3 in ?? () from /lib64/libcuda.so.1
#4  0x00007fc5b1e771ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc5b11a38d3 in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7fc5a75c7000 (LWP 3913157) "executor_server"):
#0  0x00007fc5b12a9307 in epoll_wait () from /lib64/libc.so.6
#1  0x00007fc5afcc0509 in ?? () from /lib64/libevent_core-2.1.so.6
#2  0x00007fc5afcb61e8 in event_base_loop () from /lib64/libevent_core-2.1.so.6
#3  0x00007fc5ad8ea276 in ?? () from /lib64/libpmix.so.2
#4  0x00007fc5b1e771ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc5b11a38d3 in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7fc5ae582000 (LWP 3913153) "executor_server"):
#0  0x00007fc5b129cac1 in poll () from /lib64/libc.so.6
#1  0x00007fc5afcbf8e9 in ?? () from /lib64/libevent_core-2.1.so.6
#2  0x00007fc5afcb61e8 in event_base_loop () from /lib64/libevent_core-2.1.so.6
#3  0x00007fc5b0356d96 in ?? () from /usr/lib64/openmpi/lib/libopen-pal.so.40
#4  0x00007fc5b1e771ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc5b11a38d3 in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7fc6357f4000 (LWP 3913151) "executor_server"):
#0  0x00007fc5b1e786cd in __pthread_timedjoin_ex () from /lib64/libpthread.so.0
#1  0x00007fc5b1b9cda7 in std::thread::join() () from /lib64/libstdc++.so.6
#2  0x00007fc614f0c009 in tensorrt_llm::executor::Executor::Impl::shutdown() () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#3  0x00007fc614f0c0f6 in tensorrt_llm::executor::Executor::Impl::~Impl() () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#4  0x00007fc614ef0f81 in tensorrt_llm::executor::Executor::~Executor() () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#5  0x000000000046247a in modules::executor_server::Executor::ExecutorImpl::~ExecutorImpl() ()
#6  0x000000000046e701 in modules::executor_server::Executor::~Executor() ()
#7  0x0000000000440381 in std::unique_ptr<modules::executor_server::Executor, std::default_delete<modules::executor_server::Executor> >::~unique_ptr() ()
#8  0x000000000042d4fc in main ()

cuda-gdb rank3

(cuda-gdb) info cuda kernels
No CUDA kernels.

(cuda-gdb) thread apply all bt
Thread 19 (Thread 0x7f0ec48d9000 (LWP 3268986) "executor_server"):
#0  0x00007f1a39775307 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f0e2b015984 in ?? () from /lib64/libcudadebugger.so.1
#2  0x00007f0e2b016462 in ?? () from /lib64/libcudadebugger.so.1
#3  0x00007f0e2b014610 in ?? () from /lib64/libcudadebugger.so.1
#4  0x00007f0e2b01daba in ?? () from /lib64/libcudadebugger.so.1
#5  0x00007f0e2b01dfbc in ?? () from /lib64/libcudadebugger.so.1
#6  0x00007f0e2af03020 in ?? () from /lib64/libcudadebugger.so.1
#7  0x00007f0e2b05d997 in ?? () from /lib64/libcudadebugger.so.1
#8  0x00007f1a3a3431ca in start_thread () from /lib64/libpthread.so.0
#9  0x00007f1a3966f8d3 in clone () from /lib64/libc.so.6

Thread 17 (Thread 0x7f0ec60fb000 (LWP 3913347) "cuda-EvtHandlr"):
#0  0x00007f1a3a34948c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f0e2b0164ba in ?? () from /lib64/libcudadebugger.so.1
#2  0x00007f0e2b014610 in ?? () from /lib64/libcudadebugger.so.1
#3  0x00007f0e2ac2b225 in ?? () from /lib64/libcudadebugger.so.1
#4  0x00007f0e2ac7f961 in ?? () from /lib64/libcudadebugger.so.1
#5  0x00007f0e2ac10a9f in ?? () from /lib64/libcudadebugger.so.1
#6  0x00007f0e2ad11bd6 in ?? () from /lib64/libcudadebugger.so.1
#7  0x00007f1a3c05b42b in ?? () from /lib64/libcuda.so.1
#8  0x00007f1a3bef8184 in ?? () from /lib64/libcuda.so.1
#9  0x00007f1a3be210f3 in ?? () from /lib64/libcuda.so.1
#10 0x00007f1a3a3431ca in start_thread () from /lib64/libpthread.so.0
#11 0x00007f1a3966f8d3 in clone () from /lib64/libc.so.6

Thread 16 (Thread 0x7f0ec68fc000 (LWP 3913239) "executionLoop"):
#0  0x00007f1a3966f41d in syscall () from /lib64/libc.so.6
#1  0x00007f0e2b05c730 in ?? () from /lib64/libcudadebugger.so.1
#2  0x00007f0e2abf5fd5 in ?? () from /lib64/libcudadebugger.so.1
#3  0x00007f0e2ad11bd6 in ?? () from /lib64/libcudadebugger.so.1
#4  0x00007f1a3bfe9d76 in ?? () from /lib64/libcuda.so.1
#5  0x00007f1ab88d0ac5 in cudaStreamQuery () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so
#6  0x00007f1ab884452b in (anonymous namespace)::ncclStreamSynchronize(CUstream_st*, ncclComm*) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so
#7  0x00007f1ab884553c in tensorrt_llm::plugins::AllgatherPlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) () from 
/home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so
#8  0x00007f1a4e040fec in ?? () from /usr/local/tensorrt/targets/x86_64-linux-gnu/lib/libnvinfer.so.10
#9  0x00007f1a4dfb1a55 in ?? () from /usr/local/tensorrt/targets/x86_64-linux-gnu/lib/libnvinfer.so.10
#10 0x00007f1a4dfb3609 in ?? () from /usr/local/tensorrt/targets/x86_64-linux-gnu/lib/libnvinfer.so.10
#11 0x00007f1a9cf433a4 in tensorrt_llm::runtime::TllmRuntime::executeContext(int) const () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#12 0x00007f1a9d32f08b in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeContext(int, int) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#13 0x00007f1a9d33e0b6 in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeStep(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > 
> const&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, int) () from 
/home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#14 0x00007f1a9d33e8de in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(tensorrt_llm::batch_manager::ScheduledRequests const&) () from 
/home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#15 0x00007f1a9d33f01b in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, 
std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#16 0x00007f1a9d3d44a1 in tensorrt_llm::executor::Executor::Impl::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) () 
from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#17 0x00007f1a9d3db5bc in tensorrt_llm::executor::Executor::Impl::executionLoop() () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#18 0x00007f1a86cac930 in execute_native_thread_routine () from 
/home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/x86_64-linux-gnu/libtensorrt_llm_nvrtc_wrapper.so
#19 0x00007f1a3a3431ca in start_thread () from /lib64/libpthread.so.0
#20 0x00007f1a3966f8d3 in clone () from /lib64/libc.so.6

Thread 15 (Thread 0x7f0ec70fd000 (LWP 3913236) "dataTransResp"):
#0  0x00007f1a3a34948c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f1a3a062870 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
#2  0x00007f1a9d2b8a6b in tensorrt_llm::batch_manager::DataResponder::Impl::response() () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#3  0x00007f1a9d2b6c3d in std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, 
std::__future_base::_Result_base::_Deleter>, std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)(), tensorrt_llm::batch_manager::DataResponder::Impl*> >, void> >::_M_invoke(std::_Any_data const&) () 
from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#4  0x000000000043821b in std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*) ()
#5  0x00007f1a3a34ae97 in __pthread_once_slow () from /lib64/libpthread.so.0
#6  0x00007f1a9d2b7a8d in std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)(), 
tensorrt_llm::batch_manager::DataResponder::Impl*> >, void>::_Async_state_impl(std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)(), tensorrt_llm::batch_manager::DataResponder::Impl*> 
>&&)::{lambda()#1}> > >::_M_run() () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#7  0x00007f1a86cac930 in execute_native_thread_routine () from 
/home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/x86_64-linux-gnu/libtensorrt_llm_nvrtc_wrapper.so
#8  0x00007f1a3a3431ca in start_thread () from /lib64/libpthread.so.0
#9  0x00007f1a3966f8d3 in clone () from /lib64/libc.so.6

Thread 14 (Thread 0x7f0ec7fff000 (LWP 3913231) "executor_server"):
#0  0x00007f1a3a34948c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f1a3de04727 in ncclProxyProgress(void*) () from /lib64/libnccl.so.2
#2  0x00007f1a3a3431ca in start_thread () from /lib64/libpthread.so.0
#3  0x00007f1a3966f8d3 in clone () from /lib64/libc.so.6

Thread 13 (Thread 0x7f0ef33a9000 (LWP 3913224) "executor_server"):
#0  0x00007f1a39768ac1 in poll () from /lib64/libc.so.6
#1  0x00007f1a3de0347a in ncclProxyServiceUDS(void*) () from /lib64/libnccl.so.2
#2  0x00007f1a3a3431ca in start_thread () from /lib64/libpthread.so.0
#3  0x00007f1a3966f8d3 in clone () from /lib64/libc.so.6

Thread 12 (Thread 0x7f136ee04000 (LWP 3913223) "executor_server"):
#0  0x00007f1a39768ac1 in poll () from /lib64/libc.so.6
#1  0x00007f1a3de04dc2 in ncclProxyService(void*) () from /lib64/libnccl.so.2
#2  0x00007f1a3a3431ca in start_thread () from /lib64/libpthread.so.0
#3  0x00007f1a3966f8d3 in clone () from /lib64/libc.so.6

Thread 11 (Thread 0x7f136f605000 (LWP 3913201) "executor_server"):
#0  0x00007f1a3a34cae4 in read () from /lib64/libpthread.so.0
#1  0x00007f1a27df1fa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007f1a3de350d2 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /lib64/libnccl.so.2
#3  0x00007f1a3de57c04 in ncclIbAsyncThreadMain(void*) () from /lib64/libnccl.so.2
#4  0x00007f1a3a3431ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f1a3966f8d3 in clone () from /lib64/libc.so.6

Thread 10 (Thread 0x7f136fe06000 (LWP 3913198) "executor_server"):
#0  0x00007f1a3a34cae4 in read () from /lib64/libpthread.so.0
#1  0x00007f1a27df1fa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007f1a3de350d2 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /lib64/libnccl.so.2
#3  0x00007f1a3de57c04 in ncclIbAsyncThreadMain(void*) () from /lib64/libnccl.so.2
#4  0x00007f1a3a3431ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f1a3966f8d3 in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x7f1810882000 (LWP 3913196) "executor_server"):
#0  0x00007f1a3a34cae4 in read () from /lib64/libpthread.so.0
#1  0x00007f1a27df1fa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007f1a3de350d2 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /lib64/libnccl.so.2
#3  0x00007f1a3de57c04 in ncclIbAsyncThreadMain(void*) () from /lib64/libnccl.so.2
#4  0x00007f1a3a3431ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f1a3966f8d3 in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x7f1811083000 (LWP 3913194) "executor_server"):
#0  0x00007f1a3a34cae4 in read () from /lib64/libpthread.so.0
#1  0x00007f1a27df1fa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007f1a3de350d2 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /lib64/libnccl.so.2
#3  0x00007f1a3de57c04 in ncclIbAsyncThreadMain(void*) () from /lib64/libnccl.so.2
#4  0x00007f1a3a3431ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f1a3966f8d3 in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7f1811884000 (LWP 3913192) "executor_server"):
#0  0x00007f1a3a34cae4 in read () from /lib64/libpthread.so.0
#1  0x00007f1a27df1fa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007f1a3de350d2 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /lib64/libnccl.so.2
#3  0x00007f1a3de57c04 in ncclIbAsyncThreadMain(void*) () from /lib64/libnccl.so.2
#4  0x00007f1a3a3431ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f1a3966f8d3 in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7f1812085000 (LWP 3913186) "executor_server"):
#0  0x00007f1a3a34cae4 in read () from /lib64/libpthread.so.0
#1  0x00007f1a27df1fa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007f1a3de350d2 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /lib64/libnccl.so.2
#3  0x00007f1a3de57c04 in ncclIbAsyncThreadMain(void*) () from /lib64/libnccl.so.2
#4  0x00007f1a3a3431ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f1a3966f8d3 in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7f1a20db1000 (LWP 3913176) "cuda-EvtHandlr"):
#0  0x00007f1a39768ac1 in poll () from /lib64/libc.so.6
#1  0x00007f1a3be24e4f in ?? () from /lib64/libcuda.so.1
#2  0x00007f1a3bef812f in ?? () from /lib64/libcuda.so.1
#3  0x00007f1a3be210f3 in ?? () from /lib64/libcuda.so.1
#4  0x00007f1a3a3431ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f1a3966f8d3 in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f1a23ecd000 (LWP 3913169) "cuda00006000019"):
#0  0x00007f1a39768ac1 in poll () from /lib64/libc.so.6
#1  0x00007f1a3be24e4f in ?? () from /lib64/libcuda.so.1
#2  0x00007f1a3bef812f in ?? () from /lib64/libcuda.so.1
#3  0x00007f1a3be210f3 in ?? () from /lib64/libcuda.so.1
#4  0x00007f1a3a3431ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f1a3966f8d3 in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f1a2fbf0000 (LWP 3913158) "executor_server"):
#0  0x00007f1a39775307 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f1a3818c509 in ?? () from /lib64/libevent_core-2.1.so.6
#2  0x00007f1a381821e8 in event_base_loop () from /lib64/libevent_core-2.1.so.6
#3  0x00007f1a35db6276 in ?? () from /lib64/libpmix.so.2
#4  0x00007f1a3a3431ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f1a3966f8d3 in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f1a36a4e000 (LWP 3913154) "executor_server"):
#0  0x00007f1a39768ac1 in poll () from /lib64/libc.so.6
#1  0x00007f1a3818b8e9 in ?? () from /lib64/libevent_core-2.1.so.6
#2  0x00007f1a381821e8 in event_base_loop () from /lib64/libevent_core-2.1.so.6
#3  0x00007f1a38822d96 in ?? () from /usr/lib64/openmpi/lib/libopen-pal.so.40
#4  0x00007f1a3a3431ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f1a3966f8d3 in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f1abdcc0000 (LWP 3913152) "executor_server"):
#0  0x00007f1a3a3446cd in __pthread_timedjoin_ex () from /lib64/libpthread.so.0
#1  0x00007f1a3a068da7 in std::thread::join() () from /lib64/libstdc++.so.6
#2  0x00007f1a9d3d8009 in tensorrt_llm::executor::Executor::Impl::shutdown() () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#3  0x00007f1a9d3d80f6 in tensorrt_llm::executor::Executor::Impl::~Impl() () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#4  0x00007f1a9d3bcf81 in tensorrt_llm::executor::Executor::~Executor() () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#5  0x000000000046247a in modules::executor_server::Executor::ExecutorImpl::~ExecutorImpl() ()
#6  0x000000000046e701 in modules::executor_server::Executor::~Executor() ()
#7  0x0000000000440381 in std::unique_ptr<modules::executor_server::Executor, std::default_delete<modules::executor_server::Executor> >::~unique_ptr() ()
#8  0x000000000042d4fc in main ()

@akhoroshev
Contributor Author

This problem went away for me after I set the environment variable FORCE_NCCL_ALL_REDUCE_STRATEGY=1.

I don't know why the custom all-reduce affects all-gather, but the hang is gone.

The custom all-reduce implementation probably has undefined behavior (UB) that affects NCCL.
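
For anyone who wants to try the same workaround from inside the server process, here is a minimal sketch. It assumes the variable is read after this call, so it has to run on every rank before the executor is constructed. The helper name forceNcclAllReduce is hypothetical and is not part of TensorRT-LLM; setenv is the POSIX call.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper (not a TensorRT-LLM API): sets the workaround variable
// for the current process. Assumption: the library reads the variable after
// this point, so call it on every rank before creating the executor.
bool forceNcclAllReduce()
{
    // Third argument 1 = overwrite any value inherited from the launcher.
    if (setenv("FORCE_NCCL_ALL_REDUCE_STRATEGY", "1", 1) != 0)
    {
        return false;
    }
    // Read the value back to confirm it is visible in this process.
    char const* value = std::getenv("FORCE_NCCL_ALL_REDUCE_STRATEGY");
    return value != nullptr && std::strcmp(value, "1") == 0;
}
```

Call forceNcclAllReduce() at the top of main(), before executorConfig is built. Alternatively, since the trace shows Open MPI, passing -x FORCE_NCCL_ALL_REDUCE_STRATEGY=1 to mpirun should export the variable to all four ranks without a code change.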
