allgather hangs trtllm #1550

akhoroshev · 2024-12-24T18:04:16Z

TensorRT-LLM (trtllm) makes heavy use of NCCL.

I have an internal stability test that failed after 20+ hours.

Some system info

lscpu 
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              192
On-line CPU(s) list: 0-191
Thread(s) per core:  2
Core(s) per socket:  48
Socket(s):           2
NUMA node(s):        2
Vendor ID:           AuthenticAMD
CPU family:          25
Model:               17
Model name:          AMD EPYC 9474F 48-Core Processor
Stepping:            1
CPU MHz:             3600.000
CPU max MHz:         4113.2808
CPU min MHz:         1500.0000
BogoMIPS:            7199.78
Virtualization:      AMD-V
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            32768K
NUMA node0 CPU(s):   0-47,96-143
NUMA node1 CPU(s):   48-95,144-191
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.05              Driver Version: 560.35.05      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|

nvidia-smi topo -m
	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	NIC0	NIC1	NIC2	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	NV18	NV18	NV18	NV18	NV18	NV18	NV18	NODE	NODE	NODE	0-47,96-143	0		N/A
GPU1	NV18	 X 	NV18	NV18	NV18	NV18	NV18	NV18	PHB	PHB	PHB	0-47,96-143	0		N/A
GPU2	NV18	NV18	 X 	NV18	NV18	NV18	NV18	NV18	NODE	NODE	NODE	0-47,96-143	0		N/A
GPU3	NV18	NV18	NV18	 X 	NV18	NV18	NV18	NV18	NODE	NODE	NODE	0-47,96-143	0		N/A
GPU4	NV18	NV18	NV18	NV18	 X 	NV18	NV18	NV18	SYS	SYS	SYS	48-95,144-191	1		N/A
GPU5	NV18	NV18	NV18	NV18	NV18	 X 	NV18	NV18	SYS	SYS	SYS	48-95,144-191	1		N/A
GPU6	NV18	NV18	NV18	NV18	NV18	NV18	 X 	NV18	SYS	SYS	SYS	48-95,144-191	1		N/A
GPU7	NV18	NV18	NV18	NV18	NV18	NV18	NV18	 X 	SYS	SYS	SYS	48-95,144-191	1		N/A
NIC0	NODE	PHB	NODE	NODE	SYS	SYS	SYS	SYS	 X 	PIX	PXB				
NIC1	NODE	PHB	NODE	NODE	SYS	SYS	SYS	SYS	PIX	 X 	PXB				
NIC2	NODE	PHB	NODE	NODE	SYS	SYS	SYS	SYS	PXB	PXB	 X 				

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_bond_0

nccl version: 2.22.3

About the hang:

|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:04:00.0 Off |                    0 |
| N/A   38C    P0            152W /  700W |   75370MiB /  81559MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  |   00000000:23:00.0 Off |                    0 |
| N/A   34C    P0            129W /  700W |   75368MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  |   00000000:43:00.0 Off |                    0 |
| N/A   37C    P0            137W /  700W |   75368MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  |   00000000:64:00.0 Off |                    0 |
| N/A   36C    P0            135W /  700W |   75368MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A   2211609      C   /opt/wmcore/executor_server                 75358MiB |
|    1   N/A  N/A   2211610      C   /opt/wmcore/executor_server                 75358MiB |
|    2   N/A  N/A   2211611      C   /opt/wmcore/executor_server                 75358MiB |
|    3   N/A  N/A   2211612      C   /opt/wmcore/executor_server                 75358MiB |

rank0

#0  0x00007f11879d3b60 in ncclDevFunc_AllGather_RING_SIMPLE() ()
#1  0x00007f0867fcd000 in ncclDevKernel_AllGather_RING_LL(ncclDevKernelArgsStorage<4096ul>)<<<(24,1,1),(544,1,1)>>> ()

Thread 16 (Thread 0x7f083d7fe000 (LWP 1705707) "executionLoop"):
#0  0x00007f13b2a0541d in syscall () from /lib64/libc.so.6
#1  0x00007f07ca5bf730 in ?? () from /lib64/libcudadebugger.so.1
#2  0x00007f07ca158fd5 in ?? () from /lib64/libcudadebugger.so.1
#3  0x00007f07ca274bd6 in ?? () from /lib64/libcudadebugger.so.1
#4  0x00007f13b537fd76 in ?? () from /lib64/libcuda.so.1
#5  0x00007f1431c66745 in cudaStreamQuery () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so
#6  0x00007f1431bdb43b in tensorrt_llm::plugins::AllgatherPlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so
#7  0x00007f13c73d6fec in ?? () from /usr/local/tensorrt/targets/x86_64-linux-gnu/lib/libnvinfer.so.10
#8  0x00007f13c7347a55 in ?? () from /usr/local/tensorrt/targets/x86_64-linux-gnu/lib/libnvinfer.so.10
#9  0x00007f13c7349609 in ?? () from /usr/local/tensorrt/targets/x86_64-linux-gnu/lib/libnvinfer.so.10
#10 0x00007f14162d93a4 in tensorrt_llm::runtime::TllmRuntime::executeContext(int) const () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#11 0x00007f14166c508b in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeContext(int, int) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#12 0x00007f14166d40b6 in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeStep(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, int) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#13 0x00007f14166d48de in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(tensorrt_llm::batch_manager::ScheduledRequests const&) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#14 0x00007f14166d501b in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#15 0x00007f141676a4a1 in tensorrt_llm::executor::Executor::Impl::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#16 0x00007f14167715bc in tensorrt_llm::executor::Executor::Impl::executionLoop() () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#17 0x00007f1400042930 in execute_native_thread_routine () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/x86_64-linux-gnu/libtensorrt_llm_nvrtc_wrapper.so
#18 0x00007f13b36d91ca in start_thread () from /lib64/libpthread.so.0
#19 0x00007f13b2a058d3 in clone () from /lib64/libc.so.6

rank1

Thread 16 (Thread 0x7f997d9fa000 (LWP 1705706)):
#0  0x00007fa4e83c7b8b in sched_yield () from /lib64/libc.so.6
#1  0x00007fa56759e433 in tensorrt_llm::plugins::AllgatherPlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so
#2  0x00007fa4fcd99fec in ?? () from /usr/local/tensorrt/targets/x86_64-linux-gnu/lib/libnvinfer.so.10
#3  0x00007fa4fcd0aa55 in ?? () from /usr/local/tensorrt/targets/x86_64-linux-gnu/lib/libnvinfer.so.10
#4  0x00007fa4fcd0c609 in ?? () from /usr/local/tensorrt/targets/x86_64-linux-gnu/lib/libnvinfer.so.10
#5  0x00007fa54bc9c3a4 in tensorrt_llm::runtime::TllmRuntime::executeContext(int) const () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#6  0x00007fa54c08808b in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeContext(int, int) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#7  0x00007fa54c0970b6 in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeStep(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, int) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#8  0x00007fa54c0978de in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(tensorrt_llm::batch_manager::ScheduledRequests const&) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#9  0x00007fa54c09801b in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#10 0x00007fa54c12d4a1 in tensorrt_llm::executor::Executor::Impl::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#11 0x00007fa54c1345bc in tensorrt_llm::executor::Executor::Impl::executionLoop() () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#12 0x00007fa535a05930 in execute_native_thread_routine () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/x86_64-linux-gnu/libtensorrt_llm_nvrtc_wrapper.so
#13 0x00007fa4e909c1ca in start_thread () from /lib64/libpthread.so.0
#14 0x00007fa4e83c88d3 in clone () from /lib64/libc.so.6

rank2

Thread 16 (Thread 0x7ef99effd000 (LWP 1705705) "executionLoop"):
#0  0x00007f050ddfb41d in syscall () from /lib64/libc.so.6
#1  0x00007ef90905c730 in ?? () from /lib64/libcudadebugger.so.1
#2  0x00007ef908bf5f8c in ?? () from /lib64/libcudadebugger.so.1
#3  0x00007ef908d11bd6 in ?? () from /lib64/libcudadebugger.so.1
#4  0x00007f0510775d76 in ?? () from /lib64/libcuda.so.1
#5  0x00007f058d05c785 in cudaStreamQuery () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so
#6  0x00007f058cfd143b in tensorrt_llm::plugins::AllgatherPlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so
#7  0x00007f05227ccfec in ?? () from /usr/local/tensorrt/targets/x86_64-linux-gnu/lib/libnvinfer.so.10
#8  0x00007f052273da55 in ?? () from /usr/local/tensorrt/targets/x86_64-linux-gnu/lib/libnvinfer.so.10
#9  0x00007f052273f609 in ?? () from /usr/local/tensorrt/targets/x86_64-linux-gnu/lib/libnvinfer.so.10
#10 0x00007f05716cf3a4 in tensorrt_llm::runtime::TllmRuntime::executeContext(int) const () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#11 0x00007f0571abb08b in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeContext(int, int) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#12 0x00007f0571aca0b6 in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeStep(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, int) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#13 0x00007f0571aca8de in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(tensorrt_llm::batch_manager::ScheduledRequests const&) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#14 0x00007f0571acb01b in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#15 0x00007f0571b604a1 in tensorrt_llm::executor::Executor::Impl::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#16 0x00007f0571b675bc in tensorrt_llm::executor::Executor::Impl::executionLoop() () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#17 0x00007f055b438930 in execute_native_thread_routine () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/x86_64-linux-gnu/libtensorrt_llm_nvrtc_wrapper.so
#18 0x00007f050eacf1ca in start_thread () from /lib64/libpthread.so.0
#19 0x00007f050ddfb8d3 in clone () from /lib64/libc.so.6

rank3

Thread 16 (Thread 0x7f083d7fe000 (LWP 1705707) "executionLoop"):
#0  0x00007f13b2a0541d in syscall () from /lib64/libc.so.6
#1  0x00007f07ca5bf730 in ?? () from /lib64/libcudadebugger.so.1
#2  0x00007f07ca1585ec in ?? () from /lib64/libcudadebugger.so.1
#3  0x00007f07ca274bd6 in ?? () from /lib64/libcudadebugger.so.1
#4  0x00007f13b524121f in ?? () from /lib64/libcuda.so.1
#5  0x00007f1431c02db5 in libcudart_static_7304beb15f17907bccdc6b7603b23d2a2951bc68 () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so
#6  0x00007f1431c6674d in cudaStreamQuery () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so
#7  0x00007f1431bdb43b in tensorrt_llm::plugins::AllgatherPlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so
#8  0x00007f13c73d6fec in ?? () from /usr/local/tensorrt/targets/x86_64-linux-gnu/lib/libnvinfer.so.10
#9  0x00007f13c7347a55 in ?? () from /usr/local/tensorrt/targets/x86_64-linux-gnu/lib/libnvinfer.so.10
#10 0x00007f13c7349609 in ?? () from /usr/local/tensorrt/targets/x86_64-linux-gnu/lib/libnvinfer.so.10
#11 0x00007f14162d93a4 in tensorrt_llm::runtime::TllmRuntime::executeContext(int) const () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#12 0x00007f14166c508b in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeContext(int, int) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#13 0x00007f14166d40b6 in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeStep(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, int) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#14 0x00007f14166d48de in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(tensorrt_llm::batch_manager::ScheduledRequests const&) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#15 0x00007f14166d501b in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#16 0x00007f141676a4a1 in tensorrt_llm::executor::Executor::Impl::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#17 0x00007f14167715bc in tensorrt_llm::executor::Executor::Impl::executionLoop() () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#18 0x00007f1400042930 in execute_native_thread_routine () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/x86_64-linux-gnu/libtensorrt_llm_nvrtc_wrapper.so
#19 0x00007f13b36d91ca in start_thread () from /lib64/libpthread.so.0
#20 0x00007f13b2a058d3 in clone () from /lib64/libc.so.6

from NVIDIA/TensorRT-LLM#2560 (comment)

My questions are simple:

Why is rank0 stuck inside a CUDA kernel?
Why are rank{1-3} stuck inside ncclStreamSynchronize with no active kernels running?

Is this related to #311 (comment)?


akhoroshev · 2024-12-26T06:50:20Z

--- a/cpp/tensorrt_llm/plugins/ncclPlugin/allgatherPlugin.cpp
+++ b/cpp/tensorrt_llm/plugins/ncclPlugin/allgatherPlugin.cpp
@@ -23,6 +23,34 @@ using namespace nvinfer1;
 using tensorrt_llm::plugins::AllgatherPluginCreator;
 using tensorrt_llm::plugins::AllgatherPlugin;
 
+namespace {
+    void ncclStreamSynchronize(cudaStream_t stream, ncclComm_t comm) {
+        cudaError_t cudaErr;
+        ncclResult_t ncclErr, ncclAsyncErr;
+        while (1) {
+            cudaErr = cudaStreamQuery(stream);
+            if (cudaErr == cudaSuccess)
+                return;
+
+            if (cudaErr != cudaErrorNotReady) {
+                TLLM_CUDA_CHECK(cudaErr);
+            }
+
+            ncclErr = ncclCommGetAsyncError(comm, &ncclAsyncErr);
+            if (ncclErr != ncclSuccess) {
+                NCCLCHECK(ncclErr);
+            }
+
+            if (ncclAsyncErr != ncclSuccess) {
+                NCCLCHECK(ncclAsyncErr);
+            }
+
+            // We might want to let other threads (including NCCL threads) use the CPU.
+            std::this_thread::yield();
+        }
+    }
+}
+
 static char const* ALLGATHER_PLUGIN_VERSION{"1"};
 static char const* ALLGATHER_PLUGIN_NAME{"AllGather"};
 PluginFieldCollection AllgatherPluginCreator::mFC{};
@@ -101,8 +129,14 @@ int AllgatherPlugin::enqueue(nvinfer1::PluginTensorDesc const* inputDesc, nvinfe
         size *= inputDesc[0].dims.d[i];
     }
 
+    TLLM_LOG_WARNING("rank: %d start enqueue", COMM_SESSION.getRank());
+    ncclStreamSynchronize(stream, *mNcclComm);
+    TLLM_LOG_WARNING("rank: %d start ncclAllGather, size: %zu, dtype: %d", COMM_SESSION.getRank(), size, static_cast<int>(inputDesc[0].type));
     TLLM_CHECK_WITH_INFO(mNcclComm.get() != nullptr, "mNcclComm should be initialized before used");
     NCCLCHECK(ncclAllGather(inputs[0], outputs[0], size, (*getDtypeMap())[inputDesc[0].type], *mNcclComm, stream));
+    TLLM_LOG_WARNING("rank: %d end ncclAllGather", COMM_SESSION.getRank());
+    ncclStreamSynchronize(stream, *mNcclComm);
+    TLLM_LOG_WARNING("rank: %d end enqueue", COMM_SESSION.getRank());
 
     return 0;
 }
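
The helper in the diff above loops until the stream completes or an error is reported, so when the collective never finishes the host thread spins indefinitely. As a minimal sketch (not part of the patch), the same loop can be bounded by a deadline so that a hang surfaces as a result the caller can act on. `PollStatus`, `WaitResult`, `pollWithDeadline` and the `query` callback are hypothetical names introduced here; in the plugin the callback would presumably wrap `cudaStreamQuery(stream)` and `ncclCommGetAsyncError(comm, &err)`.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical outcome of one poll; stands in for the cudaStreamQuery /
// ncclCommGetAsyncError pair used in the patch.
enum class PollStatus { Done, NotReady, Failed };

// Hypothetical overall outcome of the bounded wait.
enum class WaitResult { Completed, Failed, TimedOut };

// Polls `query` until it reports Done or Failed, or until `timeout` has elapsed.
// `query` is an assumption of this sketch, not an existing TensorRT-LLM or NCCL API.
inline WaitResult pollWithDeadline(std::function<PollStatus()> const& query,
                                   std::chrono::milliseconds timeout)
{
    assert(timeout.count() >= 0);
    auto const deadline = std::chrono::steady_clock::now() + timeout;
    while (true)
    {
        switch (query())
        {
        case PollStatus::Done: return WaitResult::Completed;
        case PollStatus::Failed: return WaitResult::Failed;
        case PollStatus::NotReady: break; // leaves the switch, keeps looping
        }
        if (std::chrono::steady_clock::now() >= deadline)
        {
            return WaitResult::TimedOut;
        }
        // As in the patch: let other threads (including NCCL threads) use the CPU.
        std::this_thread::yield();
    }
}
```

On `WaitResult::TimedOut` a caller could log the rank and decide how to recover, for example by calling `ncclCommAbort` on the communicator; that policy is left out of the sketch.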

Hang logs

...
[TensorRT-LLM][WARNING] rank: 1 start enqueue
[TensorRT-LLM][WARNING] rank: 2 start enqueue
[TensorRT-LLM][WARNING] rank: 3 start enqueue
[TensorRT-LLM][WARNING] rank: 0 start enqueue
[TensorRT-LLM][WARNING] rank: 3 start ncclAllGather, size: 1955904, dtype: 1
[TensorRT-LLM][WARNING] rank: 3 end ncclAllGather
[TensorRT-LLM][WARNING] rank: 2 start ncclAllGather, size: 1955904, dtype: 1
[TensorRT-LLM][WARNING] rank: 2 end ncclAllGather
[TensorRT-LLM][WARNING] rank: 1 start ncclAllGather, size: 1955904, dtype: 1
[TensorRT-LLM][WARNING] rank: 1 end ncclAllGather
[TensorRT-LLM][WARNING] rank: 0 start ncclAllGather, size: 1955904, dtype: 1
[TensorRT-LLM][WARNING] rank: 0 end ncclAllGather
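
Read per rank, the last message from every rank in the log above is `end ncclAllGather`: all four ranks enqueued the collective with the same size and dtype, and none logged `end enqueue`. A small sketch of that check, assuming the line format produced by the `TLLM_LOG_WARNING` calls in the patch; `lastEventPerRank` and `ranksNotFinished` are hypothetical helper names.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Returns the last event logged by each rank, e.g. {0: "end ncclAllGather"}.
// Expected line format (from the instrumented enqueue):
//   "[TensorRT-LLM][WARNING] rank: <N> <event>[, size: ..., dtype: ...]"
inline std::map<int, std::string> lastEventPerRank(std::vector<std::string> const& lines)
{
    std::map<int, std::string> last;
    std::string const marker = "rank: ";
    for (auto const& line : lines)
    {
        auto const pos = line.find(marker);
        if (pos == std::string::npos)
        {
            continue; // not one of the instrumented lines
        }
        std::istringstream rest(line.substr(pos + marker.size()));
        int rank = 0;
        if (!(rest >> rank))
        {
            continue; // no rank number after the marker
        }
        std::string event;
        std::getline(rest, event);
        auto const begin = event.find_first_not_of(' ');
        if (begin == std::string::npos)
        {
            continue; // nothing after the rank number
        }
        // Keep only the event name: drop any ", size: ..., dtype: ..." suffix.
        auto const comma = event.find(',');
        auto const end = (comma == std::string::npos) ? event.size() : comma;
        last[rank] = event.substr(begin, end - begin);
    }
    return last;
}

// Ranks whose last logged event is not "end enqueue", i.e. still inside enqueue.
inline std::vector<int> ranksNotFinished(std::map<int, std::string> const& last)
{
    std::vector<int> stuck;
    for (auto const& entry : last)
    {
        if (entry.second != "end enqueue")
        {
            stuck.push_back(entry.first);
        }
    }
    return stuck;
}
```

Fed the twelve lines above, `ranksNotFinished` would list all four ranks, which suggests the hang is in the wait after `ncclAllGather` rather than in a rank that never reached the collective.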

cuda-gdb rank0

(cuda-gdb) info cuda kernels  
  Kernel Parent Dev Grid Status                             SMs Mask  GridDim  BlockDim Invocation                        
*      0      -   0 4347595797 Active 0x000000000000000000000000000fff0fff (24,1,1) (544,1,1) ncclDevKernel_AllGather_RING_LL() 
(cuda-gdb) bt                 
#0  0x00007fce839d3b40 in ncclDevFunc_AllGather_RING_SIMPLE() ()
#1  0x00007fc563fcd000 in ncclDevKernel_AllGather_RING_LL(ncclDevKernelArgsStorage<4096ul>)<<<(24,1,1),(544,1,1)>>> ()
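
One detail worth reading off the `info cuda kernels` line: the SMs mask has as many bits set as the grid has blocks (GridDim is (24,1,1)), which is consistent with every block of the kernel being resident on its own SM rather than some blocks still waiting to be scheduled. This is an interpretation of the output, not something cuda-gdb states. A small sketch of the arithmetic, with `countSetBits` as a hypothetical helper:

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Counts the set bits in a hexadecimal mask string such as the "SMs Mask"
// column of `info cuda kernels`. An optional "0x" prefix is skipped.
// Returns -1 if the string contains a non-hexadecimal character.
inline int countSetBits(std::string const& hexMask)
{
    std::size_t start = 0;
    if (hexMask.size() >= 2 && hexMask[0] == '0' && (hexMask[1] == 'x' || hexMask[1] == 'X'))
    {
        start = 2;
    }
    int bits = 0;
    for (std::size_t i = start; i < hexMask.size(); ++i)
    {
        unsigned char const c = static_cast<unsigned char>(hexMask[i]);
        if (!std::isxdigit(c))
        {
            return -1;
        }
        int const nibble = std::isdigit(c) ? c - '0' : std::tolower(c) - 'a' + 10;
        // Add up the bits of this 4-bit digit.
        for (int v = nibble; v != 0; v >>= 1)
        {
            bits += v & 1;
        }
    }
    return bits;
}
```

For the mask shown above, `countSetBits("0x000000000000000000000000000fff0fff")` gives 24.
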
(cuda-gdb) thread apply all bt

Thread 20 (Thread 0x7fc4affff000 (LWP 3318470) "executor_server"):
#0  0x00007fd0b01b8307 in epoll_wait () from /lib64/libc.so.6
#1  0x00007fc4f8578984 in ?? () from /lib64/libcudadebugger.so.1
#2  0x00007fc4f8579462 in ?? () from /lib64/libcudadebugger.so.1
#3  0x00007fc4f8577610 in ?? () from /lib64/libcudadebugger.so.1
#4  0x00007fc4f8580aba in ?? () from /lib64/libcudadebugger.so.1
#5  0x00007fc4f8580fbc in ?? () from /lib64/libcudadebugger.so.1
#6  0x00007fc4f8466020 in ?? () from /lib64/libcudadebugger.so.1
#7  0x00007fc4f85c0997 in ?? () from /lib64/libcudadebugger.so.1
#8  0x00007fd0b0d861ca in start_thread () from /lib64/libpthread.so.0
#9  0x00007fd0b00b28d3 in clone () from /lib64/libc.so.6

Thread 18 (Thread 0x7fc539fff000 (LWP 3913243) "executor_server"):
#0  0x00007fd0b0d8c48c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fd0b0aa5870 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
#2  0x00000000004370b6 in writerWorker(modules::util::FileDescriptor, std::shared_ptr<modules::util::MTQueue<modules::executor_server::Response> >) ()
#3  0x0000000000440836 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (*)(modules::util::FileDescriptor, std::shared_ptr<modules::util::MTQueue<modules::executor_server::Response> >), modules::util::FileDescriptor, 
std::shared_ptr<modules::util::MTQueue<modules::executor_server::Response> > > > >::_M_run() ()
#4  0x00007fd0fd6ef930 in execute_native_thread_routine () from 
/home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/x86_64-linux-gnu/libtensorrt_llm_nvrtc_wrapper.so
#5  0x00007fd0b0d861ca in start_thread () from /lib64/libpthread.so.0
#6  0x00007fd0b00b28d3 in clone () from /lib64/libc.so.6

Thread 17 (Thread 0x7fc5408fc000 (LWP 3913242) "executor_server"):
#0  0x00007fd0b0d8c7da in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fd113e0fbff in tensorrt_llm::executor::Executor::Impl::awaitResponses(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > > const&) () from 
/home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#2  0x00007fd113dfffdd in tensorrt_llm::executor::Executor::awaitResponses(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > > const&) () from 
/home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#3  0x0000000000466f18 in modules::executor_server::Executor::ExecutorImpl::awaitRoutine(std::stop_token) ()
#4  0x0000000000468100 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<modules::executor_server::Executor::ExecutorImpl::ExecutorImpl(modules::executor_server::MainRank, std::filesystem::__cxx11::path const&, 
modules::executor_server::ExecutorConfig const&, modules::executor_server::LogitsProcessorStaticConfig const&)::{lambda(std::stop_token)#1}, std::stop_token> > >::_M_run() ()
#5  0x00007fd0fd6ef930 in execute_native_thread_routine () from 
/home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/x86_64-linux-gnu/libtensorrt_llm_nvrtc_wrapper.so
#6  0x00007fd0b0d861ca in start_thread () from /lib64/libpthread.so.0
#7  0x00007fd0b00b28d3 in clone () from /lib64/libc.so.6

Thread 16 (Thread 0x7fc5410fd000 (LWP 3913240) "executionLoop"):
#0  0x00007fd0b00b241d in syscall () from /lib64/libc.so.6
#1  0x00007fc4f85bf730 in ?? () from /lib64/libcudadebugger.so.1
#2  0x00007fc4f81585ec in ?? () from /lib64/libcudadebugger.so.1
#3  0x00007fc4f8274bd6 in ?? () from /lib64/libcudadebugger.so.1
#4  0x00007fd0b28ee21f in ?? () from /lib64/libcuda.so.1
#5  0x00007fd12f2b0135 in libcudart_static_7304beb15f17907bccdc6b7603b23d2a2951bc68 () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so
#6  0x00007fd12f313acd in cudaStreamQuery () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so
#7  0x00007fd12f28752b in (anonymous namespace)::ncclStreamSynchronize(CUstream_st*, ncclComm*) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so
#8  0x00007fd12f28853c in tensorrt_llm::plugins::AllgatherPlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) () from 
/home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so
#9  0x00007fd0c4a83fec in ?? () from /usr/local/tensorrt/targets/x86_64-linux-gnu/lib/libnvinfer.so.10
#10 0x00007fd0c49f4a55 in ?? () from /usr/local/tensorrt/targets/x86_64-linux-gnu/lib/libnvinfer.so.10
#11 0x00007fd0c49f6609 in ?? () from /usr/local/tensorrt/targets/x86_64-linux-gnu/lib/libnvinfer.so.10
#12 0x00007fd1139863a4 in tensorrt_llm::runtime::TllmRuntime::executeContext(int) const () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#13 0x00007fd113d7208b in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeContext(int, int) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#14 0x00007fd113d810b6 in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeStep(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > 
> const&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, int) () from 
/home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#15 0x00007fd113d818de in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(tensorrt_llm::batch_manager::ScheduledRequests const&) () from 
/home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#16 0x00007fd113d8201b in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, 
std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#17 0x00007fd113e174a1 in tensorrt_llm::executor::Executor::Impl::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) () 
from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#18 0x00007fd113e1e5bc in tensorrt_llm::executor::Executor::Impl::executionLoop() () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#19 0x00007fd0fd6ef930 in execute_native_thread_routine () from 
/home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/x86_64-linux-gnu/libtensorrt_llm_nvrtc_wrapper.so
#20 0x00007fd0b0d861ca in start_thread () from /lib64/libpthread.so.0
#21 0x00007fd0b00b28d3 in clone () from /lib64/libc.so.6

Thread 15 (Thread 0x7fc5418fe000 (LWP 3913237) "dataTransResp"):
#0  0x00007fd0b0d8c48c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fd0b0aa5870 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
#2  0x00007fd113cfba6b in tensorrt_llm::batch_manager::DataResponder::Impl::response() () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#3  0x00007fd113cf9c3d in std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, 
std::__future_base::_Result_base::_Deleter>, std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)(), tensorrt_llm::batch_manager::DataResponder::Impl*> >, void> >::_M_invoke(std::_Any_data const&) () 
from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#4  0x000000000043821b in std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*) ()
#5  0x00007fd0b0d8de97 in __pthread_once_slow () from /lib64/libpthread.so.0
#6  0x00007fd113cfaa8d in std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)(), 
tensorrt_llm::batch_manager::DataResponder::Impl*> >, void>::_Async_state_impl(std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)(), tensorrt_llm::batch_manager::DataResponder::Impl*> 
>&&)::{lambda()#1}> > >::_M_run() () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#7  0x00007fd0fd6ef930 in execute_native_thread_routine () from 
/home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/x86_64-linux-gnu/libtensorrt_llm_nvrtc_wrapper.so
#8  0x00007fd0b0d861ca in start_thread () from /lib64/libpthread.so.0
#9  0x00007fd0b00b28d3 in clone () from /lib64/libc.so.6

Thread 14 (Thread 0x7fc564ba8000 (LWP 3913229) "executor_server"):
#0  0x00007fd0b0d8c48c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fd0b4847727 in ncclProxyProgress(void*) () from /lib64/libnccl.so.2
#2  0x00007fd0b0d861ca in start_thread () from /lib64/libpthread.so.0
#3  0x00007fd0b00b28d3 in clone () from /lib64/libc.so.6

Thread 13 (Thread 0x7fc5653a9000 (LWP 3913226) "executor_server"):
#0  0x00007fd0b01abac1 in poll () from /lib64/libc.so.6
#1  0x00007fd0b484647a in ncclProxyServiceUDS(void*) () from /lib64/libnccl.so.2
#2  0x00007fd0b0d861ca in start_thread () from /lib64/libpthread.so.0
#3  0x00007fd0b00b28d3 in clone () from /lib64/libc.so.6

Thread 12 (Thread 0x7fce88e67000 (LWP 3913225) "executor_server"):
#0  0x00007fd0b01abac1 in poll () from /lib64/libc.so.6
#1  0x00007fd0b4847dc2 in ncclProxyService(void*) () from /lib64/libnccl.so.2
#2  0x00007fd0b0d861ca in start_thread () from /lib64/libpthread.so.0
#3  0x00007fd0b00b28d3 in clone () from /lib64/libc.so.6

Thread 11 (Thread 0x7fc574ffd000 (LWP 3913202) "executor_server"):
#0  0x00007fd0b0d8fae4 in read () from /lib64/libpthread.so.0
#1  0x00007fd09e885fa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007fd0b48780d2 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /lib64/libnccl.so.2
#3  0x00007fd0b489ac04 in ncclIbAsyncThreadMain(void*) () from /lib64/libnccl.so.2
#4  0x00007fd0b0d861ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007fd0b00b28d3 in clone () from /lib64/libc.so.6

Thread 10 (Thread 0x7fc5757fe000 (LWP 3913199) "executor_server"):
#0  0x00007fd0b0d8fae4 in read () from /lib64/libpthread.so.0
#1  0x00007fd09e885fa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007fd0b48780d2 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /lib64/libnccl.so.2
#3  0x00007fd0b489ac04 in ncclIbAsyncThreadMain(void*) () from /lib64/libnccl.so.2
#4  0x00007fd0b0d861ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007fd0b00b28d3 in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x7fc575fff000 (LWP 3913197) "executor_server"):
#0  0x00007fd0b0d8fae4 in read () from /lib64/libpthread.so.0
#1  0x00007fd09e885fa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007fd0b48780d2 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /lib64/libnccl.so.2
#3  0x00007fd0b489ac04 in ncclIbAsyncThreadMain(void*) () from /lib64/libnccl.so.2
#4  0x00007fd0b0d861ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007fd0b00b28d3 in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x7fc9e6e05000 (LWP 3913195) "executor_server"):
#0  0x00007fd0b0d8fae4 in read () from /lib64/libpthread.so.0
#1  0x00007fd09e885fa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007fd0b48780d2 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /lib64/libnccl.so.2
#3  0x00007fd0b489ac04 in ncclIbAsyncThreadMain(void*) () from /lib64/libnccl.so.2
#4  0x00007fd0b0d861ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007fd0b00b28d3 in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7fc9e7606000 (LWP 3913189) "executor_server"):
#0  0x00007fd0b0d8fae4 in read () from /lib64/libpthread.so.0
#1  0x00007fd09e885fa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007fd0b48780d2 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /lib64/libnccl.so.2
#3  0x00007fd0b489ac04 in ncclIbAsyncThreadMain(void*) () from /lib64/libnccl.so.2
#4  0x00007fd0b0d861ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007fd0b00b28d3 in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7fc9e7e07000 (LWP 3913187) "executor_server"):
#0  0x00007fd0b0d8fae4 in read () from /lib64/libpthread.so.0
#1  0x00007fd09e885fa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007fd0b48780d2 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /lib64/libnccl.so.2
#3  0x00007fd0b489ac04 in ncclIbAsyncThreadMain(void*) () from /lib64/libnccl.so.2
#4  0x00007fd0b0d861ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007fd0b00b28d3 in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7fce8b55e000 (LWP 3913173) "cuda-EvtHandlr"):
#0  0x00007fd0b0d8c48c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fc4f85794ba in ?? () from /lib64/libcudadebugger.so.1
#2  0x00007fc4f8577610 in ?? () from /lib64/libcudadebugger.so.1
#3  0x00007fc4f818e225 in ?? () from /lib64/libcudadebugger.so.1
#4  0x00007fc4f81e2961 in ?? () from /lib64/libcudadebugger.so.1
#5  0x00007fc4f8173a9f in ?? () from /lib64/libcudadebugger.so.1
#6  0x00007fc4f8274bd6 in ?? () from /lib64/libcudadebugger.so.1
#7  0x00007fd0b2a9e42b in ?? () from /lib64/libcuda.so.1
#8  0x00007fd0b293b184 in ?? () from /lib64/libcuda.so.1
#9  0x00007fd0b28640f3 in ?? () from /lib64/libcuda.so.1
#10 0x00007fd0b0d861ca in start_thread () from /lib64/libpthread.so.0
#11 0x00007fd0b00b28d3 in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7fd09a759000 (LWP 3913167) "cuda00006000019"):
#0  0x00007fd0b01abac1 in poll () from /lib64/libc.so.6
#1  0x00007fd0b2867e4f in ?? () from /lib64/libcuda.so.1
#2  0x00007fd0b293b12f in ?? () from /lib64/libcuda.so.1
#3  0x00007fd0b28640f3 in ?? () from /lib64/libcuda.so.1
#4  0x00007fd0b0d861ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007fd0b00b28d3 in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7fd0a658a000 (LWP 3913159) "executor_server"):
#0  0x00007fd0b01b8307 in epoll_wait () from /lib64/libc.so.6
#1  0x00007fd0aebcf509 in ?? () from /lib64/libevent_core-2.1.so.6
#2  0x00007fd0aebc51e8 in event_base_loop () from /lib64/libevent_core-2.1.so.6
#3  0x00007fd0ac7f9276 in ?? () from /lib64/libpmix.so.2
#4  0x00007fd0b0d861ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007fd0b00b28d3 in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7fd0ad491000 (LWP 3913155) "executor_server"):
#0  0x00007fd0b01abac1 in poll () from /lib64/libc.so.6
#1  0x00007fd0aebce8e9 in ?? () from /lib64/libevent_core-2.1.so.6
#2  0x00007fd0aebc51e8 in event_base_loop () from /lib64/libevent_core-2.1.so.6
#3  0x00007fd0af265d96 in ?? () from /usr/lib64/openmpi/lib/libopen-pal.so.40
#4  0x00007fd0b0d861ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007fd0b00b28d3 in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7fd134703000 (LWP 3913149) "executor_server"):
#0  0x00007fd0b0d8fae4 in read () from /lib64/libpthread.so.0
#1  0x0000000000543b1a in google::protobuf::io::FileInputStream::CopyingFileInputStream::Read(void*, int) ()
#2  0x0000000000548812 in google::protobuf::io::CopyingInputStreamAdaptor::Next(void const**, int*) ()
#3  0x00000000005418a7 in google::protobuf::io::CodedInputStream::Refresh() [clone .part.0] ()
#4  0x000000000043674e in readerWorker(modules::util::FileDescriptor, modules::executor_server::Executor&, std::shared_ptr<modules::util::LoraLoader>, unsigned long, std::function<void (modules::executor_server::Response&&)> const&) ()
#5  0x000000000042d3bb in main ()

cuda-gdb rank1

(cuda-gdb) info cuda kernels
No CUDA kernels.
(cuda-gdb) thread apply all bt

Thread 19 (Thread 0x7f75a7fff000 (LWP 3273909) "executor_server"):
#0  0x00007f81cb05c307 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f75c2578984 in ?? () from /lib64/libcudadebugger.so.1
#2  0x00007f75c2579462 in ?? () from /lib64/libcudadebugger.so.1
#3  0x00007f75c2577610 in ?? () from /lib64/libcudadebugger.so.1
#4  0x00007f75c2580aba in ?? () from /lib64/libcudadebugger.so.1
#5  0x00007f75c2580fbc in ?? () from /lib64/libcudadebugger.so.1
#6  0x00007f75c2466020 in ?? () from /lib64/libcudadebugger.so.1
#7  0x00007f75c25c0997 in ?? () from /lib64/libcudadebugger.so.1
#8  0x00007f81cbc2a1ca in start_thread () from /lib64/libpthread.so.0
#9  0x00007f81caf568d3 in clone () from /lib64/libc.so.6

Thread 17 (Thread 0x7f76417fe000 (LWP 3913348) "cuda-EvtHandlr"):
#0  0x00007f81cbc3048c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f75c25794ba in ?? () from /lib64/libcudadebugger.so.1
#2  0x00007f75c2577610 in ?? () from /lib64/libcudadebugger.so.1
#3  0x00007f75c218e225 in ?? () from /lib64/libcudadebugger.so.1
#4  0x00007f75c21e2961 in ?? () from /lib64/libcudadebugger.so.1
#5  0x00007f75c2173a9f in ?? () from /lib64/libcudadebugger.so.1
#6  0x00007f75c2274bd6 in ?? () from /lib64/libcudadebugger.so.1
#7  0x00007f81cd94242b in ?? () from /lib64/libcuda.so.1
#8  0x00007f81cd7df184 in ?? () from /lib64/libcuda.so.1
#9  0x00007f81cd7080f3 in ?? () from /lib64/libcuda.so.1
#10 0x00007f81cbc2a1ca in start_thread () from /lib64/libpthread.so.0
#11 0x00007f81caf568d3 in clone () from /lib64/libc.so.6

Thread 16 (Thread 0x7f7641fff000 (LWP 3913241) "executionLoop"):
#0  0x00007f81caf5641d in syscall () from /lib64/libc.so.6
#1  0x00007f75c25bf730 in ?? () from /lib64/libcudadebugger.so.1
#2  0x00007f75c21585ec in ?? () from /lib64/libcudadebugger.so.1
#3  0x00007f75c2274bd6 in ?? () from /lib64/libcudadebugger.so.1
#4  0x00007f81cd79221f in ?? () from /lib64/libcuda.so.1
#5  0x00007f824a154135 in libcudart_static_7304beb15f17907bccdc6b7603b23d2a2951bc68 () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so
#6  0x00007f824a1b7acd in cudaStreamQuery () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so
#7  0x00007f824a12b52b in (anonymous namespace)::ncclStreamSynchronize(CUstream_st*, ncclComm*) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so
#8  0x00007f824a12c53c in tensorrt_llm::plugins::AllgatherPlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) () from 
/home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so
#9  0x00007f81df927fec in ?? () from /usr/local/tensorrt/targets/x86_64-linux-gnu/lib/libnvinfer.so.10
#10 0x00007f81df898a55 in ?? () from /usr/local/tensorrt/targets/x86_64-linux-gnu/lib/libnvinfer.so.10
#11 0x00007f81df89a609 in ?? () from /usr/local/tensorrt/targets/x86_64-linux-gnu/lib/libnvinfer.so.10
#12 0x00007f822e82a3a4 in tensorrt_llm::runtime::TllmRuntime::executeContext(int) const () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#13 0x00007f822ec1608b in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeContext(int, int) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#14 0x00007f822ec250b6 in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeStep(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > 
> const&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, int) () from 
/home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#15 0x00007f822ec258de in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(tensorrt_llm::batch_manager::ScheduledRequests const&) () from 
/home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#16 0x00007f822ec2601b in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, 
std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#17 0x00007f822ecbb4a1 in tensorrt_llm::executor::Executor::Impl::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) () 
from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#18 0x00007f822ecc25bc in tensorrt_llm::executor::Executor::Impl::executionLoop() () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#19 0x00007f8218593930 in execute_native_thread_routine () from 
/home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/x86_64-linux-gnu/libtensorrt_llm_nvrtc_wrapper.so
#20 0x00007f81cbc2a1ca in start_thread () from /lib64/libpthread.so.0
#21 0x00007f81caf568d3 in clone () from /lib64/libc.so.6

Thread 15 (Thread 0x7f76609f8000 (LWP 3913235) "dataTransResp"):
#0  0x00007f81cbc3048c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f81cb949870 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
#2  0x00007f822eb9fa6b in tensorrt_llm::batch_manager::DataResponder::Impl::response() () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#3  0x00007f822eb9dc3d in std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, 
std::__future_base::_Result_base::_Deleter>, std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)(), tensorrt_llm::batch_manager::DataResponder::Impl*> >, void> >::_M_invoke(std::_Any_data const&) () 
from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#4  0x000000000043821b in std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*) ()
#5  0x00007f81cbc31e97 in __pthread_once_slow () from /lib64/libpthread.so.0
#6  0x00007f822eb9ea8d in std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)(), 
tensorrt_llm::batch_manager::DataResponder::Impl*> >, void>::_Async_state_impl(std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)(), tensorrt_llm::batch_manager::DataResponder::Impl*> 
>&&)::{lambda()#1}> > >::_M_run() () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#7  0x00007f8218593930 in execute_native_thread_routine () from 
/home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/x86_64-linux-gnu/libtensorrt_llm_nvrtc_wrapper.so
#8  0x00007f81cbc2a1ca in start_thread () from /lib64/libpthread.so.0
#9  0x00007f81caf568d3 in clone () from /lib64/libc.so.6

Thread 14 (Thread 0x7f76618fa000 (LWP 3913228) "executor_server"):
#0  0x00007f81cbc3048c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f81cf6eb727 in ncclProxyProgress(void*) () from /lib64/libnccl.so.2
#2  0x00007f81cbc2a1ca in start_thread () from /lib64/libpthread.so.0
#3  0x00007f81caf568d3 in clone () from /lib64/libc.so.6

Thread 13 (Thread 0x7f76627fc000 (LWP 3913222) "executor_server"):
#0  0x00007f81cb04fac1 in poll () from /lib64/libc.so.6
#1  0x00007f81cf6ea47a in ncclProxyServiceUDS(void*) () from /lib64/libnccl.so.2
#2  0x00007f81cbc2a1ca in start_thread () from /lib64/libpthread.so.0
#3  0x00007f81caf568d3 in clone () from /lib64/libc.so.6

Thread 12 (Thread 0x7f7662ffd000 (LWP 3913220) "executor_server"):
#0  0x00007f81cb04fac1 in poll () from /lib64/libc.so.6
#1  0x00007f81cf6ebdc2 in ncclProxyService(void*) () from /lib64/libnccl.so.2
#2  0x00007f81cbc2a1ca in start_thread () from /lib64/libpthread.so.0
#3  0x00007f81caf568d3 in clone () from /lib64/libc.so.6

Thread 11 (Thread 0x7f76637fe000 (LWP 3913204) "executor_server"):
#0  0x00007f81cbc33ae4 in read () from /lib64/libpthread.so.0
#1  0x00007f81bd827fa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007f81cf71c0d2 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /lib64/libnccl.so.2
#3  0x00007f81cf73ec04 in ncclIbAsyncThreadMain(void*) () from /lib64/libnccl.so.2
#4  0x00007f81cbc2a1ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f81caf568d3 in clone () from /lib64/libc.so.6

Thread 10 (Thread 0x7f7663fff000 (LWP 3913200) "executor_server"):
#0  0x00007f81cbc33ae4 in read () from /lib64/libpthread.so.0
#1  0x00007f81bd827fa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007f81cf71c0d2 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /lib64/libnccl.so.2
#3  0x00007f81cf73ec04 in ncclIbAsyncThreadMain(void*) () from /lib64/libnccl.so.2
#4  0x00007f81cbc2a1ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f81caf568d3 in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x7f7686ba8000 (LWP 3913193) "executor_server"):
#0  0x00007f81cbc33ae4 in read () from /lib64/libpthread.so.0
#1  0x00007f81bd827fa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007f81cf71c0d2 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /lib64/libnccl.so.2
#3  0x00007f81cf73ec04 in ncclIbAsyncThreadMain(void*) () from /lib64/libnccl.so.2
#4  0x00007f81cbc2a1ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f81caf568d3 in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x7f76873a9000 (LWP 3913191) "executor_server"):
#0  0x00007f81cbc33ae4 in read () from /lib64/libpthread.so.0
#1  0x00007f81bd827fa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007f81cf71c0d2 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /lib64/libnccl.so.2
#3  0x00007f81cf73ec04 in ncclIbAsyncThreadMain(void*) () from /lib64/libnccl.so.2
#4  0x00007f81cbc2a1ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f81caf568d3 in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7f7b02d31000 (LWP 3913190) "executor_server"):
#0  0x00007f81cbc33ae4 in read () from /lib64/libpthread.so.0
#1  0x00007f81bd827fa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007f81cf71c0d2 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /lib64/libnccl.so.2
#3  0x00007f81cf73ec04 in ncclIbAsyncThreadMain(void*) () from /lib64/libnccl.so.2
#4  0x00007f81cbc2a1ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f81caf568d3 in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7f7b03532000 (LWP 3913188) "executor_server"):
#0  0x00007f81cbc33ae4 in read () from /lib64/libpthread.so.0
#1  0x00007f81bd827fa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007f81cf71c0d2 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /lib64/libnccl.so.2
#3  0x00007f81cf73ec04 in ncclIbAsyncThreadMain(void*) () from /lib64/libnccl.so.2
#4  0x00007f81cbc2a1ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f81caf568d3 in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7f7fa62ce000 (LWP 3913177) "cuda-EvtHandlr"):
#0  0x00007f81cb04fac1 in poll () from /lib64/libc.so.6
#1  0x00007f81cd70be4f in ?? () from /lib64/libcuda.so.1
#2  0x00007f81cd7df12f in ?? () from /lib64/libcuda.so.1
#3  0x00007f81cd7080f3 in ?? () from /lib64/libcuda.so.1
#4  0x00007f81cbc2a1ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f81caf568d3 in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f81b555b000 (LWP 3913170) "cuda00006000019"):
#0  0x00007f81cb04fac1 in poll () from /lib64/libc.so.6
#1  0x00007f81cd70be4f in ?? () from /lib64/libcuda.so.1
#2  0x00007f81cd7df12f in ?? () from /lib64/libcuda.so.1
#3  0x00007f81cd7080f3 in ?? () from /lib64/libcuda.so.1
#4  0x00007f81cbc2a1ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f81caf568d3 in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f81c5520000 (LWP 3913160) "executor_server"):
#0  0x00007f81cb05c307 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f81c9a73509 in ?? () from /lib64/libevent_core-2.1.so.6
#2  0x00007f81c9a691e8 in event_base_loop () from /lib64/libevent_core-2.1.so.6
#3  0x00007f81c769d276 in ?? () from /lib64/libpmix.so.2
#4  0x00007f81cbc2a1ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f81caf568d3 in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f81c8335000 (LWP 3913156) "executor_server"):
#0  0x00007f81cb04fac1 in poll () from /lib64/libc.so.6
#1  0x00007f81c9a728e9 in ?? () from /lib64/libevent_core-2.1.so.6
#2  0x00007f81c9a691e8 in event_base_loop () from /lib64/libevent_core-2.1.so.6
#3  0x00007f81ca109d96 in ?? () from /usr/lib64/openmpi/lib/libopen-pal.so.40
#4  0x00007f81cbc2a1ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f81caf568d3 in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f824f5a7000 (LWP 3913150) "executor_server"):
#0  0x00007f81cbc2b6cd in __pthread_timedjoin_ex () from /lib64/libpthread.so.0
#1  0x00007f81cb94fda7 in std::thread::join() () from /lib64/libstdc++.so.6
#2  0x00007f822ecbf009 in tensorrt_llm::executor::Executor::Impl::shutdown() () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#3  0x00007f822ecbf0f6 in tensorrt_llm::executor::Executor::Impl::~Impl() () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#4  0x00007f822eca3f81 in tensorrt_llm::executor::Executor::~Executor() () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#5  0x000000000046247a in modules::executor_server::Executor::ExecutorImpl::~ExecutorImpl() ()
#6  0x000000000046e701 in modules::executor_server::Executor::~Executor() ()
#7  0x0000000000440381 in std::unique_ptr<modules::executor_server::Executor, std::default_delete<modules::executor_server::Executor> >::~unique_ptr() ()
#8  0x000000000042d4fc in main ()

cuda-gdb rank2

(cuda-gdb) info cuda kernels 
No CUDA kernels.
(cuda-gdb) thread apply all bt

Thread 19 (Thread 0x7fb9a902b000 (LWP 3272138) "executor_server"):
#0  0x00007fc5b12a9307 in epoll_wait () from /lib64/libc.so.6
#1  0x00007fb9a955d984 in ?? () from /lib64/libcudadebugger.so.1
#2  0x00007fb9a955e462 in ?? () from /lib64/libcudadebugger.so.1
#3  0x00007fb9a955c610 in ?? () from /lib64/libcudadebugger.so.1
#4  0x00007fb9a9565aba in ?? () from /lib64/libcudadebugger.so.1
#5  0x00007fb9a9565fbc in ?? () from /lib64/libcudadebugger.so.1
#6  0x00007fb9a944b020 in ?? () from /lib64/libcudadebugger.so.1
#7  0x00007fb9a95a5997 in ?? () from /lib64/libcudadebugger.so.1
#8  0x00007fc5b1e771ca in start_thread () from /lib64/libpthread.so.0
#9  0x00007fc5b11a38d3 in clone () from /lib64/libc.so.6

Thread 17 (Thread 0x7fba3d9fa000 (LWP 3913346) "cuda-EvtHandlr"):
#0  0x00007fc5b1e7d48c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fb9a955e4ba in ?? () from /lib64/libcudadebugger.so.1
#2  0x00007fb9a955c610 in ?? () from /lib64/libcudadebugger.so.1
#3  0x00007fb9a9173225 in ?? () from /lib64/libcudadebugger.so.1
#4  0x00007fb9a91c7961 in ?? () from /lib64/libcudadebugger.so.1
#5  0x00007fb9a9158a9f in ?? () from /lib64/libcudadebugger.so.1
#6  0x00007fb9a9259bd6 in ?? () from /lib64/libcudadebugger.so.1
#7  0x00007fc5b3b8f42b in ?? () from /lib64/libcuda.so.1
#8  0x00007fc5b3a2c184 in ?? () from /lib64/libcuda.so.1
#9  0x00007fc5b39550f3 in ?? () from /lib64/libcuda.so.1
#10 0x00007fc5b1e771ca in start_thread () from /lib64/libpthread.so.0
#11 0x00007fc5b11a38d3 in clone () from /lib64/libc.so.6

Thread 16 (Thread 0x7fba3e1fb000 (LWP 3913238) "executionLoop"):
#0  0x00007fc5b11a2b8b in sched_yield () from /lib64/libc.so.6
#1  0x00007fc630378523 in (anonymous namespace)::ncclStreamSynchronize(CUstream_st*, ncclComm*) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so
#2  0x00007fc63037953c in tensorrt_llm::plugins::AllgatherPlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) () from 
/home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so
#3  0x00007fc5c5b74fec in ?? () from /usr/local/tensorrt/targets/x86_64-linux-gnu/lib/libnvinfer.so.10
#4  0x00007fc5c5ae5a55 in ?? () from /usr/local/tensorrt/targets/x86_64-linux-gnu/lib/libnvinfer.so.10
#5  0x00007fc5c5ae7609 in ?? () from /usr/local/tensorrt/targets/x86_64-linux-gnu/lib/libnvinfer.so.10
#6  0x00007fc614a773a4 in tensorrt_llm::runtime::TllmRuntime::executeContext(int) const () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#7  0x00007fc614e6308b in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeContext(int, int) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#8  0x00007fc614e720b6 in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeStep(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > 
> const&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, int) () from 
/home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#9  0x00007fc614e728de in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(tensorrt_llm::batch_manager::ScheduledRequests const&) () from 
/home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#10 0x00007fc614e7301b in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, 
std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#11 0x00007fc614f084a1 in tensorrt_llm::executor::Executor::Impl::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) () 
from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#12 0x00007fc614f0f5bc in tensorrt_llm::executor::Executor::Impl::executionLoop() () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#13 0x00007fc5fe7e0930 in execute_native_thread_routine () from 
/home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/x86_64-linux-gnu/libtensorrt_llm_nvrtc_wrapper.so
#14 0x00007fc5b1e771ca in start_thread () from /lib64/libpthread.so.0
#15 0x00007fc5b11a38d3 in clone () from /lib64/libc.so.6

Thread 15 (Thread 0x7fba3e9fc000 (LWP 3913234) "dataTransResp"):
#0  0x00007fc5b1e7d48c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fc5b1b96870 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
#2  0x00007fc614deca6b in tensorrt_llm::batch_manager::DataResponder::Impl::response() () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#3  0x00007fc614deac3d in std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, 
std::__future_base::_Result_base::_Deleter>, std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)(), tensorrt_llm::batch_manager::DataResponder::Impl*> >, void> >::_M_invoke(std::_Any_data const&) () 
from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#4  0x000000000043821b in std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*) ()
#5  0x00007fc5b1e7ee97 in __pthread_once_slow () from /lib64/libpthread.so.0
#6  0x00007fc614deba8d in std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)(), 
tensorrt_llm::batch_manager::DataResponder::Impl*> >, void>::_Async_state_impl(std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)(), tensorrt_llm::batch_manager::DataResponder::Impl*> 
>&&)::{lambda()#1}> > >::_M_run() () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#7  0x00007fc5fe7e0930 in execute_native_thread_routine () from 
/home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/x86_64-linux-gnu/libtensorrt_llm_nvrtc_wrapper.so
#8  0x00007fc5b1e771ca in start_thread () from /lib64/libpthread.so.0
#9  0x00007fc5b11a38d3 in clone () from /lib64/libc.so.6

Thread 14 (Thread 0x7fba3f8fe000 (LWP 3913230) "executor_server"):
#0  0x00007fc5b1e7d48c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fc5b5938727 in ncclProxyProgress(void*) () from /lib64/libnccl.so.2
#2  0x00007fc5b1e771ca in start_thread () from /lib64/libpthread.so.0
#3  0x00007fc5b11a38d3 in clone () from /lib64/libc.so.6

Thread 13 (Thread 0x7fba6aba8000 (LWP 3913221) "executor_server"):
#0  0x00007fc5b129cac1 in poll () from /lib64/libc.so.6
#1  0x00007fc5b593747a in ncclProxyServiceUDS(void*) () from /lib64/libnccl.so.2
#2  0x00007fc5b1e771ca in start_thread () from /lib64/libpthread.so.0
#3  0x00007fc5b11a38d3 in clone () from /lib64/libc.so.6

Thread 12 (Thread 0x7fba6b3a9000 (LWP 3913219) "executor_server"):
#0  0x00007fc5b129cac1 in poll () from /lib64/libc.so.6
#1  0x00007fc5b5938dc2 in ncclProxyService(void*) () from /lib64/libnccl.so.2
#2  0x00007fc5b1e771ca in start_thread () from /lib64/libpthread.so.0
#3  0x00007fc5b11a38d3 in clone () from /lib64/libc.so.6

Thread 11 (Thread 0x7fbee6e04000 (LWP 3913215) "executor_server"):
#0  0x00007fc5b1e80ae4 in read () from /lib64/libpthread.so.0
#1  0x00007fc59f8f2fa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007fc5b59690d2 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /lib64/libnccl.so.2
#3  0x00007fc5b598bc04 in ncclIbAsyncThreadMain(void*) () from /lib64/libnccl.so.2
#4  0x00007fc5b1e771ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc5b11a38d3 in clone () from /lib64/libc.so.6

Thread 10 (Thread 0x7fbee7605000 (LWP 3913214) "executor_server"):
#0  0x00007fc5b1e80ae4 in read () from /lib64/libpthread.so.0
#1  0x00007fc59f8f2fa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007fc5b59690d2 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /lib64/libnccl.so.2
#3  0x00007fc5b598bc04 in ncclIbAsyncThreadMain(void*) () from /lib64/libnccl.so.2
#4  0x00007fc5b1e771ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc5b11a38d3 in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x7fbee7e06000 (LWP 3913213) "executor_server"):
#0  0x00007fc5b1e80ae4 in read () from /lib64/libpthread.so.0
#1  0x00007fc59f8f2fa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007fc5b59690d2 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /lib64/libnccl.so.2
#3  0x00007fc5b598bc04 in ncclIbAsyncThreadMain(void*) () from /lib64/libnccl.so.2
#4  0x00007fc5b1e771ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc5b11a38d3 in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x7fc388e83000 (LWP 3913212) "executor_server"):
#0  0x00007fc5b1e80ae4 in read () from /lib64/libpthread.so.0
#1  0x00007fc59f8f2fa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007fc5b59690d2 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /lib64/libnccl.so.2
#3  0x00007fc5b598bc04 in ncclIbAsyncThreadMain(void*) () from /lib64/libnccl.so.2
#4  0x00007fc5b1e771ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc5b11a38d3 in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7fc389684000 (LWP 3913211) "executor_server"):
#0  0x00007fc5b1e80ae4 in read () from /lib64/libpthread.so.0
#1  0x00007fc59f8f2fa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007fc5b59690d2 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /lib64/libnccl.so.2
#3  0x00007fc5b598bc04 in ncclIbAsyncThreadMain(void*) () from /lib64/libnccl.so.2
#4  0x00007fc5b1e771ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc5b11a38d3 in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7fc389e85000 (LWP 3913210) "executor_server"):
#0  0x00007fc5b1e80ae4 in read () from /lib64/libpthread.so.0
#1  0x00007fc59f8f2fa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007fc5b59690d2 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /lib64/libnccl.so.2
#3  0x00007fc5b598bc04 in ncclIbAsyncThreadMain(void*) () from /lib64/libnccl.so.2
#4  0x00007fc5b1e771ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc5b11a38d3 in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7fc38bf7b000 (LWP 3913175) "cuda-EvtHandlr"):
#0  0x00007fc5b129cac1 in poll () from /lib64/libc.so.6
#1  0x00007fc5b3958e4f in ?? () from /lib64/libcuda.so.1
#2  0x00007fc5b3a2c12f in ?? () from /lib64/libcuda.so.1
#3  0x00007fc5b39550f3 in ?? () from /lib64/libcuda.so.1
#4  0x00007fc5b1e771ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc5b11a38d3 in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7fc59b7c6000 (LWP 3913168) "cuda00006000019"):
#0  0x00007fc5b129cac1 in poll () from /lib64/libc.so.6
#1  0x00007fc5b3958e4f in ?? () from /lib64/libcuda.so.1
#2  0x00007fc5b3a2c12f in ?? () from /lib64/libcuda.so.1
#3  0x00007fc5b39550f3 in ?? () from /lib64/libcuda.so.1
#4  0x00007fc5b1e771ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc5b11a38d3 in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7fc5a75c7000 (LWP 3913157) "executor_server"):
#0  0x00007fc5b12a9307 in epoll_wait () from /lib64/libc.so.6
#1  0x00007fc5afcc0509 in ?? () from /lib64/libevent_core-2.1.so.6
#2  0x00007fc5afcb61e8 in event_base_loop () from /lib64/libevent_core-2.1.so.6
#3  0x00007fc5ad8ea276 in ?? () from /lib64/libpmix.so.2
#4  0x00007fc5b1e771ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc5b11a38d3 in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7fc5ae582000 (LWP 3913153) "executor_server"):
#0  0x00007fc5b129cac1 in poll () from /lib64/libc.so.6
#1  0x00007fc5afcbf8e9 in ?? () from /lib64/libevent_core-2.1.so.6
#2  0x00007fc5afcb61e8 in event_base_loop () from /lib64/libevent_core-2.1.so.6
#3  0x00007fc5b0356d96 in ?? () from /usr/lib64/openmpi/lib/libopen-pal.so.40
#4  0x00007fc5b1e771ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc5b11a38d3 in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7fc6357f4000 (LWP 3913151) "executor_server"):
#0  0x00007fc5b1e786cd in __pthread_timedjoin_ex () from /lib64/libpthread.so.0
#1  0x00007fc5b1b9cda7 in std::thread::join() () from /lib64/libstdc++.so.6
#2  0x00007fc614f0c009 in tensorrt_llm::executor::Executor::Impl::shutdown() () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#3  0x00007fc614f0c0f6 in tensorrt_llm::executor::Executor::Impl::~Impl() () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#4  0x00007fc614ef0f81 in tensorrt_llm::executor::Executor::~Executor() () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#5  0x000000000046247a in modules::executor_server::Executor::ExecutorImpl::~ExecutorImpl() ()
#6  0x000000000046e701 in modules::executor_server::Executor::~Executor() ()
#7  0x0000000000440381 in std::unique_ptr<modules::executor_server::Executor, std::default_delete<modules::executor_server::Executor> >::~unique_ptr() ()
#8  0x000000000042d4fc in main ()

cuda-gdb rank3

(cuda-gdb) info cuda kernels
No CUDA kernels.

(cuda-gdb) thread apply all bt
Thread 19 (Thread 0x7f0ec48d9000 (LWP 3268986) "executor_server"):
#0  0x00007f1a39775307 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f0e2b015984 in ?? () from /lib64/libcudadebugger.so.1
#2  0x00007f0e2b016462 in ?? () from /lib64/libcudadebugger.so.1
#3  0x00007f0e2b014610 in ?? () from /lib64/libcudadebugger.so.1
#4  0x00007f0e2b01daba in ?? () from /lib64/libcudadebugger.so.1
#5  0x00007f0e2b01dfbc in ?? () from /lib64/libcudadebugger.so.1
#6  0x00007f0e2af03020 in ?? () from /lib64/libcudadebugger.so.1
#7  0x00007f0e2b05d997 in ?? () from /lib64/libcudadebugger.so.1
#8  0x00007f1a3a3431ca in start_thread () from /lib64/libpthread.so.0
#9  0x00007f1a3966f8d3 in clone () from /lib64/libc.so.6

Thread 17 (Thread 0x7f0ec60fb000 (LWP 3913347) "cuda-EvtHandlr"):
#0  0x00007f1a3a34948c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f0e2b0164ba in ?? () from /lib64/libcudadebugger.so.1
#2  0x00007f0e2b014610 in ?? () from /lib64/libcudadebugger.so.1
#3  0x00007f0e2ac2b225 in ?? () from /lib64/libcudadebugger.so.1
#4  0x00007f0e2ac7f961 in ?? () from /lib64/libcudadebugger.so.1
#5  0x00007f0e2ac10a9f in ?? () from /lib64/libcudadebugger.so.1
#6  0x00007f0e2ad11bd6 in ?? () from /lib64/libcudadebugger.so.1
#7  0x00007f1a3c05b42b in ?? () from /lib64/libcuda.so.1
#8  0x00007f1a3bef8184 in ?? () from /lib64/libcuda.so.1
#9  0x00007f1a3be210f3 in ?? () from /lib64/libcuda.so.1
#10 0x00007f1a3a3431ca in start_thread () from /lib64/libpthread.so.0
#11 0x00007f1a3966f8d3 in clone () from /lib64/libc.so.6

Thread 16 (Thread 0x7f0ec68fc000 (LWP 3913239) "executionLoop"):
#0  0x00007f1a3966f41d in syscall () from /lib64/libc.so.6
#1  0x00007f0e2b05c730 in ?? () from /lib64/libcudadebugger.so.1
#2  0x00007f0e2abf5fd5 in ?? () from /lib64/libcudadebugger.so.1
#3  0x00007f0e2ad11bd6 in ?? () from /lib64/libcudadebugger.so.1
#4  0x00007f1a3bfe9d76 in ?? () from /lib64/libcuda.so.1
#5  0x00007f1ab88d0ac5 in cudaStreamQuery () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so
#6  0x00007f1ab884452b in (anonymous namespace)::ncclStreamSynchronize(CUstream_st*, ncclComm*) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so
#7  0x00007f1ab884553c in tensorrt_llm::plugins::AllgatherPlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so
#8  0x00007f1a4e040fec in ?? () from /usr/local/tensorrt/targets/x86_64-linux-gnu/lib/libnvinfer.so.10
#9  0x00007f1a4dfb1a55 in ?? () from /usr/local/tensorrt/targets/x86_64-linux-gnu/lib/libnvinfer.so.10
#10 0x00007f1a4dfb3609 in ?? () from /usr/local/tensorrt/targets/x86_64-linux-gnu/lib/libnvinfer.so.10
#11 0x00007f1a9cf433a4 in tensorrt_llm::runtime::TllmRuntime::executeContext(int) const () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#12 0x00007f1a9d32f08b in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeContext(int, int) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#13 0x00007f1a9d33e0b6 in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeStep(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, int) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#14 0x00007f1a9d33e8de in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(tensorrt_llm::batch_manager::ScheduledRequests const&) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#15 0x00007f1a9d33f01b in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#16 0x00007f1a9d3d44a1 in tensorrt_llm::executor::Executor::Impl::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#17 0x00007f1a9d3db5bc in tensorrt_llm::executor::Executor::Impl::executionLoop() () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#18 0x00007f1a86cac930 in execute_native_thread_routine () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/x86_64-linux-gnu/libtensorrt_llm_nvrtc_wrapper.so
#19 0x00007f1a3a3431ca in start_thread () from /lib64/libpthread.so.0
#20 0x00007f1a3966f8d3 in clone () from /lib64/libc.so.6

Thread 15 (Thread 0x7f0ec70fd000 (LWP 3913236) "dataTransResp"):
#0  0x00007f1a3a34948c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f1a3a062870 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
#2  0x00007f1a9d2b8a6b in tensorrt_llm::batch_manager::DataResponder::Impl::response() () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#3  0x00007f1a9d2b6c3d in std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)(), tensorrt_llm::batch_manager::DataResponder::Impl*> >, void> >::_M_invoke(std::_Any_data const&) () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#4  0x000000000043821b in std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*) ()
#5  0x00007f1a3a34ae97 in __pthread_once_slow () from /lib64/libpthread.so.0
#6  0x00007f1a9d2b7a8d in std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)(), tensorrt_llm::batch_manager::DataResponder::Impl*> >, void>::_Async_state_impl(std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)(), tensorrt_llm::batch_manager::DataResponder::Impl*> >&&)::{lambda()#1}> > >::_M_run() () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#7  0x00007f1a86cac930 in execute_native_thread_routine () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/x86_64-linux-gnu/libtensorrt_llm_nvrtc_wrapper.so
#8  0x00007f1a3a3431ca in start_thread () from /lib64/libpthread.so.0
#9  0x00007f1a3966f8d3 in clone () from /lib64/libc.so.6

Thread 14 (Thread 0x7f0ec7fff000 (LWP 3913231) "executor_server"):
#0  0x00007f1a3a34948c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f1a3de04727 in ncclProxyProgress(void*) () from /lib64/libnccl.so.2
#2  0x00007f1a3a3431ca in start_thread () from /lib64/libpthread.so.0
#3  0x00007f1a3966f8d3 in clone () from /lib64/libc.so.6

Thread 13 (Thread 0x7f0ef33a9000 (LWP 3913224) "executor_server"):
#0  0x00007f1a39768ac1 in poll () from /lib64/libc.so.6
#1  0x00007f1a3de0347a in ncclProxyServiceUDS(void*) () from /lib64/libnccl.so.2
#2  0x00007f1a3a3431ca in start_thread () from /lib64/libpthread.so.0
#3  0x00007f1a3966f8d3 in clone () from /lib64/libc.so.6

Thread 12 (Thread 0x7f136ee04000 (LWP 3913223) "executor_server"):
#0  0x00007f1a39768ac1 in poll () from /lib64/libc.so.6
#1  0x00007f1a3de04dc2 in ncclProxyService(void*) () from /lib64/libnccl.so.2
#2  0x00007f1a3a3431ca in start_thread () from /lib64/libpthread.so.0
#3  0x00007f1a3966f8d3 in clone () from /lib64/libc.so.6

Thread 11 (Thread 0x7f136f605000 (LWP 3913201) "executor_server"):
#0  0x00007f1a3a34cae4 in read () from /lib64/libpthread.so.0
#1  0x00007f1a27df1fa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007f1a3de350d2 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /lib64/libnccl.so.2
#3  0x00007f1a3de57c04 in ncclIbAsyncThreadMain(void*) () from /lib64/libnccl.so.2
#4  0x00007f1a3a3431ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f1a3966f8d3 in clone () from /lib64/libc.so.6

Thread 10 (Thread 0x7f136fe06000 (LWP 3913198) "executor_server"):
#0  0x00007f1a3a34cae4 in read () from /lib64/libpthread.so.0
#1  0x00007f1a27df1fa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007f1a3de350d2 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /lib64/libnccl.so.2
#3  0x00007f1a3de57c04 in ncclIbAsyncThreadMain(void*) () from /lib64/libnccl.so.2
#4  0x00007f1a3a3431ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f1a3966f8d3 in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x7f1810882000 (LWP 3913196) "executor_server"):
#0  0x00007f1a3a34cae4 in read () from /lib64/libpthread.so.0
#1  0x00007f1a27df1fa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007f1a3de350d2 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /lib64/libnccl.so.2
#3  0x00007f1a3de57c04 in ncclIbAsyncThreadMain(void*) () from /lib64/libnccl.so.2
#4  0x00007f1a3a3431ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f1a3966f8d3 in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x7f1811083000 (LWP 3913194) "executor_server"):
#0  0x00007f1a3a34cae4 in read () from /lib64/libpthread.so.0
#1  0x00007f1a27df1fa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007f1a3de350d2 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /lib64/libnccl.so.2
#3  0x00007f1a3de57c04 in ncclIbAsyncThreadMain(void*) () from /lib64/libnccl.so.2
#4  0x00007f1a3a3431ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f1a3966f8d3 in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7f1811884000 (LWP 3913192) "executor_server"):
#0  0x00007f1a3a34cae4 in read () from /lib64/libpthread.so.0
#1  0x00007f1a27df1fa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007f1a3de350d2 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /lib64/libnccl.so.2
#3  0x00007f1a3de57c04 in ncclIbAsyncThreadMain(void*) () from /lib64/libnccl.so.2
#4  0x00007f1a3a3431ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f1a3966f8d3 in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7f1812085000 (LWP 3913186) "executor_server"):
#0  0x00007f1a3a34cae4 in read () from /lib64/libpthread.so.0
#1  0x00007f1a27df1fa3 in ibv_get_async_event () from /lib64/libibverbs.so.1
#2  0x00007f1a3de350d2 in wrap_ibv_get_async_event(ibv_context*, ibv_async_event*) () from /lib64/libnccl.so.2
#3  0x00007f1a3de57c04 in ncclIbAsyncThreadMain(void*) () from /lib64/libnccl.so.2
#4  0x00007f1a3a3431ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f1a3966f8d3 in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7f1a20db1000 (LWP 3913176) "cuda-EvtHandlr"):
#0  0x00007f1a39768ac1 in poll () from /lib64/libc.so.6
#1  0x00007f1a3be24e4f in ?? () from /lib64/libcuda.so.1
#2  0x00007f1a3bef812f in ?? () from /lib64/libcuda.so.1
#3  0x00007f1a3be210f3 in ?? () from /lib64/libcuda.so.1
#4  0x00007f1a3a3431ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f1a3966f8d3 in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f1a23ecd000 (LWP 3913169) "cuda00006000019"):
#0  0x00007f1a39768ac1 in poll () from /lib64/libc.so.6
#1  0x00007f1a3be24e4f in ?? () from /lib64/libcuda.so.1
#2  0x00007f1a3bef812f in ?? () from /lib64/libcuda.so.1
#3  0x00007f1a3be210f3 in ?? () from /lib64/libcuda.so.1
#4  0x00007f1a3a3431ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f1a3966f8d3 in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f1a2fbf0000 (LWP 3913158) "executor_server"):
#0  0x00007f1a39775307 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f1a3818c509 in ?? () from /lib64/libevent_core-2.1.so.6
#2  0x00007f1a381821e8 in event_base_loop () from /lib64/libevent_core-2.1.so.6
#3  0x00007f1a35db6276 in ?? () from /lib64/libpmix.so.2
#4  0x00007f1a3a3431ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f1a3966f8d3 in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f1a36a4e000 (LWP 3913154) "executor_server"):
#0  0x00007f1a39768ac1 in poll () from /lib64/libc.so.6
#1  0x00007f1a3818b8e9 in ?? () from /lib64/libevent_core-2.1.so.6
#2  0x00007f1a381821e8 in event_base_loop () from /lib64/libevent_core-2.1.so.6
#3  0x00007f1a38822d96 in ?? () from /usr/lib64/openmpi/lib/libopen-pal.so.40
#4  0x00007f1a3a3431ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f1a3966f8d3 in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f1abdcc0000 (LWP 3913152) "executor_server"):
#0  0x00007f1a3a3446cd in __pthread_timedjoin_ex () from /lib64/libpthread.so.0
#1  0x00007f1a3a068da7 in std::thread::join() () from /lib64/libstdc++.so.6
#2  0x00007f1a9d3d8009 in tensorrt_llm::executor::Executor::Impl::shutdown() () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#3  0x00007f1a9d3d80f6 in tensorrt_llm::executor::Executor::Impl::~Impl() () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#4  0x00007f1a9d3bcf81 in tensorrt_llm::executor::Executor::~Executor() () from /home/askhoroshev/trtllm_github/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so
#5  0x000000000046247a in modules::executor_server::Executor::ExecutorImpl::~ExecutorImpl() ()
#6  0x000000000046e701 in modules::executor_server::Executor::~Executor() ()
#7  0x0000000000440381 in std::unique_ptr<modules::executor_server::Executor, std::default_delete<modules::executor_server::Executor> >::~unique_ptr() ()
#8  0x000000000042d4fc in main ()

sjeaugey · 2025-01-06T08:23:04Z

This is weird. I don't see anything off in the backtrace that would explain the hang. Anything reported in dmesg/syslog?

akhoroshev · 2025-01-06T10:39:18Z

The problem went away for me after I set the environment variable FORCE_NCCL_ALL_REDUCE_STRATEGY=1.

I don't know why the custom all-reduce affects all-gather, but the problem is gone.

The custom all-reduce implementation probably has undefined behavior (UB) that affects NCCL.
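
For anyone who cannot easily change the launch environment, here is a minimal sketch of applying the same workaround from the host program. It assumes a POSIX system and that the variable is read only after this call runs, which I have not verified against the TensorRT-LLM sources. `forceNcclAllReduceStrategy` and `isNcclAllReduceForced` are hypothetical helper names, not part of TensorRT-LLM or NCCL.

```cpp
#include <cstdlib>   // std::getenv
#include <stdlib.h>  // setenv (POSIX, not ISO C++)
#include <string>

// Hypothetical helper: set the workaround variable for the current process.
// Call it on every rank, before the executor is constructed (assumption:
// the library reads the variable at or after that point).
inline bool forceNcclAllReduceStrategy()
{
    // Third argument 1 = overwrite any value inherited from the launcher.
    return setenv("FORCE_NCCL_ALL_REDUCE_STRATEGY", "1", 1) == 0;
}

// Hypothetical helper: check that the variable is visible to this process,
// e.g. to log it at startup next to the NCCL version.
inline bool isNcclAllReduceForced()
{
    char const* value = std::getenv("FORCE_NCCL_ALL_REDUCE_STRATEGY");
    return value != nullptr && std::string(value) == "1";
}
```

Exporting the variable in the shell that starts the job is the simpler route when that is possible. With several ranks, each process needs to see the variable, so it is worth logging `isNcclAllReduceForced()` per rank rather than assuming the launcher forwarded it.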
