[Bug]: v0.6.4.post1 crashed: Error in model execution: CUDA error: an illegal memory access was encountered #10389

Open
wciq1208 opened this issue Nov 16, 2024 · 32 comments
Labels
bug Something isn't working

Comments

@wciq1208

Your current environment

The output of `python collect_env.py`
Collecting environment information...
PyTorch version: 2.5.1+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.30.5
Libc version: glibc-2.35

Python version: 3.11.10 | packaged by conda-forge | (main, Oct 16 2024, 01:27:36) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-3.10.0-1160.71.1.el7.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA GeForce RTX 3090
GPU 1: NVIDIA GeForce RTX 3090

Nvidia driver version: 535.129.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   46 bits physical, 48 bits virtual
Byte Order:                      Little Endian
CPU(s):                          16
On-line CPU(s) list:             0-15
Vendor ID:                       GenuineIntel
Model name:                      Intel(R) Xeon(R) Gold 6140M CPU @ 2.30GHz
CPU family:                      6
Model:                           85
Thread(s) per core:              1
Core(s) per socket:              4
Socket(s):                       4
Stepping:                        4
BogoMIPS:                        4599.99
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology eagerfpu pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 arat umip pku ospke md_clear spec_ctrl intel_stibp arch_capabilities
Virtualization:                  VT-x
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       512 KiB (16 instances)
L1i cache:                       512 KiB (16 instances)
L2 cache:                        64 MiB (16 instances)
L3 cache:                        64 MiB (4 instances)
NUMA node(s):                    1
NUMA node0 CPU(s):               0-15
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Mitigation; PTE Inversion; VMX conditional cache flushes, SMT disabled
Vulnerability Mds:               Mitigation; Clear CPU buffers; SMT Host state unknown
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; Load fences, usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; IBRS (kernel), IBPB
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Mitigation; Clear CPU buffers; SMT Host state unknown

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] optree==0.13.0
[pip3] pyzmq==26.2.0
[pip3] torch==2.5.1+cu124
[pip3] torchaudio==2.5.1+cu124
[pip3] torchelastic==0.2.2
[pip3] torchvision==0.20.1+cu124
[pip3] transformers==4.46.2
[pip3] triton==3.1.0
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] nvidia-cublas-cu12        12.4.5.8                 pypi_0    pypi
[conda] nvidia-cuda-cupti-cu12    12.4.127                 pypi_0    pypi
[conda] nvidia-cuda-nvrtc-cu12    12.4.127                 pypi_0    pypi
[conda] nvidia-cuda-runtime-cu12  12.4.127                 pypi_0    pypi
[conda] nvidia-cudnn-cu12         9.1.0.70                 pypi_0    pypi
[conda] nvidia-cufft-cu12         11.2.1.3                 pypi_0    pypi
[conda] nvidia-curand-cu12        10.3.5.147               pypi_0    pypi
[conda] nvidia-cusolver-cu12      11.6.1.9                 pypi_0    pypi
[conda] nvidia-cusparse-cu12      12.3.1.170               pypi_0    pypi
[conda] nvidia-ml-py              12.560.30                pypi_0    pypi
[conda] nvidia-nccl-cu12          2.21.5                   pypi_0    pypi
[conda] nvidia-nvjitlink-cu12     12.4.127                 pypi_0    pypi
[conda] nvidia-nvtx-cu12          12.4.127                 pypi_0    pypi
[conda] optree                    0.13.0                   pypi_0    pypi
[conda] pyzmq                     26.2.0                   pypi_0    pypi
[conda] torch                     2.5.1+cu124              pypi_0    pypi
[conda] torchaudio                2.5.1+cu124              pypi_0    pypi
[conda] torchelastic              0.2.2                    pypi_0    pypi
[conda] torchvision               0.20.1+cu124             pypi_0    pypi
[conda] transformers              4.46.2                   pypi_0    pypi
[conda] triton                    3.1.0                    pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.4.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PHB     0-15    0               N/A
GPU1    PHB      X      0-15    0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NVIDIA_VISIBLE_DEVICES=all
NVIDIA_DRIVER_CAPABILITIES=compute,utility
PYTORCH_VERSION=2.5.1
CUDA_VISIBLE_DEVICES=0
CUDA_VISIBLE_DEVICES=0
VLLM_PLUGINS=clean_cuda_cache
LD_LIBRARY_PATH=/opt/conda/lib/python3.11/site-packages/cv2/../../lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
VLLM_RPC_TIMEOUT=600000
CUDA_MODULE_LOADING=LAZY

Model Input Dumps

err_execute_model_input_20241116-081810.zip
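
If it helps with triage, a quick way to peek inside a dump like this is shown below (a sketch, assuming the matching vLLM/PyTorch versions are installed so the pickled classes resolve; the path is a placeholder for the file inside the zip):

```python
import pickle

# Load the dumped model input written by model_runner_base.py.
# Tensors captured from the GPU may need a CUDA-capable machine to unpickle.
with open("err_execute_model_input_20241116-081810.pkl", "rb") as f:
    dump = pickle.load(f)

print(type(dump))
print(dump)  # repr of the captured execute_model input
```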

🐛 Describe the bug

Command:

vllm serve /hestia/model/Qwen2.5-14B-Instruct-AWQ --max-model-len 32768 --quantization awq_marlin --port 8001 --served-model-name qwen --num-gpu-blocks-override 2048 --disable-log-requests --swap-space 4 --enable-prefix-caching --enable-chunked-prefill
INFO 11-16 10:37:50 metrics.py:449] Avg prompt throughput: 5941.0 tokens/s, Avg generation throughput: 16.5 tokens/s, Running: 3 reqs, Swapped: 0 reqs, Pending: 13 reqs, GPU KV cache usage: 1.5%, CPU KV cache usage: 0.0%.
INFO 11-16 10:37:50 metrics.py:465] Prefix cache hit rate: GPU: 94.87%, CPU: 0.00%
INFO:     ::1:59242 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 11-16 10:37:53 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241116-103753.pkl...
WARNING 11-16 10:37:53 model_runner_base.py:143] Failed to pickle inputs of failed execution: CUDA error: an illegal memory access was encountered
WARNING 11-16 10:37:53 model_runner_base.py:143] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
WARNING 11-16 10:37:53 model_runner_base.py:143] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
WARNING 11-16 10:37:53 model_runner_base.py:143] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
WARNING 11-16 10:37:53 model_runner_base.py:143] 
CRITICAL 11-16 10:37:53 launcher.py:99] MQLLMEngine is already dead, terminating server process
INFO:     ::1:59242 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
CRITICAL 11-16 10:37:53 launcher.py:99] MQLLMEngine is already dead, terminating server process
INFO:     ::1:59468 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR 11-16 10:37:53 engine.py:135] RuntimeError('Error in model execution: CUDA error: an illegal memory access was encountered\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n')
ERROR 11-16 10:37:53 engine.py:135] Traceback (most recent call last):
ERROR 11-16 10:37:53 engine.py:135]   File "/opt/conda/lib/python3.11/site-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
ERROR 11-16 10:37:53 engine.py:135]     return func(*args, **kwargs)
ERROR 11-16 10:37:53 engine.py:135]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 11-16 10:37:53 engine.py:135]   File "/opt/conda/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1687, in execute_model
ERROR 11-16 10:37:53 engine.py:135]     logits = self.model.compute_logits(hidden_or_intermediate_states,
ERROR 11-16 10:37:53 engine.py:135]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-16 10:37:53 engine.py:135]   File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 478, in compute_logits
ERROR 11-16 10:37:53 engine.py:135]     logits = self.logits_processor(self.lm_head, hidden_states,
ERROR 11-16 10:37:53 engine.py:135]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-16 10:37:53 engine.py:135]   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 11-16 10:37:53 engine.py:135]     return self._call_impl(*args, **kwargs)
ERROR 11-16 10:37:53 engine.py:135]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-16 10:37:53 engine.py:135]   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 11-16 10:37:53 engine.py:135]     return forward_call(*args, **kwargs)
ERROR 11-16 10:37:53 engine.py:135]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-16 10:37:53 engine.py:135]   File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/layers/logits_processor.py", line 74, in forward
ERROR 11-16 10:37:53 engine.py:135]     logits = _apply_logits_processors(logits, sampling_metadata)
ERROR 11-16 10:37:53 engine.py:135]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-16 10:37:53 engine.py:135]   File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/layers/logits_processor.py", line 150, in _apply_logits_processors
ERROR 11-16 10:37:53 engine.py:135]     logits_row = logits_processor(past_tokens_ids,
ERROR 11-16 10:37:53 engine.py:135]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-16 10:37:53 engine.py:135]   File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/guided_decoding/outlines_logits_processors.py", line 87, in __call__
ERROR 11-16 10:37:53 engine.py:135]     allowed_tokens = torch.tensor(allowed_tokens, device=scores.device)
ERROR 11-16 10:37:53 engine.py:135]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-16 10:37:53 engine.py:135] RuntimeError: CUDA error: an illegal memory access was encountered
ERROR 11-16 10:37:53 engine.py:135] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 11-16 10:37:53 engine.py:135] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 11-16 10:37:53 engine.py:135] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR 11-16 10:37:53 engine.py:135] 
ERROR 11-16 10:37:53 engine.py:135] 
ERROR 11-16 10:37:53 engine.py:135] The above exception was the direct cause of the following exception:
ERROR 11-16 10:37:53 engine.py:135] 
ERROR 11-16 10:37:53 engine.py:135] Traceback (most recent call last):
ERROR 11-16 10:37:53 engine.py:135]   File "/opt/conda/lib/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 133, in start
ERROR 11-16 10:37:53 engine.py:135]     self.run_engine_loop()
ERROR 11-16 10:37:53 engine.py:135]   File "/opt/conda/lib/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 196, in run_engine_loop
ERROR 11-16 10:37:53 engine.py:135]     request_outputs = self.engine_step()
ERROR 11-16 10:37:53 engine.py:135]                       ^^^^^^^^^^^^^^^^^^
ERROR 11-16 10:37:53 engine.py:135]   File "/opt/conda/lib/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 214, in engine_step
ERROR 11-16 10:37:53 engine.py:135]     raise e
ERROR 11-16 10:37:53 engine.py:135]   File "/opt/conda/lib/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 205, in engine_step
ERROR 11-16 10:37:53 engine.py:135]     return self.engine.step()
ERROR 11-16 10:37:53 engine.py:135]            ^^^^^^^^^^^^^^^^^^
ERROR 11-16 10:37:53 engine.py:135]   File "/opt/conda/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 1454, in step
ERROR 11-16 10:37:53 engine.py:135]     outputs = self.model_executor.execute_model(
ERROR 11-16 10:37:53 engine.py:135]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-16 10:37:53 engine.py:135]   File "/opt/conda/lib/python3.11/site-packages/vllm/executor/gpu_executor.py", line 125, in execute_model
ERROR 11-16 10:37:53 engine.py:135]     output = self.driver_worker.execute_model(execute_model_req)
ERROR 11-16 10:37:53 engine.py:135]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-16 10:37:53 engine.py:135]   File "/opt/conda/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 343, in execute_model
ERROR 11-16 10:37:53 engine.py:135]     output = self.model_runner.execute_model(
ERROR 11-16 10:37:53 engine.py:135]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-16 10:37:53 engine.py:135]   File "/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 11-16 10:37:53 engine.py:135]     return func(*args, **kwargs)
ERROR 11-16 10:37:53 engine.py:135]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 11-16 10:37:53 engine.py:135]   File "/opt/conda/lib/python3.11/site-packages/vllm/worker/model_runner_base.py", line 146, in _wrapper
ERROR 11-16 10:37:53 engine.py:135]     raise type(err)(f"Error in model execution: "
ERROR 11-16 10:37:53 engine.py:135] RuntimeError: Error in model execution: CUDA error: an illegal memory access was encountered
ERROR 11-16 10:37:53 engine.py:135] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 11-16 10:37:53 engine.py:135] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 11-16 10:37:53 engine.py:135] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR 11-16 10:37:53 engine.py:135] 
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [618]
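
Since the traceback above fails inside the outlines guided-decoding logits processor, the crash seems tied to structured-output requests under load. A rough client-side sketch of the traffic that triggers it (not a confirmed reproducer; the port and served model name come from the command above, while the schema, prompts, and concurrency are placeholders):

```python
# Fire many concurrent guided_json chat requests at the server started above.
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")

schema = {
    "type": "object",
    "properties": {"answer": {"type": "string"}},
    "required": ["answer"],
}

def one_request(i: int) -> str:
    resp = client.chat.completions.create(
        model="qwen",
        messages=[{"role": "user", "content": f"Question {i}: summarize vLLM in one sentence as JSON."}],
        max_tokens=256,
        extra_body={"guided_json": schema},  # guided decoding, as in the failing stack
    )
    return resp.choices[0].message.content

with ThreadPoolExecutor(max_workers=32) as pool:
    for _ in pool.map(one_request, range(1000)):
        pass  # outputs are discarded; the point is whether the server survives
```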

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@wciq1208 wciq1208 added the bug Something isn't working label Nov 16, 2024
@DaBossCoda

Getting this a lot since 0.6.3. Seems to be related to AWQ models.

@llmforever

Same situation here, can anyone solve this?

@epark001

epark001 commented Nov 18, 2024

Experiencing this as well. I thought this would be fixed by #9532, but I'm still seeing it since 0.6.3.

Edit: still experiencing this in 0.6.2.

@sunyicode0012

I encountered the same problem and was quite confused during the process.
version: 0.6.3.post1
model: llama-3.1-405B-FP8

@DaBossCoda

INFO 11-19 11:15:57 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241119-111557.pkl...
WARNING 11-19 11:15:57 model_runner_base.py:143] Failed to pickle inputs of failed execution: CUDA error: an illegal memory access was encountered
WARNING 11-19 11:15:57 model_runner_base.py:143] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
WARNING 11-19 11:15:57 model_runner_base.py:143] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
WARNING 11-19 11:15:57 model_runner_base.py:143] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
WARNING 11-19 11:15:57 model_runner_base.py:143]
[rank0]:[E1119 11:15:57.623518240 ProcessGroupNCCL.cpp:1595] [PG ID 2 PG GUID 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

@seven1122

Same problem in 0.6.1, 0.6.3.post1, and 0.6.4.post1.

@DaBossCoda

Happens to me in 0.6.2 too.

@badrjd

badrjd commented Nov 24, 2024

Same for me on Llama 3.1 70B AWQ, from 0.6.1 to 0.6.4.post1.

@BIGWangYuDong

Same for me on Qwen2.5-72B.

@linfan

linfan commented Nov 26, 2024

Same issue for Qwen2.5-72B-GPTQ-INT4 with 0.6.4.post1,
with enable_chunked_prefill=False, enable_prefix_caching=False, use_v2_block_manager=False.

@badrjd

badrjd commented Nov 26, 2024

Going back to 0.6.0 fixed the issue for me, but unfortunately it's quite a bit slower.

@nelsonspbr

vllm-0.6.1.post2 works for me while vllm-0.6.4.post1 doesn't. Same run, llama3.1-70b FP16, fixed input size 1024, fixed output 1024, right when I transition from batch size 8 to 16 (when it starts requiring preemption on an H100). Only with FlashAttention from what I can tell; FlashInfer works.

@TopIdiot

I hit a similar bug in 0.6.3; changing the attention backend to FlashInfer fixed it.

@junior-zsy

@sasha0552 It seems that the problem has not been fixed. Could you take another look at it? Thank you.

@DaBossCoda

@WoosukKwon

@Xingkangze

me too

@Ryosuke0104

same for me on llava-onevision-qwen2-7b-ov-hf

@WoosukKwon
Collaborator

@wciq1208 @DaBossCoda @llmforever @Ryosuke0104 @Xingkangze @junior-zsy @TopIdiot @nelsonspbr @badrjd @linfan @seven1122 Do you have a reproducible script by any chance (also with the exact vLLM version)? It would be nice if it's on an H100 GPU. I tried the command (Qwen2.5-14B) that @wciq1208 posted, but wasn't able to reproduce the bug.

@sunyicode0012

@wciq1208 @DaBossCoda @llmforever @Ryosuke0104 @Xingkangze @junior-zsy @TopIdiot @nelsonspbr @badrjd @linfan @seven1122 Do you have a reproducible script by any chance (also with the exact vLLM version)? It would be nice if it's on an H100 GPU. I tried the command (Qwen2.5-14B) that @wciq1208 posted, but wasn't able to reproduce the bug.

My version is: 0.6.3.post1
My device model is: H20
My command is: CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 nohup python -m vllm.entrypoints.openai.api_server --model Llama-3.1-405B-Instruct-FP8 --tensor-parallel-size 8 --served-model-name llama-3.1-405B --port 8090 --gpu-memory-utilization 0.96
The above error occurs intermittently now. Looking forward to your reply.

@sunyicode0012

@WoosukKwon

@robertgshaw2-redhat
Collaborator

@wciq1208 @DaBossCoda @llmforever @Ryosuke0104 @Xingkangze @junior-zsy @TopIdiot @nelsonspbr @badrjd @linfan @seven1122 Do you have a reproducible script by any chance (also with the exact vLLM version)? It would be nice if it's on an H100 GPU. I tried the command (Qwen2.5-14B) that @wciq1208 posted, but wasn't able to reproduce the bug.

I also tried on A100 + L4, haven't been able to repro.

@yoululai-yhl

Same issue for Llama 3.1 70B AWQ with 0.6.3.post1
GPU: A800*2
Command: CUDA_VISIBLE_DEVICES=0,1 python3 -m vllm.entrypoints.openai.api_server --model /home/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 --tensor-parallel-size 2 --served-model-name llama-3.1-awq --port 30011
Test script: the client starts 32, 64, 128, and 200 threads in sequence; each thread sends requests one by one for 1 minute. The above test steps are repeated for several hours (a rough sketch of this pattern is included below).

A CUDA error occurred: an illegal memory access was encountered. The process dies but the GPU memory is not freed.
Dmesg info:
[ 7693.770437] NVRM: GPU at PCI:0000:35:00: GPU-b925e9f8-460f-24cc-5d92-d570d1ad30da
[ 7693.771094] NVRM: GPU Board Serial Number: 1323922026673
[ 7693.771856] NVRM: Xid (PCI:0000:35:00): 31, pid=395116, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_10 faulted @ 0x7fb2_d7200000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
[ 7693.886736] NVRM: GPU at PCI:0000:36:00: GPU-9c5ef0f3-102b-713e-ee79-5457671f51ff
[ 7693.887362] NVRM: GPU Board Serial Number: 1323922027996
[ 7693.887866] NVRM: Xid (PCI:0000:36:00): 31, pid=396221, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_10 faulted @ 0x7fb2_e9200000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ

script:
test_vllm_openai_random.py.log
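
Not the attached script, but the load pattern described above corresponds roughly to the following sketch (endpoint and served model name follow the command above; the prompt and timeouts are placeholders):

```python
# Ramp client threads through 32, 64, 128, 200; each thread sends chat requests
# back-to-back for one minute. Loop the whole ramp for several hours to match the test above.
import threading
import time

import requests

URL = "http://localhost:30011/v1/chat/completions"

def worker(stop_at: float) -> None:
    while time.time() < stop_at:
        requests.post(
            URL,
            json={
                "model": "llama-3.1-awq",
                "messages": [{"role": "user", "content": "Write a short paragraph about GPUs."}],
                "max_tokens": 256,
            },
            timeout=300,
        )

for n_threads in (32, 64, 128, 200):
    stop_at = time.time() + 60
    threads = [threading.Thread(target=worker, args=(stop_at,)) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```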

@yoululai-yhl

@WoosukKwon @robertgshaw2-neuralmagic
Reproduced the issue in versions 0.6.3.post1 and 0.6.4.post1 with the parameter --max-num-batched-token=2048 and a prompt with a length of 131 tokens. This behavior confuses me a lot.

A800*2 for llama 3.1 70B AWQ
CUDA_VISIBLE_DEVICES=0,1 python3 -m vllm.entrypoints.openai.api_server --model /home/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 --tensor-parallel-size 2 --served-model-name llama-3.1-awq --port 30011 --max-num-batched-token=2048

Test script updated:
test_vllm_openai_random_131.py.log

@derpsteb

I could sporadically reproduce the issue using the serving benchmarks.

I used this command to start vllm: docker run -p 8000:8000 -v $(realpath ../Meta-Llama-3.3-70B-Instruct-AWQ-INT4):/model --gpus=all docker.io/vllm/vllm-openai:v0.6.4.post1 --model=/model

And this command to execute the benchmarks: python3 benchmarks/benchmark_serving.py --backend openai-chat --model /model --dataset-name sharegpt --dataset-path ../ShareGPT_V3_unfiltered_cleaned_split.json --host 0.0.0.0 --port 8000 --endpoint /v1/chat/completions --tokenizer=../Meta-Llama-3.3-70B-Instruct-AWQ-INT4 --num-prompts=1000

@yuleiqin

yuleiqin commented Dec 26, 2024

Same problem here with H20 and Qwen2.5-72B, with either BF16 or FP16:

(VllmWorkerProcess pid=294331) INFO 12-26 10:52:10 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241226-105210.pkl...
(VllmWorkerProcess pid=294331) WARNING 12-26 10:52:10 model_runner_base.py:143] Failed to pickle inputs of failed execution: CUDA error: an illegal memory access was encountered
(VllmWorkerProcess pid=294331) WARNING 12-26 10:52:10 model_runner_base.py:143] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(VllmWorkerProcess pid=294331) WARNING 12-26 10:52:10 model_runner_base.py:143] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(VllmWorkerProcess pid=294331) WARNING 12-26 10:52:10 model_runner_base.py:143] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

(VllmWorkerProcess pid=294331) WARNING 12-26 10:52:10 model_runner_base.py:143]
ERROR 12-26 10:52:12 multiproc_worker_utils.py:116] Worker VllmWorkerProcess pid 294331 died, exit code: -6
INFO 12-26 10:52:12 multiproc_worker_utils.py:120] Killing local vLLM worker processes

@pan-x-c

pan-x-c commented Dec 26, 2024

Same problem with H20 and Qwen2.5-72B-instruct.
After using export VLLM_ATTENTION_BACKEND=XFORMERS, the error no longer appears, but performance is degraded. The problem is most likely caused by the FlashAttentionBackend.
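
For anyone hitting this from the Python API rather than `vllm serve`, the same workaround looks roughly like this (a sketch: the env var must be set before vLLM selects its attention backend, and the model path / tensor parallel size here are placeholders; `FLASHINFER` is the alternative mentioned earlier in the thread):

```python
import os

# Must be set before vLLM initializes, so it picks the xFormers backend
# instead of FlashAttention. "FLASHINFER" is another option reported to help.
os.environ["VLLM_ATTENTION_BACKEND"] = "XFORMERS"

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-72B-Instruct", tensor_parallel_size=8)  # placeholders
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```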

@denadai2

denadai2 commented Dec 30, 2024

Same for me with https://huggingface.co/neuralmagic-ent/Llama-3.3-70B-Instruct-FP8-dynamic

(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) INFO 12-30 18:30:11 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241230-183011.pkl...
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) WARNING 12-30 18:30:11 model_runner_base.py:143] Failed to pickle inputs of failed execution: CUDA error: an illegal memory access was encountered
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) WARNING 12-30 18:30:11 model_runner_base.py:143] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) WARNING 12-30 18:30:11 model_runner_base.py:143] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) WARNING 12-30 18:30:11 model_runner_base.py:143] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) WARNING 12-30 18:30:11 model_runner_base.py:143]
(MapWorker(MapBatches(VLLMPredictor)) pid=24189, ip=10.169.47.12) 2024/12/30 14:16:29 INFO     2024/12/30 14:16:29 INFO datasets:     config.py:58
(MapWorker(MapBatches(VLLMPredictor)) pid=24189, ip=10.169.47.12)                              PyTorch version 2.5.1 available.
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) TMA Desc Addr:   0x7b2ed91f7c80
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) format         0
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) dim            3
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) gmem_address   0x7b42218d3800
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) globalDim      (4096,699,1,1,1)
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) globalStrides  (1,4096,0,0,0)
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) boxDim         (128,128,1,1,1)
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) elementStrides (1,1,1,1,1)
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) interleave     0
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) swizzle        3
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) l2Promotion    2
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) oobFill        0
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) Error: Failed to initialize the TMA descriptor 700
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) TMA Desc Addr:   0x7b2ed91f7c80
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) format         0
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) dim            3
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) gmem_address   0x7b3b02000000
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) globalDim      (4096,8192,1,1,1)
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) globalStrides  (1,4096,0,0,0)
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) boxDim         (128,64,1,1,1)
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) elementStrides (1,1,1,1,1)
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) interleave     0
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) swizzle        3
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) l2Promotion    2
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) oobFill        0
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) Error: Failed to initialize the TMA descriptor 700
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) TMA Desc Addr:   0x7b2ed91f7c80
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) format         9
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) dim            2
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) gmem_address   0x7b2b58000000
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) globalDim      (8192,699,1,1,1)
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) globalStrides  (2,16384,0,0,0)
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) boxDim         (32,64,1,1,1)
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) elementStrides (1,1,1,1,1)
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) interleave     0
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) swizzle        2
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) l2Promotion    2
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) oobFill        0
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) Error: Failed to initialize the TMA descriptor 700
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) [rank0]:[E1230 18:30:11.182186416 ProcessGroupNCCL.cpp:1595] [PG ID 2 PG GUID 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7b4701d6c446 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7b4701d166e4 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7b4702178a18 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7b46a7685726 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7b46a768a3f0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7b46a7691b5a in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7b46a769361d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) frame #7: <unknown function> + 0xdc253 (0x7b7670e84253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) frame #8: <unknown function> + 0x94ac3 (0x7b7672d05ac3 in /lib/x86_64-linux-gnu/libc.so.6)
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) frame #9: <unknown function> + 0x126850 (0x7b7672d97850 in /lib/x86_64-linux-gnu/libc.so.6)
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) [2024-12-30 18:30:11,026 E 24191 24857] logging.cc:101: Unhandled exception: N3c1016DistBackendErrorE. what(): [PG ID 2 PG GUID 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7b4701d6c446 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7b4701d166e4 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7b4702178a18 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7b46a7685726 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7b46a768a3f0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7b46a7691b5a in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7b46a769361d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) frame #7: <unknown function> + 0xdc253 (0x7b7670e84253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) frame #8: <unknown function> + 0x94ac3 (0x7b7672d05ac3 in /lib/x86_64-linux-gnu/libc.so.6)
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) frame #9: <unknown function> + 0x126850 (0x7b7672d97850 in /lib/x86_64-linux-gnu/libc.so.6)
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7b4701d6c446 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) frame #1: <unknown function> + 0xe4271b (0x7b46a730071b in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) frame #2: <unknown function> + 0xdc253 (0x7b7670e84253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) frame #3: <unknown function> + 0x94ac3 (0x7b7672d05ac3 in /lib/x86_64-linux-gnu/libc.so.6)
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) frame #4: <unknown function> + 0x126850 (0x7b7672d97850 in /lib/x86_64-linux-gnu/libc.so.6)
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) [2024-12-30 18:30:11,037 E 24191 24857] logging.cc:108: Stack trace:
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) /usr/local/lib/python3.10/dist-packages/ray/_raylet.so(+0x101a47a) [0x7b7671fee47a] ray::operator<<()
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) /usr/local/lib/python3.10/dist-packages/ray/_raylet.so(+0x101cf38) [0x7b7671ff0f38] ray::TerminateHandler()
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c) [0x7b7670e5620c]
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae277) [0x7b7670e56277]
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae1fe) [0x7b7670e561fe]
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so(+0xe427c9) [0x7b46a73007c9] c10d::ProcessGroupNCCL::ncclCommWatchdog()
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) /lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7b7670e84253]
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7b7672d05ac3]
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) /lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7b7672d97850]
(MapWorker(MapBatches(VLLMPredictor)) pid=24191, ip=10.169.47.12) *** SIGABRT received at time=1735583411 on cpu 141 ***

@caiyueliang

Same problem with H20 and Qwen2.5-72B-instruct. After using export VLLM_ATTENTION_BACKEND=XFORMERS, the error no longer appears, but performance is degraded. The problem is most likely caused by the FlashAttentionBackend.

Thank you, this works for me.

@DaBossCoda

still getting this :/

@Yuhui0620

Same error in 0.6.6.post1 with Qwen2.5-72B-Instruct-GPTQ-Int4, but it did not appear in 0.6.3.post1. It seems to happen sporadically even when concurrency is low (e.g. keeping 5 running requests), but sometimes it can run normally for a long time even at higher concurrency (e.g. keeping 20 running requests).

Here is the trace log, same as @yuleiqin's:

vllm.engine.metrics 01-12 22:43:22 metrics.py:467] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 80.8 tokens/s, Running: 4 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 6.5%, CPU KV cache usage: 0.0%.
./vllm.log-121-INFO vllm.engine.metrics 01-12 22:43:22 metrics.py:483] Prefix cache hit rate: GPU: 1.59%, CPU: 0.00%
./vllm.log-122-INFO vllm.worker.model_runner_base 01-12 22:43:24 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250112-224324.pkl...
./vllm.log-123-WARNING vllm.worker.model_runner_base 01-12 22:43:24 model_runner_base.py:143] Failed to pickle inputs of failed execution: CUDA error: an illegal memory access was encountered
./vllm.log-124-CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
./vllm.log-125-For debugging consider passing CUDA_LAUNCH_BLOCKING=1
./vllm.log-126-Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
./vllm.log-127-
./vllm.log:128:ERROR vllm.executor.multiproc_worker_utils 01-12 22:43:29 multiproc_worker_utils.py:123] Worker VllmWorkerProcess pid 356 died, exit code: -6
./vllm.log-129-INFO vllm.executor.multiproc_worker_utils 01-12 22:43:29 multiproc_worker_utils.py:127] Killing local vLLM worker processes

@Yuhui0620

Additional information about the startup log:

INFO vllm.entrypoints.openai.api_server 01-14 16:14:17 api_server.py:712] vLLM API server version 0.6.6.post1
INFO vllm.entrypoints.openai.api_server 01-14 16:14:17 api_server.py:713] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=True, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/vllm-workspace/model', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='bfloat16', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=4096, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=True, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=True, max_seq_len_to_capture=8192, disable_custom_all_reduce=True, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['vllm-log-test'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
INFO vllm.entrypoints.openai.api_server 01-14 16:14:17 api_server.py:199] Started engine process with PID 85
WARNING vllm.config 01-14 16:14:17 config.py:2276] Casting torch.float16 to torch.bfloat16.
WARNING vllm.config 01-14 16:14:21 config.py:2276] Casting torch.float16 to torch.bfloat16.
INFO vllm.config 01-14 16:14:23 config.py:510] This model supports multiple tasks: {'reward', 'classify', 'score', 'generate', 'embed'}. Defaulting to 'generate'.
INFO vllm.model_executor.layers.quantization.gptq_marlin 01-14 16:14:24 gptq_marlin.py:109] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO vllm.config 01-14 16:14:24 config.py:1310] Defaulting to use mp for distributed inference
WARNING vllm.platforms.cuda 01-14 16:14:24 cuda.py:98] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
WARNING vllm.config 01-14 16:14:24 config.py:642] Async output processing is not supported on the current platform type cuda.
INFO vllm.config 01-14 16:14:26 config.py:510] This model supports multiple tasks: {'score', 'reward', 'classify', 'embed', 'generate'}. Defaulting to 'generate'.
INFO vllm.model_executor.layers.quantization.gptq_marlin 01-14 16:14:27 gptq_marlin.py:109] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO vllm.config 01-14 16:14:27 config.py:1310] Defaulting to use mp for distributed inference
WARNING vllm.platforms.cuda 01-14 16:14:27 cuda.py:98] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
WARNING vllm.config 01-14 16:14:27 config.py:642] Async output processing is not supported on the current platform type cuda.
INFO vllm.engine.llm_engine 01-14 16:14:27 llm_engine.py:235] Initializing an LLM engine (v0.6.6.post1) with config: model='/vllm-workspace/model', speculative_config=None, tokenizer='/vllm-workspace/model', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=gptq_marlin, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=vllm-log-test, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[],"max_capture_size":0}, use_cached_outputs=True, 
WARNING vllm.executor.multiproc_worker_utils 01-14 16:14:28 multiproc_worker_utils.py:312] Reducing Torch parallelism from 48 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO vllm.triton_utils.custom_cache_manager 01-14 16:14:28 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
INFO vllm.attention.selector 01-14 16:14:28 selector.py:120] Using Flash Attention backend.
INFO vllm.attention.selector 01-14 16:14:28 selector.py:120] Using Flash Attention backend.
INFO vllm.executor.multiproc_worker_utils 01-14 16:14:28 multiproc_worker_utils.py:222] Worker ready; awaiting tasks
INFO vllm.utils 01-14 16:14:29 utils.py:918] Found nccl from library libnccl.so.2
INFO vllm.utils 01-14 16:14:29 utils.py:918] Found nccl from library libnccl.so.2
INFO vllm.distributed.device_communicators.pynccl 01-14 16:14:29 pynccl.py:69] vLLM is using nccl==2.21.5
INFO vllm.distributed.device_communicators.pynccl 01-14 16:14:29 pynccl.py:69] vLLM is using nccl==2.21.5
INFO vllm.distributed.device_communicators.shm_broadcast 01-14 16:14:29 shm_broadcast.py:255] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_4ced56d5'), local_subscribe_port=37525, remote_subscribe_port=None)
INFO vllm.worker.model_runner 01-14 16:14:29 model_runner.py:1094] Starting to load model /vllm-workspace/model...
INFO vllm.worker.model_runner 01-14 16:14:29 model_runner.py:1094] Starting to load model /vllm-workspace/model...
INFO vllm.model_executor.layers.quantization.gptq_marlin 01-14 16:14:29 gptq_marlin.py:200] Using MarlinLinearKernel for GPTQMarlinLinearMethod
INFO vllm.model_executor.layers.quantization.gptq_marlin 01-14 16:14:29 gptq_marlin.py:200] Using MarlinLinearKernel for GPTQMarlinLinearMethod
INFO vllm.worker.model_runner 01-14 16:14:45 model_runner.py:1099] Loading model weights took 19.2663 GB
INFO vllm.worker.model_runner 01-14 16:14:46 model_runner.py:1099] Loading model weights took 19.2663 GB
INFO vllm.worker.worker 01-14 16:14:51 worker.py:241] Memory profiling takes 5.67 seconds
the current vLLM instance can use total_gpu_memory (39.39GiB) x gpu_memory_utilization (0.90) = 35.45GiB
model weights take 19.27GiB; non_torch_memory takes 0.54GiB; PyTorch activation peak memory takes 0.73GiB; the rest of the memory reserved for KV Cache is 14.91GiB.
INFO vllm.worker.worker 01-14 16:14:52 worker.py:241] Memory profiling takes 5.72 seconds
the current vLLM instance can use total_gpu_memory (39.39GiB) x gpu_memory_utilization (0.90) = 35.45GiB
model weights take 19.27GiB; non_torch_memory takes 0.56GiB; PyTorch activation peak memory takes 1.45GiB; the rest of the memory reserved for KV Cache is 14.18GiB.
INFO vllm.executor.distributed_gpu_executor 01-14 16:14:52 distributed_gpu_executor.py:57] # GPU blocks: 5807, # CPU blocks: 1638
INFO vllm.executor.distributed_gpu_executor 01-14 16:14:52 distributed_gpu_executor.py:61] Maximum concurrency for 4096 tokens per request: 22.68x
INFO vllm.engine.llm_engine 01-14 16:14:54 llm_engine.py:434] init engine (profile, create kv cache, warmup model) took 8.65 seconds
WARNING vllm.entrypoints.openai.api_server 01-14 16:14:55 api_server.py:589] CAUTION: Enabling X-Request-Id headers in the API Server. This can harm performance at high QPS.
INFO vllm.entrypoints.openai.api_server 01-14 16:14:55 api_server.py:640] Using supplied chat template:
None
INFO vllm.entrypoints.launcher 01-14 16:14:55 launcher.py:19] Available routes are:
INFO vllm.entrypoints.launcher 01-14 16:14:55 launcher.py:27] Route: /openapi.json, Methods: GET, HEAD
INFO vllm.entrypoints.launcher 01-14 16:14:55 launcher.py:27] Route: /docs, Methods: GET, HEAD
INFO vllm.entrypoints.launcher 01-14 16:14:55 launcher.py:27] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO vllm.entrypoints.launcher 01-14 16:14:55 launcher.py:27] Route: /redoc, Methods: GET, HEAD
INFO vllm.entrypoints.launcher 01-14 16:14:55 launcher.py:27] Route: /health, Methods: GET
INFO vllm.entrypoints.launcher 01-14 16:14:55 launcher.py:27] Route: /tokenize, Methods: POST
INFO vllm.entrypoints.launcher 01-14 16:14:55 launcher.py:27] Route: /detokenize, Methods: POST
INFO vllm.entrypoints.launcher 01-14 16:14:55 launcher.py:27] Route: /v1/models, Methods: GET
INFO vllm.entrypoints.launcher 01-14 16:14:55 launcher.py:27] Route: /version, Methods: GET
INFO vllm.entrypoints.launcher 01-14 16:14:55 launcher.py:27] Route: /v1/chat/completions, Methods: POST
INFO vllm.entrypoints.launcher 01-14 16:14:55 launcher.py:27] Route: /v1/completions, Methods: POST
INFO vllm.entrypoints.launcher 01-14 16:14:55 launcher.py:27] Route: /v1/embeddings, Methods: POST
INFO vllm.entrypoints.launcher 01-14 16:14:55 launcher.py:27] Route: /pooling, Methods: POST
INFO vllm.entrypoints.launcher 01-14 16:14:55 launcher.py:27] Route: /score, Methods: POST
INFO vllm.entrypoints.launcher 01-14 16:14:55 launcher.py:27] Route: /v1/score, Methods: POST

@sfc-gh-zhwang
Contributor
