[Bug]: v0.6.4.post1 crashed: Error in model execution: CUDA error: an illegal memory access was encountered #10389
Getting this a lot since 0.6.3. Seems to be related to AWQ models. |
Same situation here. Can anyone solve this? |
Experiencing this as well. I thought this would be fixed by #9532, but I'm still seeing it since 0.6.3. Edit: still experiencing this in 0.6.2. |
I encountered the same problem and was quite confused during the process. |
INFO 11-19 11:15:57 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241119-111557.pkl... |
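For anyone trying to debug from that dump: depending on the vLLM version, the file written to /tmp/err_execute_model_input_*.pkl appears to be a plain Python pickle of the failed model input, so a rough sketch like the one below may help inspect it. This assumes the matching vLLM version is importable in the environment doing the unpickling (the dump references vLLM's own classes); the path is taken from the log line above.

```python
# Rough sketch (not an official vLLM debugging tool): inspect the dumped
# model input from a failed execution. Assumes the dump is a plain pickle
# referencing vLLM classes, so a matching vLLM install must be importable.
import pickle

DUMP_PATH = "/tmp/err_execute_model_input_20241119-111557.pkl"  # from the log above

with open(DUMP_PATH, "rb") as f:
    dumped = pickle.load(f)

# Print whatever the dump exposes (type, batch/sequence metadata, etc.).
print(type(dumped))
print(dumped)
```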
Same problem in 0.6.1, 0.6.3.post1, and 0.6.4.post1. |
happens to me in 0.6.2 too |
Same for me on Llama 3.1 70B AWQ, from 0.6.1 to 0.6.4.post1. |
Same for me on Qwen 2.5-72B. |
Same issue for Qwen-2.5-72B-GPTQ-INT4 with 0.6.4.post1. |
Going back to 0.6.0 fixed the issue for me, but unfortunately it's noticeably slower. |
I hit a similar bug in 0.6.3; changing the attention backend to FlashInfer fixed it. |
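For anyone who wants to try the same workaround, here is a minimal sketch of forcing the FlashInfer backend via the VLLM_ATTENTION_BACKEND environment variable. FlashInfer has to be installed separately, and the model name and tensor-parallel size below are placeholders, not taken from this thread.

```python
# Minimal workaround sketch: select the FlashInfer attention backend.
# The variable must be set before vLLM is imported so the backend selector
# picks it up. Model name and tensor_parallel_size are placeholders.
import os

os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-72B-Instruct-GPTQ-Int4", tensor_parallel_size=2)
outputs = llm.generate(["Hello, world"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```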
@sasha0552 It seems that the problem has not been fixed. Could you continue looking into it? Thank you. |
me too |
same for me on llava-onevision-qwen2-7b-ov-hf |
@wciq1208 @DaBossCoda @llmforever @Ryosuke0104 @Xingkangze @junior-zsy @TopIdiot @nelsonspbr @badrjd @linfan @seven1122 Do you have a reproducible script by any chance (also with the exact vLLM version)? It would be nice if it's on an H100 GPU. I tried the command (qwen-2.5-14b) that @wciq1208 posted, but wasn't able to reproduce the bug. |
My version is: 0.6.3.post1 |
I also tried on A100 + L4, haven't been able to repro. |
Same issue for Llama 3.1 70B AWQ with 0.6.3.post1. A CUDA error occurred: an illegal memory access was encountered. The process dies but the GPU memory is not freed. Script: |
@WoosukKwon @robertgshaw2-neuralmagic 2x A800 for Llama 3.1 70B AWQ. Test script updated: |
I could sporadically reproduce the issue using the serving benchmarks. I used one command to start vLLM and another to run the benchmarks (see the illustrative sketch below). |
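The exact commands were not preserved in this thread. As a rough illustration only, the sketch below keeps a fixed number of chat-completion requests in flight against a locally running vLLM OpenAI-compatible server, which is the kind of sustained load under which commenters report the crash; the port, model name, request count, and concurrency level are assumptions.

```python
# Hypothetical load sketch against a vLLM OpenAI-compatible server that is
# already running on localhost:8000 (started separately). Model name,
# prompt, request count, and concurrency are placeholders.
import concurrent.futures

import requests

URL = "http://localhost:8000/v1/chat/completions"
MODEL = "Qwen/Qwen2.5-72B-Instruct-GPTQ-Int4"  # placeholder

def one_request(i: int) -> int:
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": f"Write a short story #{i}."}],
        "max_tokens": 256,
    }
    return requests.post(URL, json=payload, timeout=600).status_code

# Keep roughly 20 requests in flight, similar to the concurrency levels
# mentioned later in this thread.
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
    for status in pool.map(one_request, range(200)):
        print(status)
```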
Same problem here with H20 and Qwen2.5-72B, with either BF16 or FP16. |
Same problem with H20 and Qwen2.5-72B-instruct. |
Same for me with https://huggingface.co/neuralmagic-ent/Llama-3.3-70B-Instruct-FP8-dynamic |
Thank you, this works for me. |
still getting this :/ |
Same error in 0.6.6.post1 with Qwen2.5-72B-Instruct-GPTQ-Int4, but it did not appear in 0.6.3.post1. It seems to happen sporadically even when concurrency is low (e.g. keeping 5 running requests), yet it can sometimes run normally for a long time at higher concurrency (e.g. keeping 20 running requests). Here is the trace log, same as @yuleiqin's:
vllm.engine.metrics 01-12 22:43:22 metrics.py:467] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 80.8 tokens/s, Running: 4 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 6.5%, CPU KV cache usage: 0.0%.
./vllm.log-121-INFO vllm.engine.metrics 01-12 22:43:22 metrics.py:483] Prefix cache hit rate: GPU: 1.59%, CPU: 0.00%
./vllm.log-122-INFO vllm.worker.model_runner_base 01-12 22:43:24 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250112-224324.pkl...
./vllm.log-123-WARNING vllm.worker.model_runner_base 01-12 22:43:24 model_runner_base.py:143] Failed to pickle inputs of failed execution: CUDA error: an illegal memory access was encountered
./vllm.log-124-CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
./vllm.log-125-For debugging consider passing CUDA_LAUNCH_BLOCKING=1
./vllm.log-126-Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
./vllm.log-127-
./vllm.log:128:ERROR vllm.executor.multiproc_worker_utils 01-12 22:43:29 multiproc_worker_utils.py:123] Worker VllmWorkerProcess pid 356 died, exit code: -6
./vllm.log-129-INFO vllm.executor.multiproc_worker_utils 01-12 22:43:29 multiproc_worker_utils.py:127] Killing local vLLM worker processes |
Additional information about the startup log:
INFO vllm.entrypoints.openai.api_server 01-14 16:14:17 api_server.py:712] vLLM API server version 0.6.6.post1
INFO vllm.entrypoints.openai.api_server 01-14 16:14:17 api_server.py:713] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=True, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/vllm-workspace/model', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='bfloat16', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=4096, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=True, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=True, max_seq_len_to_capture=8192, disable_custom_all_reduce=True, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['vllm-log-test'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
INFO vllm.entrypoints.openai.api_server 01-14 16:14:17 api_server.py:199] Started engine process with PID 85
WARNING vllm.config 01-14 16:14:17 config.py:2276] Casting torch.float16 to torch.bfloat16.
WARNING vllm.config 01-14 16:14:21 config.py:2276] Casting torch.float16 to torch.bfloat16.
INFO vllm.config 01-14 16:14:23 config.py:510] This model supports multiple tasks: {'reward', 'classify', 'score', 'generate', 'embed'}. Defaulting to 'generate'.
INFO vllm.model_executor.layers.quantization.gptq_marlin 01-14 16:14:24 gptq_marlin.py:109] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO vllm.config 01-14 16:14:24 config.py:1310] Defaulting to use mp for distributed inference
WARNING vllm.platforms.cuda 01-14 16:14:24 cuda.py:98] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
WARNING vllm.config 01-14 16:14:24 config.py:642] Async output processing is not supported on the current platform type cuda.
INFO vllm.config 01-14 16:14:26 config.py:510] This model supports multiple tasks: {'score', 'reward', 'classify', 'embed', 'generate'}. Defaulting to 'generate'.
INFO vllm.model_executor.layers.quantization.gptq_marlin 01-14 16:14:27 gptq_marlin.py:109] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO vllm.config 01-14 16:14:27 config.py:1310] Defaulting to use mp for distributed inference
WARNING vllm.platforms.cuda 01-14 16:14:27 cuda.py:98] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
WARNING vllm.config 01-14 16:14:27 config.py:642] Async output processing is not supported on the current platform type cuda.
INFO vllm.engine.llm_engine 01-14 16:14:27 llm_engine.py:235] Initializing an LLM engine (v0.6.6.post1) with config: model='/vllm-workspace/model', speculative_config=None, tokenizer='/vllm-workspace/model', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=gptq_marlin, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=vllm-log-test, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[],"max_capture_size":0}, use_cached_outputs=True,
WARNING vllm.executor.multiproc_worker_utils 01-14 16:14:28 multiproc_worker_utils.py:312] Reducing Torch parallelism from 48 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO vllm.triton_utils.custom_cache_manager 01-14 16:14:28 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
INFO vllm.attention.selector 01-14 16:14:28 selector.py:120] Using Flash Attention backend.
INFO vllm.attention.selector 01-14 16:14:28 selector.py:120] Using Flash Attention backend.
INFO vllm.executor.multiproc_worker_utils 01-14 16:14:28 multiproc_worker_utils.py:222] Worker ready; awaiting tasks
INFO vllm.utils 01-14 16:14:29 utils.py:918] Found nccl from library libnccl.so.2
INFO vllm.utils 01-14 16:14:29 utils.py:918] Found nccl from library libnccl.so.2
INFO vllm.distributed.device_communicators.pynccl 01-14 16:14:29 pynccl.py:69] vLLM is using nccl==2.21.5
INFO vllm.distributed.device_communicators.pynccl 01-14 16:14:29 pynccl.py:69] vLLM is using nccl==2.21.5
INFO vllm.distributed.device_communicators.shm_broadcast 01-14 16:14:29 shm_broadcast.py:255] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_4ced56d5'), local_subscribe_port=37525, remote_subscribe_port=None)
INFO vllm.worker.model_runner 01-14 16:14:29 model_runner.py:1094] Starting to load model /vllm-workspace/model...
INFO vllm.worker.model_runner 01-14 16:14:29 model_runner.py:1094] Starting to load model /vllm-workspace/model...
INFO vllm.model_executor.layers.quantization.gptq_marlin 01-14 16:14:29 gptq_marlin.py:200] Using MarlinLinearKernel for GPTQMarlinLinearMethod
INFO vllm.model_executor.layers.quantization.gptq_marlin 01-14 16:14:29 gptq_marlin.py:200] Using MarlinLinearKernel for GPTQMarlinLinearMethod
INFO vllm.worker.model_runner 01-14 16:14:45 model_runner.py:1099] Loading model weights took 19.2663 GB
INFO vllm.worker.model_runner 01-14 16:14:46 model_runner.py:1099] Loading model weights took 19.2663 GB
INFO vllm.worker.worker 01-14 16:14:51 worker.py:241] Memory profiling takes 5.67 seconds
the current vLLM instance can use total_gpu_memory (39.39GiB) x gpu_memory_utilization (0.90) = 35.45GiB
model weights take 19.27GiB; non_torch_memory takes 0.54GiB; PyTorch activation peak memory takes 0.73GiB; the rest of the memory reserved for KV Cache is 14.91GiB.
INFO vllm.worker.worker 01-14 16:14:52 worker.py:241] Memory profiling takes 5.72 seconds
the current vLLM instance can use total_gpu_memory (39.39GiB) x gpu_memory_utilization (0.90) = 35.45GiB
model weights take 19.27GiB; non_torch_memory takes 0.56GiB; PyTorch activation peak memory takes 1.45GiB; the rest of the memory reserved for KV Cache is 14.18GiB.
INFO vllm.executor.distributed_gpu_executor 01-14 16:14:52 distributed_gpu_executor.py:57] # GPU blocks: 5807, # CPU blocks: 1638
INFO vllm.executor.distributed_gpu_executor 01-14 16:14:52 distributed_gpu_executor.py:61] Maximum concurrency for 4096 tokens per request: 22.68x
INFO vllm.engine.llm_engine 01-14 16:14:54 llm_engine.py:434] init engine (profile, create kv cache, warmup model) took 8.65 seconds
WARNING vllm.entrypoints.openai.api_server 01-14 16:14:55 api_server.py:589] CAUTION: Enabling X-Request-Id headers in the API Server. This can harm performance at high QPS.
INFO vllm.entrypoints.openai.api_server 01-14 16:14:55 api_server.py:640] Using supplied chat template:
None
INFO vllm.entrypoints.launcher 01-14 16:14:55 launcher.py:19] Available routes are:
INFO vllm.entrypoints.launcher 01-14 16:14:55 launcher.py:27] Route: /openapi.json, Methods: GET, HEAD
INFO vllm.entrypoints.launcher 01-14 16:14:55 launcher.py:27] Route: /docs, Methods: GET, HEAD
INFO vllm.entrypoints.launcher 01-14 16:14:55 launcher.py:27] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO vllm.entrypoints.launcher 01-14 16:14:55 launcher.py:27] Route: /redoc, Methods: GET, HEAD
INFO vllm.entrypoints.launcher 01-14 16:14:55 launcher.py:27] Route: /health, Methods: GET
INFO vllm.entrypoints.launcher 01-14 16:14:55 launcher.py:27] Route: /tokenize, Methods: POST
INFO vllm.entrypoints.launcher 01-14 16:14:55 launcher.py:27] Route: /detokenize, Methods: POST
INFO vllm.entrypoints.launcher 01-14 16:14:55 launcher.py:27] Route: /v1/models, Methods: GET
INFO vllm.entrypoints.launcher 01-14 16:14:55 launcher.py:27] Route: /version, Methods: GET
INFO vllm.entrypoints.launcher 01-14 16:14:55 launcher.py:27] Route: /v1/chat/completions, Methods: POST
INFO vllm.entrypoints.launcher 01-14 16:14:55 launcher.py:27] Route: /v1/completions, Methods: POST
INFO vllm.entrypoints.launcher 01-14 16:14:55 launcher.py:27] Route: /v1/embeddings, Methods: POST
INFO vllm.entrypoints.launcher 01-14 16:14:55 launcher.py:27] Route: /pooling, Methods: POST
INFO vllm.entrypoints.launcher 01-14 16:14:55 launcher.py:27] Route: /score, Methods: POST
INFO vllm.entrypoints.launcher 01-14 16:14:55 launcher.py:27] Route: /v1/score, Methods: POST |
one more datapoint and log: https://gist.github.com/sfc-gh-zhwang/de5ee2ce397d50e2e9c44b2a43a7bfe7 |
Your current environment
The output of `python collect_env.py`
Model Input Dumps
err_execute_model_input_20241116-081810.zip
🐛 Describe the bug
command