[Bug]: CUDA illegal memory access error when enable_prefix_caching=True
#5537
Comments
Can you follow https://docs.vllm.ai/en/latest/getting_started/debugging.html to investigate which Python function causes the crash? |
Also, please try the latest patch, v0.5.0.post1, which might fix one of the root causes. |
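For reference, a minimal sketch of the debug setup that guide points to, using the same environment variables that show up in the docker command later in this thread; the model id is a placeholder, not from the report:

```python
# Sketch of a debug run. These variables must be set before vLLM starts
# its workers; the exact set recommended may vary between versions.
import os

os.environ["VLLM_LOGGING_LEVEL"] = "DEBUG"   # verbose engine logs
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"     # surface the failing CUDA kernel synchronously
os.environ["NCCL_DEBUG"] = "TRACE"           # verbose NCCL logs for multi-GPU runs
os.environ["VLLM_TRACE_FUNCTION"] = "1"      # trace which Python function triggers the crash

from vllm import LLM

llm = LLM(model="facebook/opt-125m")  # placeholder model id
```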
Thanks for the quick response from both of you! Through experiments, I managed to further narrow the issue down - it only occurs when `enable_prefix_caching=True`. @simon-mo I'm already using the pip version. @youkaichao Here's the output with the env vars from your link set:
|
@robertgshaw2-neuralmagic for prefix caching. |
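In other words, the crash appears to hinge on a single engine argument. A minimal sketch of the toggle, with a placeholder model id:

```python
# The reporter observed the crash only when the flag below is True;
# with enable_prefix_caching=False the same workload ran fine.
from vllm import LLM

llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)  # placeholder model id
```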
Will try this when I have the chance: #5376 (comment) |
This is happening to me as well. Here's the error log I have.
|
@ashgold does the model have quantized KV cache or just the layers? |
@robertgshaw2-neuralmagic |
I had the same issue. I guess the following two features conflict with each other:
So we have to disable one of them. That means we have to use |
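A minimal sketch of the workaround being described, assuming the conflicting pair is fp8 KV-cache quantization and prefix caching (the thread hints at this but never states both features explicitly); the model id is a placeholder:

```python
# Sketch: use only one of the two features at a time, e.g. keep the
# fp8-quantized KV cache and turn prefix caching off (or vice versa).
# The conflicting pair is an assumption based on the surrounding comments.
from vllm import LLM

llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",  # hypothetical placeholder model id
    kv_cache_dtype="fp8",
    enable_prefix_caching=False,
)
```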
Hi @mpoemsl, have you resolved the issue? If so, what steps did you take to do so? If not, can you please provide steps to reproduce it using the following template (replacing the italicized values with your values)?
|
I had the same issue with DeepSeek: I use `enable_prefix_caching=True`, fp8, and tp=2; the vLLM version is 0.5.1. |
Is this issue solved in the latest version? |
Hi guys, `VLLM_ATTENTION_BACKEND=XFORMERS` gives suboptimal latencies, and prefix caching seems to not give any gains. |
@simon-mo if you could please share some insights, it would be immensely helpful. |
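For reference, the attention backend is selected through an environment variable that must be set before the engine starts; a minimal sketch with a placeholder model id:

```python
# Sketch: selecting the xFormers attention backend, which some users tried
# as a workaround. Set the variable before the engine is created.
import os

os.environ["VLLM_ATTENTION_BACKEND"] = "XFORMERS"

from vllm import LLM

llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)  # placeholder model id
```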
I have the same issue with vLLM. The issue reproduces every time I send my 64 requests. When sending 1-4 requests, everything is OK. Hope it helps in debugging.
Output of `python collect_env.py` outside docker
Output of `python collect_env.py` inside docker
What I run:
```bash
GPU_COUNT=8
MODEL_PATH=Qwen2.5-Coder-32B-Instruct
QUANT_TYPE=bfloat16
PIPELINE_PARALLEL=1
PORT=9000
MAX_BATCH_SIZE=32
MAX_INPUT_TOKENS=20000
MAX_OUTPUT_TOKENS=10000
max_num_batched_tokens=64000
SERVED_MODEL_NAME=ensemble
VLLM_ATTENTION_BACKEND=FLASH_ATTN

docker run --rm -it --runtime nvidia --gpus all \
    -v /home/models/huggingface/:/models \
    -v /home/columpio/ai-inference:/ai-inference \
    -p ${PORT}:${PORT} \
    --ipc=host \
    -e VLLM_ATTENTION_BACKEND=${VLLM_ATTENTION_BACKEND} \
    -e VLLM_LOGGING_LEVEL=DEBUG \
    -e CUDA_LAUNCH_BLOCKING=1 \
    -e NCCL_DEBUG=TRACE \
    -e VLLM_TRACE_FUNCTION=1 \
    vllm/vllm-openai:v0.6.3.post1 \
    --port ${PORT} \
    --tensor-parallel-size ${GPU_COUNT} \
    --pipeline-parallel-size ${PIPELINE_PARALLEL} \
    --dtype ${QUANT_TYPE} \
    --gpu-memory-utilization 0.9 \
    --max-num-seqs ${MAX_BATCH_SIZE} \
    --max-model-len $((MAX_INPUT_TOKENS + MAX_OUTPUT_TOKENS)) \
    --max_num_batched_tokens ${max_num_batched_tokens} \
    --served-model-name ${SERVED_MODEL_NAME} \
    --model /models/${MODEL_PATH} 2>&1 | tee vllm.log
```
Then I send 64 requests with 15000 tokens each from another server (a sketch approximating that client load follows below). I also tried modifying the above parameters a bit and here is what I get.
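The client side of that load pattern could be approximated as follows; the endpoint, served model name, and prompt length mirror the command above, but the client code itself is illustrative and not from the original report:

```python
# Illustrative client: 64 concurrent requests of roughly 15000 input tokens
# each against the OpenAI-compatible server started by the docker command above.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:9000/v1", api_key="EMPTY")
prompt = "word " * 15000  # crude stand-in for a ~15000-token prompt

async def one_request() -> str:
    resp = await client.completions.create(
        model="ensemble",  # SERVED_MODEL_NAME from the command above
        prompt=prompt,
        max_tokens=256,
    )
    return resp.choices[0].text

async def main() -> None:
    await asyncio.gather(*(one_request() for _ in range(64)))

asyncio.run(main())
```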
|
Your current environment
🐛 Describe the bug
While testing the new version, I ran into this CUDA error (not immediately, after a few successful iterations).
The only change compared to the previously working setup was the vLLM version upgrade and enabling prefix caching. The model served is Mixtral 8x7B (unquantized). Note that `CUDA_VISIBLE_DEVICES` was set to `0,1,2,3` with `tensor_parallel=4` to only use the first four GPUs (the machine has eight in total).
Update: no issues with `v0.4.3`.
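A minimal sketch of the reported setup, as far as it can be reconstructed from the description above (the exact Hugging Face model id is an assumption):

```python
# Sketch of the reported setup: Mixtral 8x7B (unquantized), first four GPUs,
# tensor parallel 4, prefix caching enabled.
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"  # only the first four of eight GPUs

from vllm import LLM

llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # assumed id; the issue only says "Mixtral 8x7B"
    tensor_parallel_size=4,
    enable_prefix_caching=True,
)
```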