[Bug]: vllm crashes on v0.5.3.post1 #7161
Comments
Thanks for your info. Is there any workaround to avoid the crash? |
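As a hedged sketch rather than a confirmed fix for this crash: two vLLM options that are commonly tried against illegal-memory-access failures are eager execution and disabling the custom all-reduce kernel. The model name and tensor-parallel size below are placeholders, not this issue's actual configuration.

```shell
# Hedged workaround sketch: --enforce-eager and --disable-custom-all-reduce
# are real vLLM flags; the model and --tensor-parallel-size value are
# placeholders, not the reporter's configuration.
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --enforce-eager \
    --disable-custom-all-reduce
```

--enforce-eager disables CUDA graph capture, trading some throughput for simpler execution, which can sidestep graph-related memory faults.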
The same issue still occurs when using Qwen2 72B Instruct with vllm 0.5.4. |
@tonyaw, please try upgrading to 0.5.4. @Minami-su, can you share your command, system details, and stack trace? |
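For the system details, vLLM's repository includes a collect_env.py script at its root that gathers the environment report used in bug reports; running it is a straightforward way to produce what is requested above.

```shell
# Run from a vLLM source checkout; prints PyTorch/CUDA versions,
# GPU and driver details, and installed package versions.
python collect_env.py
```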
```
[rank0]:[E811 01:10:50.482336908 ProcessGroupNCCL.cpp:1515] [PG 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
[rank1]:[E811 01:10:50.487303986 ProcessGroupNCCL.cpp:1515] [PG 3 Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
/work/jcxy/anaconda3/envs/haolu/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
```
|
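A general note on logs like the one above: the NCCL watchdog fires after the real fault, so the reported location is rarely the offending op. Two standard environment variables, shown here as a debugging sketch rather than something taken from this thread, make the failure easier to localize.

```shell
# Debugging sketch using standard PyTorch/NCCL variables.
# CUDA_LAUNCH_BLOCKING=1 forces synchronous kernel launches so the error
# surfaces at the op that caused it; NCCL_DEBUG=INFO logs collective setup.
export CUDA_LAUNCH_BLOCKING=1
export NCCL_DEBUG=INFO
```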
It looks like the backtrace is the same as #7297. |
Please see my previous message. |
Your current environment
It is
🐛 Describe the bug
I'm using Llama 3.1 for inference, and the container crashes.
My command to start vllm:
Logs from when the container crashed: