
[Bug]: vllm crashes on v0.5.3.post1 #7161

Closed
tonyaw opened this issue Aug 5, 2024 · 7 comments
Labels
bug Something isn't working

Comments

tonyaw commented Aug 5, 2024

Your current environment

The output of `python collect_env.py`


🐛 Describe the bug

I'm using Llama 3.1 for inference, and the container crashes.

My command to start vllm:

        image: vllm/vllm-openai:v0.5.3.post1                                                                                             
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args: ["--model", "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4", "--host", "0.0.0.0", "--port", "8080", "--tensor-parallel-size", "2", "--trust-remote-code"]
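For reference, the spec above amounts to roughly this single command line (reconstructed by joining the command and args fields; it is not copied from the logs):

        python3 -m vllm.entrypoints.openai.api_server \
            --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
            --host 0.0.0.0 --port 8080 \
            --tensor-parallel-size 2 --trust-remote-code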

Logs when the container crashes:

[rank0]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa150ea6897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fa150e56b25 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fa150f7e718 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fa15217b8e6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7fa15217f9e8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7fa15218505c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fa152185dcc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xd6df4 (0x7fa19dc3cdf4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x8609 (0x7fa19ecfe609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7fa19ee38353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 2 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa150ea6897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fa150e56b25 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fa150f7e718 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fa15217b8e6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7fa15217f9e8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7fa15218505c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fa152185dcc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xd6df4 (0x7fa19dc3cdf4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x8609 (0x7fa19ecfe609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7fa19ee38353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa150ea6897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe32119 (0x7fa151e09119 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd6df4 (0x7fa19dc3cdf4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x8609 (0x7fa19ecfe609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7fa19ee38353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
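As a debugging step (not a fix), the error message's own suggestion can be applied by adding environment variables to the container spec above. A minimal sketch; CUDA_LAUNCH_BLOCKING comes straight from the log, while NCCL_DEBUG is an optional extra assumption:

        env:
          - name: CUDA_LAUNCH_BLOCKING   # suggested by the error message: synchronous launches report the failing kernel at the right call site
            value: "1"
          - name: NCCL_DEBUG             # optional assumption: more verbose NCCL logging around the failing collective
            value: "INFO"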
mgoin (Member) commented Aug 5, 2024

Hi @tonyaw this is a known issue that should be resolved by #6798. We will publish a new release this week with this included.

mgoin closed this as completed Aug 5, 2024
tonyaw (Author) commented Aug 6, 2024

Thanks for the info. Is there any workaround to avoid the crash?

Minami-su commented

The same issue still occurs when using Qwen2 72B Instruct with vllm 0.5.4.

mgoin (Member) commented Aug 9, 2024

@tonyaw please try upgrading to 0.5.4

@Minami-su can you share your command, system details, and stacktrace?
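Concretely, the suggested upgrade is a one-line change for the deployment in the original report (assuming the image tag follows the same naming scheme as v0.5.3.post1):

        image: vllm/vllm-openai:v0.5.4

or, for a pip-managed environment like the one in the following comment:

        pip install -U vllm==0.5.4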

Minami-su commented

> @tonyaw please try upgrading to 0.5.4
> @Minami-su can you share your command, system details, and stacktrace?

from vllm import LLM, SamplingParams  # imports implied by the snippet

self.sampling_params = SamplingParams(temperature=0.95, top_p=0.9, top_k=20, repetition_penalty=1.05, max_tokens=8000, stop_token_ids=[tokenizer.eos_token_id])
self.llm = LLM(model=model_path, enforce_eager=True, tensor_parallel_size=2, disable_custom_all_reduce=True, gpu_memory_utilization=gpu_memory_utilization)
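For context, the snippet above only constructs the objects; a minimal sketch of how they are typically driven inside the same class (the prompt list is an illustrative placeholder, not from the original report):

        prompts = ["Hello"]  # illustrative placeholder
        outputs = self.llm.generate(prompts, self.sampling_params)
        for output in outputs:
            print(output.outputs[0].text)  # first completion for each prompt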

[rank0]:[E811 01:10:50.482336908 ProcessGroupNCCL.cpp:1515] [PG 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7b39e1177f86 in /work/jcxy/anaconda3/envs/haolu/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7b39e1126d10 in /work/jcxy/anaconda3/envs/haolu/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7b39e155bee8 in /work/jcxy/anaconda3/envs/haolu/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7b398f3948a6 in /work/jcxy/anaconda3/envs/haolu/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7b398f399ac0 in /work/jcxy/anaconda3/envs/haolu/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7b398f3a077a in /work/jcxy/anaconda3/envs/haolu/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7b398f3a2bbc in /work/jcxy/anaconda3/envs/haolu/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xdbbf4 (0x7b39e0adbbf4 in /media/kemove/3.8TB/jcxy/anaconda3/envs/haolu/bin/../lib/libstdc++.so.6)
frame #8: + 0x94ac3 (0x7b39e2a94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: + 0x126850 (0x7b39e2b26850 in /lib/x86_64-linux-gnu/libc.so.6)

[rank1]:[E811 01:10:50.487303986 ProcessGroupNCCL.cpp:1515] [PG 3 Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7b39e1177f86 in /work/jcxy/anaconda3/envs/haolu/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7b39e1126d10 in /work/jcxy/anaconda3/envs/haolu/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7b39e155bee8 in /work/jcxy/anaconda3/envs/haolu/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7b398f3948a6 in /work/jcxy/anaconda3/envs/haolu/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7b398f399ac0 in /work/jcxy/anaconda3/envs/haolu/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7b398f3a077a in /work/jcxy/anaconda3/envs/haolu/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7b398f3a2bbc in /work/jcxy/anaconda3/envs/haolu/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xdbbf4 (0x7b39e0adbbf4 in /media/kemove/3.8TB/jcxy/anaconda3/envs/haolu/bin/../lib/libstdc++.so.6)
frame #8: + 0x94ac3 (0x7b39e2a94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: + 0x126850 (0x7b39e2b26850 in /lib/x86_64-linux-gnu/libc.so.6)

/work/jcxy/anaconda3/envs/haolu/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '

tonyaw (Author) commented Aug 12, 2024

It looks like the backtrace is the same as in #7297, but the container hangs instead of crashing.

tonyaw (Author) commented Aug 13, 2024

Please see my previous message. It looks like this issue isn't fully fixed. Could you please confirm?
