
[Bug]: vllm crashes on v0.5.3.post1 #7161

Closed
tonyaw opened this issue Aug 5, 2024 · 7 comments
Labels
bug Something isn't working

Comments

tonyaw commented Aug 5, 2024

Your current environment

The output of `python collect_env.py`


🐛 Describe the bug

I'm using Llama 3.1 for inference, and the container crashes.

My command to start vllm:

        image: vllm/vllm-openai:v0.5.3.post1                                                                                             
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args: ["--model", "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4", "--host", "0.0.0.0", "--port", "8080", "--tensor-parallel-size", "2", "--trust-remote-code"]
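For reference, the spec above amounts to roughly this single command line (reconstructed by joining the command and args fields; it is not copied from the logs):

        python3 -m vllm.entrypoints.openai.api_server \
            --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
            --host 0.0.0.0 --port 8080 \
            --tensor-parallel-size 2 --trust-remote-code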

Logs when the container crashes:

[rank0]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa150ea6897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fa150e56b25 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fa150f7e718 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fa15217b8e6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7fa15217f9e8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7fa15218505c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fa152185dcc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xd6df4 (0x7fa19dc3cdf4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x8609 (0x7fa19ecfe609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7fa19ee38353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 2 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa150ea6897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fa150e56b25 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fa150f7e718 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fa15217b8e6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7fa15217f9e8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7fa15218505c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fa152185dcc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xd6df4 (0x7fa19dc3cdf4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x8609 (0x7fa19ecfe609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7fa19ee38353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa150ea6897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe32119 (0x7fa151e09119 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd6df4 (0x7fa19dc3cdf4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x8609 (0x7fa19ecfe609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7fa19ee38353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
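As a debugging step (not a fix), the error message's own suggestion can be applied by adding environment variables to the container spec above. A minimal sketch; CUDA_LAUNCH_BLOCKING comes straight from the log, while NCCL_DEBUG is an optional extra assumption:

        env:
          - name: CUDA_LAUNCH_BLOCKING   # suggested by the error message: synchronous launches report the failing kernel at the right call site
            value: "1"
          - name: NCCL_DEBUG             # optional assumption: more verbose NCCL logging around the failing collective
            value: "INFO"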
mgoin (Member) commented Aug 5, 2024

Hi @tonyaw this is a known issue that should be resolved by #6798. We will publish a new release this week with this included.

mgoin closed this as completed Aug 5, 2024
tonyaw (Author) commented Aug 6, 2024

Thanks for the info. Is there any workaround to avoid the crash?

Minami-su commented

The same issue still occurs when using Qwen2 72B Instruct with vllm 0.5.4.

mgoin (Member) commented Aug 9, 2024

@tonyaw please try upgrading to 0.5.4

@Minami-su can you share your command, system details, and stacktrace?
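Concretely, the suggested upgrade is a one-line change for the deployment in the original report (assuming the image tag follows the same naming scheme as v0.5.3.post1):

        image: vllm/vllm-openai:v0.5.4

or, for a pip-managed environment like the one in the following comment:

        pip install -U vllm==0.5.4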

Minami-su commented

> @tonyaw please try upgrading to 0.5.4
> @Minami-su can you share your command, system details, and stacktrace?

from vllm import LLM, SamplingParams  # imports implied by the snippet

self.sampling_params = SamplingParams(temperature=0.95, top_p=0.9, top_k=20, repetition_penalty=1.05, max_tokens=8000, stop_token_ids=[tokenizer.eos_token_id])
self.llm = LLM(model=model_path, enforce_eager=True, tensor_parallel_size=2, disable_custom_all_reduce=True, gpu_memory_utilization=gpu_memory_utilization)
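For context, the snippet above only constructs the objects; a minimal sketch of how they are typically driven inside the same class (the prompt list is an illustrative placeholder, not from the original report):

        prompts = ["Hello"]  # illustrative placeholder
        outputs = self.llm.generate(prompts, self.sampling_params)
        for output in outputs:
            print(output.outputs[0].text)  # first completion for each prompt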

[rank0]:[E811 01:10:50.482336908 ProcessGroupNCCL.cpp:1515] [PG 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7b39e1177f86 in /work/jcxy/anaconda3/envs/haolu/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7b39e1126d10 in /work/jcxy/anaconda3/envs/haolu/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7b39e155bee8 in /work/jcxy/anaconda3/envs/haolu/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7b398f3948a6 in /work/jcxy/anaconda3/envs/haolu/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7b398f399ac0 in /work/jcxy/anaconda3/envs/haolu/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7b398f3a077a in /work/jcxy/anaconda3/envs/haolu/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7b398f3a2bbc in /work/jcxy/anaconda3/envs/haolu/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xdbbf4 (0x7b39e0adbbf4 in /media/kemove/3.8TB/jcxy/anaconda3/envs/haolu/bin/../lib/libstdc++.so.6)
frame #8: + 0x94ac3 (0x7b39e2a94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: + 0x126850 (0x7b39e2b26850 in /lib/x86_64-linux-gnu/libc.so.6)

[rank1]:[E811 01:10:50.487303986 ProcessGroupNCCL.cpp:1515] [PG 3 Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7b39e1177f86 in /work/jcxy/anaconda3/envs/haolu/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7b39e1126d10 in /work/jcxy/anaconda3/envs/haolu/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7b39e155bee8 in /work/jcxy/anaconda3/envs/haolu/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7b398f3948a6 in /work/jcxy/anaconda3/envs/haolu/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7b398f399ac0 in /work/jcxy/anaconda3/envs/haolu/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7b398f3a077a in /work/jcxy/anaconda3/envs/haolu/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7b398f3a2bbc in /work/jcxy/anaconda3/envs/haolu/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xdbbf4 (0x7b39e0adbbf4 in /media/kemove/3.8TB/jcxy/anaconda3/envs/haolu/bin/../lib/libstdc++.so.6)
frame #8: + 0x94ac3 (0x7b39e2a94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: + 0x126850 (0x7b39e2b26850 in /lib/x86_64-linux-gnu/libc.so.6)

/work/jcxy/anaconda3/envs/haolu/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '

tonyaw (Author) commented Aug 12, 2024

It looks like the backtrace is the same as in #7297, but the container hangs instead of crashing.

tonyaw (Author) commented Aug 13, 2024

Please see my previous message. It looks like this issue isn't fully fixed. Could you please confirm?
