[Bug]: Structured output inference often takes a very long time, eventually causing a timeout and crashing the vLLM engine #10081
Comments
Structured Output seems to function correctly only with simple structures (fewer than 3 fields). With 4 or more fields, there is a high likelihood of a crash like the one described above. I tested one task, and it worked fine when I split it into multiple tasks; otherwise, the task would crash.
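For illustration, this is roughly the shape of schema that separates the two cases for us (these are illustrative schemas, not our production ones):

```python
# Illustrative JSON schemas only (not the exact production ones).
# Requests guided by the small schema complete normally; requests guided by the
# larger one reliably stall the engine for us until it times out.
SMALL_SCHEMA = {  # 2 fields -- works
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "score": {"type": "number"},
    },
    "required": ["title", "score"],
}

LARGE_SCHEMA = {  # 5 fields -- frequently hangs
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "summary": {"type": "string"},
        "category": {"type": "string"},
        "score": {"type": "number"},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["title", "summary", "category", "score", "tags"],
}
```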
May be related to #9032
@hpx502766238 @DarkLight1337 Does anyone have an idea about this bug?
The throughput reported by vLLM on the terminal: Avg prompt throughput: 38.8 tokens/s, Avg generation throughput: 30.4 tokens/s, Running: 11 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 34.7%, CPU KV cache usage: 0.0%.
We are seeing something similar with:
```
[... (request comes in)]
INFO 12-09 05:27:39 metrics.py:465] Prefix cache hit rate: GPU: 0.00%, CPU: 0.00%
INFO 12-09 05:27:49 metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 12-09 05:27:49 metrics.py:465] Prefix cache hit rate: GPU: 0.00%, CPU: 0.00%
INFO 12-09 05:27:59 metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 12-09 05:27:59 metrics.py:465] Prefix cache hit rate: GPU: 0.00%, CPU: 0.00%
INFO 12-09 05:28:09 metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 12-09 05:28:09 metrics.py:465] Prefix cache hit rate: GPU: 0.00%, CPU: 0.00%
ERROR 12-09 05:28:10 async_llm_engine.py:886] Engine iteration timed out. This should never happen!
ERROR 12-09 05:28:10 async_llm_engine.py:65] Engine background task failed
ERROR 12-09 05:28:10 async_llm_engine.py:65] Traceback (most recent call last):
ERROR 12-09 05:28:10 async_llm_engine.py:65] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 866, in run_engine_loop
ERROR 12-09 05:28:10 async_llm_engine.py:65] done, _ = await asyncio.wait(
ERROR 12-09 05:28:10 async_llm_engine.py:65] ^^^^^^^^^^^^^^^^^^^
ERROR 12-09 05:28:10 async_llm_engine.py:65] File "/usr/lib/python3.12/asyncio/tasks.py", line 464, in wait
ERROR 12-09 05:28:10 async_llm_engine.py:65] return await _wait(fs, timeout, return_when, loop)
ERROR 12-09 05:28:10 async_llm_engine.py:65] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-09 05:28:10 async_llm_engine.py:65] File "/usr/lib/python3.12/asyncio/tasks.py", line 550, in _wait
ERROR 12-09 05:28:10 async_llm_engine.py:65] await waiter
ERROR 12-09 05:28:10 async_llm_engine.py:65] asyncio.exceptions.CancelledError
ERROR 12-09 05:28:10 async_llm_engine.py:65]
ERROR 12-09 05:28:10 async_llm_engine.py:65] The above exception was the direct cause of the following exception:
ERROR 12-09 05:28:10 async_llm_engine.py:65]
ERROR 12-09 05:28:10 async_llm_engine.py:65] Traceback (most recent call last):
ERROR 12-09 05:28:10 async_llm_engine.py:65] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 55, in _log_task_completion
ERROR 12-09 05:28:10 async_llm_engine.py:65] return_value = task.result()
ERROR 12-09 05:28:10 async_llm_engine.py:65] ^^^^^^^^^^^^^
ERROR 12-09 05:28:10 async_llm_engine.py:65] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 865, in run_engine_loop
ERROR 12-09 05:28:10 async_llm_engine.py:65] async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
ERROR 12-09 05:28:10 async_llm_engine.py:65] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-09 05:28:10 async_llm_engine.py:65] File "/usr/lib/python3.12/asyncio/timeouts.py", line 115, in __aexit__
ERROR 12-09 05:28:10 async_llm_engine.py:65] raise TimeoutError from exc_val
ERROR 12-09 05:28:10 async_llm_engine.py:65] TimeoutError
Exception in callback functools.partial(<function _log_task_completion at 0x78faf7c59300>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x78facaeae180>>)
handle: <Handle functools.partial(<function _log_task_completion at 0x78faf7c59300>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x78facaeae180>>)>
Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 866, in run_engine_loop
done, _ = await asyncio.wait(
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/tasks.py", line 464, in wait
return await _wait(fs, timeout, return_when, loop)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/tasks.py", line 550, in _wait
await waiter
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 55, in _log_task_completion
return_value = task.result()
^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 865, in run_engine_loop
async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/timeouts.py", line 115, in __aexit__
raise TimeoutError from exc_val
TimeoutError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 67, in _log_task_completion
raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
INFO: 172.20.1.3:35712 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 866, in run_engine_loop
done, _ = await asyncio.wait(
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/tasks.py", line 464, in wait
return await _wait(fs, timeout, return_when, loop)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/tasks.py", line 550, in _wait
await waiter
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
[...]
```
Any updates on this? This is blocking us from using vLLM!
You can try using guided decoding backends other than [...].
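For example, something along these lines (the flag and field names below are from memory, so please double-check them against the vLLM version you are running):

```python
# Sketch: selecting a non-default guided decoding backend.
# Server side (assumed flag name -- check `vllm serve --help` for your version):
#
#   vllm serve Qwen2-32B-GPTQ-Int8 --guided-decoding-backend lm-format-enforcer
#
# Per-request override via the OpenAI-compatible API (assumed extra_body fields):
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen2-32B-GPTQ-Int8",  # placeholder model name
    messages=[{"role": "user", "content": "Return a JSON object with a single field `answer`."}],
    extra_body={
        "guided_json": {
            "type": "object",
            "properties": {"answer": {"type": "string"}},
            "required": ["answer"],
        },
        "guided_decoding_backend": "lm-format-enforcer",  # assumed per-request override
    },
)
print(resp.choices[0].message.content)
```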
Worked for us at first glance, and faster too. Will evaluate more thoroughly early next year. Thanks a lot!
Please try using the latest [...]. Hopefully this issue can be closed now, but please let me know if you still have issues with this feature!
Your current environment
The output of `python collect_env.py`
Model Input Dumps
No response
🐛 Describe the bug
Structured output inference can take a very long time, even with just a single request, ultimately leading to timeouts or crashes. During inference, GPU KV cache usage gradually climbs to 100% while the average generation throughput drops from 30 tokens/s to 20 tokens/s, eventually causing a timeout and forcing sequences to be swapped into the CPU KV cache. Even after one hour, there was no response to the structured output request sent earlier. I then sent additional requests, both normal and structured; the normal requests were answered, albeit slowly, while the structured requests received no response at all. Over the next several hours, more and more new requests became pending and sequences were swapped out, which eventually crashed the vLLM engine.
However, everything runs smoothly when only normal chat completion requests are sent, achieving an average generation throughput of 100 tokens/s or higher on dual Tesla V100 GPUs. The test model was Qwen2-32B-GPTQ-Int8.
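A rough reproduction sketch of the setup described above (the launch command and request are reconstructed for illustration and may not match the exact flags used):

```python
# Repro sketch: Qwen2-32B-GPTQ-Int8 served with the OpenAI-compatible API on two V100s.
# The launch command below is an assumption for illustration, not the exact one used:
#
#   python -m vllm.entrypoints.openai.api_server \
#       --model Qwen2-32B-GPTQ-Int8 --tensor-parallel-size 2
#
# A plain chat request sent this way completes quickly; the guided request below
# never returns, while GPU KV cache usage climbs toward 100%.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

PERSON_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "city": {"type": "string"},
        "hobbies": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["name", "age", "city", "hobbies"],
}

resp = client.chat.completions.create(
    model="Qwen2-32B-GPTQ-Int8",
    messages=[{"role": "user", "content": "Extract the person mentioned in the text as JSON."}],
    extra_body={"guided_json": PERSON_SCHEMA},  # the structured output request that hangs
)
print(resp.choices[0].message.content)
```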