Qwen2-VL vLLM backend: infer_dataset reports "Engine iteration timed out. This should never happen" #2690

Open
heichang12138 opened this issue Dec 18, 2024 · 0 comments
Labels
bug Something isn't working

Describe the bug
What the bug is, and how to reproduce it, preferably with screenshots
[INFO:swift] request_config: RequestConfig(max_tokens=2048, temperature=0.0, top_k=None, top_p=None, repetition_penalty=None, num_beams=1, stop=[], seed=None, stream=False, logprobs=False, top_logprobs=None, n=1, best_of=None, presence_penalty=0.0, frequency_penalty=0.0, length_penalty=1.0)
Map: 100%|██████████████████████████████████████████████████████████████████████████████████████| 49443/49443 [00:05<00:00, 8797.95 examples/s]
[INFO:swift] val_dataset: Dataset({
features: ['messages', 'label', 'images'],
num_rows: 49443
})
0%| | 0/49443 [00:00<?, ?it/s][INFO:swift] Using environment variable IMAGE_FACTOR, Setting image_factor: 8.
[INFO:swift] Setting resized_height: None. You can adjust this hyperparameter through the environment variable: RESIZED_HEIGHT.
[INFO:swift] Setting resized_width: None. You can adjust this hyperparameter through the environment variable: RESIZED_WIDTH.
[INFO:swift] Setting min_pixels: 3136. You can adjust this hyperparameter through the environment variable: MIN_PIXELS.
[INFO:swift] Using environment variable MAX_PIXELS, Setting max_pixels: 65536.
INFO 12-18 17:08:03 preprocess.py:215] Your model uses the legacy input pipeline instead of the new multi-modal processor. Please note that the legacy pipeline will be removed in a future release. For more details, see: vllm-project/vllm#10114
/home/caojunhao/env/envs/swift/lib/python3.10/site-packages/PIL/Image.py:1056: UserWarning: Palette images with Transparency expressed in bytes should be converted to RGBA images
warnings.warn(
ERROR 12-18 17:09:03 async_llm_engine.py:886] Engine iteration timed out. This should never happen!
Exception in callback VllmEngine.patch_remove_log..new_log_task_completion(error_callback=>)(<Task finishe...imeoutError()>) at /media/cfs/caojunhao/workspace/ms-swift/swift/llm/infer/infer_engine/vllm_engine.py:397
handle: <Handle VllmEngine.patch_remove_log..new_log_task_completion(error_callback=>)(<Task finishe...imeoutError()>) at /media/cfs/caojunhao/workspace/ms-swift/swift/llm/infer/infer_engine/vllm_engine.py:397>
Traceback (most recent call last):
File "/home/caojunhao/env/envs/swift/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 866, in run_engine_loop
done, _ = await asyncio.wait(
File "/home/caojunhao/env/envs/swift/lib/python3.10/asyncio/tasks.py", line 384, in wait
return await _wait(fs, timeout, return_when, loop)
File "/home/caojunhao/env/envs/swift/lib/python3.10/asyncio/tasks.py", line 491, in _wait
await waiter
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/caojunhao/env/envs/swift/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/media/cfs/caojunhao/workspace/ms-swift/swift/llm/infer/infer_engine/vllm_engine.py", line 399, in new_log_task_completion
return_value = task.result()
File "/home/caojunhao/env/envs/swift/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 865, in run_engine_loop
async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
File "/home/caojunhao/env/envs/swift/lib/python3.10/site-packages/vllm/engine/async_timeout.py", line 95, in __aexit__
self._do_exit(exc_type)
File "/home/caojunhao/env/envs/swift/lib/python3.10/site-packages/vllm/engine/async_timeout.py", line 178, in _do_exit
raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError
Exception in thread Thread-4 (<lambda>):
Traceback (most recent call last):
File "/home/caojunhao/env/envs/swift/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 866, in run_engine_loop
done, _ = await asyncio.wait(
File "/home/caojunhao/env/envs/swift/lib/python3.10/asyncio/tasks.py", line 384, in wait
return await _wait(fs, timeout, return_when, loop)
File "/home/caojunhao/env/envs/swift/lib/python3.10/asyncio/tasks.py", line 491, in _wait
await waiter
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/caojunhao/env/envs/swift/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/media/cfs/caojunhao/workspace/ms-swift/swift/llm/infer/infer_engine/vllm_engine.py", line 399, in new_log_task_completion
return_value = task.result()
File "/home/caojunhao/env/envs/swift/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 865, in run_engine_loop
async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
File "/home/caojunhao/env/envs/swift/lib/python3.10/site-packages/vllm/engine/async_timeout.py", line 95, in __aexit__
self._do_exit(exc_type)
File "/home/caojunhao/env/envs/swift/lib/python3.10/site-packages/vllm/engine/async_timeout.py", line 178, in _do_exit
raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/caojunhao/env/envs/swift/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/home/caojunhao/env/envs/swift/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/media/cfs/caojunhao/workspace/ms-swift/swift/llm/infer/infer_engine/infer_engine.py", line 70, in <lambda>
thread = Thread(target=lambda: asyncio.run(_batch_run(new_tasks)))
File "/home/caojunhao/env/envs/swift/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/home/caojunhao/env/envs/swift/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()
File "/media/cfs/caojunhao/workspace/ms-swift/swift/llm/infer/infer_engine/infer_engine.py", line 65, in _batch_run
return await asyncio.gather(*tasks)
File "/media/cfs/caojunhao/workspace/ms-swift/swift/llm/infer/infer_engine/infer_engine.py", line 61, in _run_infer
queue.put((i, await task))
File "/media/cfs/caojunhao/workspace/ms-swift/swift/llm/infer/infer_engine/vllm_engine.py", line 389, in infer_async
return await self._infer_full_async(**kwargs)
File "/media/cfs/caojunhao/workspace/ms-swift/swift/llm/infer/infer_engine/vllm_engine.py", line 320, in _infer_full_async
async for result in result_generator:
File "/home/caojunhao/env/envs/swift/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 1051, in generate
async for output in await self.add_request(
File "/home/caojunhao/env/envs/swift/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 948, in add_request
self.start_background_loop()
File "/home/caojunhao/env/envs/swift/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 745, in start_background_loop
raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop has errored already.
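
One detail in the log above: the last INFO line is stamped 17:08:03 and the error 17:09:03, exactly 60 seconds apart. That matches what I believe is the default of vLLM's per-iteration timeout (the ENGINE_ITERATION_TIMEOUT_S seen in the traceback), which as far as I know is read from the environment variable VLLM_ENGINE_ITERATION_TIMEOUT_S when vLLM is imported. Below is a minimal sketch of raising it. The helper name and the 600-second value are illustrative assumptions, and a longer timeout may only hide the stall rather than fix its cause.

```python
import os


def set_engine_iteration_timeout(seconds: int) -> str:
    """Set vLLM's per-iteration timeout via its environment variable.

    Assumption: this must run before vllm is imported, because the value
    appears to be read once at import time.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    os.environ["VLLM_ENGINE_ITERATION_TIMEOUT_S"] = str(seconds)
    return os.environ["VLLM_ENGINE_ITERATION_TIMEOUT_S"]


# Illustrative value only; choose one that fits your workload.
set_engine_iteration_timeout(600)
```

The same effect should be achievable by exporting the variable in the shell before launching the inference command.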

Your hardware and system info
Write your system info like CUDA version/system/GPU/torch version here
vllm 0.6.4.post1
torch 2.5.1
CUDA Version: 12.2
NVIDIA GeForce RTX 4090 D

Additional context
Add any other context about the problem here
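I debugged this and found that the cause is most likely the amount of data: with this many samples, the asynchronous execution times out.
I modified swift/llm/infer/infer.py as shown in the screenshot below. It is ugly, but it runs. I look forward to a proper optimization from the maintainers.
[image: screenshot of the modified swift/llm/infer/infer.py]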
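
The screenshot is not reproduced in text, so the following is not the exact change. It is only a rough sketch of one way to keep the async engine from receiving all 49443 requests at once: split the dataset into fixed-size chunks and run them one after another. `infer_batch` is a hypothetical callable standing in for whatever call submits a list of requests to the engine and returns their results, and the chunk size of 64 is an arbitrary assumption.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def iter_chunks(items: Sequence[T], chunk_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items`, each at most `chunk_size` long."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def infer_in_chunks(
    items: Sequence[T],
    infer_batch: Callable[[Sequence[T]], List[R]],  # hypothetical engine wrapper
    chunk_size: int = 64,
) -> List[R]:
    """Run `infer_batch` on one chunk at a time and keep results in order."""
    results: List[R] = []
    for chunk in iter_chunks(items, chunk_size):
        # Only `chunk_size` requests are in flight per call.
        results.extend(infer_batch(chunk))
    return results


if __name__ == "__main__":
    # Stand-in for the real engine call, used only to show the control flow.
    def fake_infer(batch: Sequence[int]) -> List[int]:
        return [x * 2 for x in batch]

    print(infer_in_chunks(list(range(10)), fake_infer, chunk_size=4))
```

The trade-off is throughput: smaller chunks leave the GPU idle between batches, so the chunk size would need tuning for the hardware.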

Jintao-Huang added the bug label on Dec 19, 2024