When deploying the Qwen2.5-VL-7B model with vLLM 0.8.1, fp8 quantization does not work. How can this be resolved?

Deployment command:

vllm serve Qwen2.5-VL/Qwen2.5-VL-7B-Instruct --port 8083 --quantization fp8

Error output:
...... Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:03<00:00, 1.51it/s]
INFO 03-21 02:21:53 [loader.py:429] Loading weights took 3.47 seconds
INFO 03-21 02:21:53 [gpu_model_runner.py:1176] Model loading took 8.9031 GB and 3.891568 seconds
INFO 03-21 02:21:53 [gpu_model_runner.py:1421] Encoder cache will be initialized with a budget of 98304 tokens, and profiled with 1 video items of the maximum feature size.
ERROR 03-21 02:21:57 [core.py:340] EngineCore hit an exception: Traceback (most recent call last):
  File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 332, in run_engine_core
    engine_core = EngineCoreProc(*args, **kwargs)
  File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 287, in __init__
    super().__init__(vllm_config, executor_class, log_stats)
  File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 62, in __init__
    num_gpu_blocks, num_cpu_blocks = self._initialize_kv_caches(
  File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 121, in _initialize_kv_caches
    available_gpu_memory = self.model_executor.determine_available_memory()
  File "/opt/venv/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 66, in determine_available_memory
    output = self.collective_rpc("determine_available_memory")
  File "/opt/venv/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
    answer = run_method(self.driver_worker, method, args, kwargs)
  File "/opt/venv/lib/python3.12/site-packages/vllm/utils.py", line 2216, in run_method
    return func(*args, **kwargs)
  File "/opt/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 157, in determine_available_memory
    self.model_runner.profile_run()
  File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1452, in profile_run
    dummy_encoder_outputs = self.model.get_multimodal_embeddings(
  File "/opt/venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen2_5_vl.py", line 975, in get_multimodal_embeddings
    video_embeddings = self._process_video_input(video_input)
  File "/opt/venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen2_5_vl.py", line 931, in _process_video_input
    video_embeds = self.visual(pixel_values_videos, grid_thw=grid_thw)
  File "/opt/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen2_5_vl.py", line 659, in forward
    hidden_states = blk(
  File "/opt/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen2_5_vl.py", line 382, in forward
    x = x + self.mlp(self.norm2(x))
  File "/opt/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen2_5_vl.py", line 191, in forward
    x_gate, _ = self.gate_proj(x)
  File "/opt/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/venv/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 474, in forward
    output_parallel = self.quant_method.apply(self, input_, bias)
  File "/opt/venv/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/fp8.py", line 386, in apply
    return self.fp8_linear.apply(input=x,
  File "/opt/venv/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/utils/w8a8_utils.py", line 184, in apply
    output = ops.cutlass_scaled_mm(qinput,
  File "/opt/venv/lib/python3.12/site-packages/vllm/_custom_ops.py", line 523, in cutlass_scaled_mm
    assert (b.shape[0] % 16 == 0 and b.shape[1] % 16 == 0)
AssertionError
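The failing assertion in `cutlass_scaled_mm` requires both dimensions of the fp8 weight matrix to be multiples of 16, and the frame above it shows the crash happens in the vision tower's MLP (`gate_proj`), not in the language model. A plausible explanation is that the Qwen2.5-VL vision encoder uses layer sizes that are not multiples of 16 (the published 7B config lists a vision `intermediate_size` of 3420 and `hidden_size` of 1280, though those numbers are my assumption from the config, not taken from the log). A minimal sketch of the check:

```python
# Sketch of the shape requirement asserted in vllm/_custom_ops.py:cutlass_scaled_mm.
# The concrete layer sizes below are assumptions read from the Qwen2.5-VL-7B
# config files, not values printed in the error log.

def cutlass_shape_ok(out_features: int, in_features: int) -> bool:
    """Mirror of: assert (b.shape[0] % 16 == 0 and b.shape[1] % 16 == 0)."""
    return out_features % 16 == 0 and in_features % 16 == 0

# Language-model layers (hidden_size=3584, intermediate_size=18944 for the
# 7B text backbone) are multiples of 16, so fp8 GEMMs work there.
print(cutlass_shape_ok(18944, 3584))  # True

# Vision-tower gate_proj: an intermediate_size of 3420 is not a multiple of
# 16 (3420 % 16 == 12), so the CUTLASS fp8 kernel rejects the weight.
print(cutlass_shape_ok(3420, 1280))   # False
```

If this is the cause, any fp8 scheme that quantizes the vision tower's linear layers will hit the same assertion regardless of serving flags; excluding the vision encoder from quantization would be the direction to investigate.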