
Qwen2VL FP8_DYNAMIC Failed #951

Open
LugerW-A opened this issue Dec 4, 2024 · 8 comments
Comments

@LugerW-A

LugerW-A commented Dec 4, 2024

When I use the example code to quantize Qwen2-VL, it runs successfully, but the number of safetensors files in the generated model stays the same and the size of each file does not change. The model also fails to load with vLLM.

vllm:
[rank0]: Traceback (most recent call last):
[rank0]: File "/mnt/bn/vllmlfdata/deploymultitokens_place_TEST/handlerfp8.py", line 125, in
[rank0]: handler = EndpointHandler()
[rank0]: ^^^^^^^^^^^^^^^^^
[rank0]: File "/mnt/bn/vllmlfdata/deploymultitokens_place_TEST/handlerfp8.py", line 42, in init
[rank0]: self.llm = LLM(MODEL_PATH,
[rank0]: ^^^^^^^^^^^^^^^
[rank0]: File "/home/tiger/.local/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 178, in init
[rank0]: self.llm_engine = LLMEngine.from_engine_args(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 550, in from_engine_args
[rank0]: engine = cls(
[rank0]: ^^^^
[rank0]: File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 317, in init
[rank0]: self.model_executor = executor_class(
[rank0]: ^^^^^^^^^^^^^^^
[rank0]: File "/home/tiger/.local/lib/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 26, in init
[rank0]: super().__init__(*args, **kwargs)
[rank0]: File "/home/tiger/.local/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 47, in init
[rank0]: self._init_executor()
[rank0]: File "/home/tiger/.local/lib/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 125, in _init_executor
[rank0]: self._run_workers("load_model",
[rank0]: File "/home/tiger/.local/lib/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 199, in _run_workers
[rank0]: driver_worker_output = driver_worker_method(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/worker.py", line 183, in load_model
[rank0]: self.model_runner.load_model()
[rank0]: File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 999, in load_model
[rank0]: self.model = get_model(model_config=self.model_config,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/model_loader/init.py", line 19, in get_model
[rank0]: return loader.load_model(model_config=model_config,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/model_loader/loader.py", line 361, in load_model
[rank0]: model.load_weights(
[rank0]: File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/qwen2_vl.py", line 1061, in load_weights
[rank0]: param = params_dict[name]
[rank0]: ~~~~~~~~~~~^^^^^^
[rank0]: KeyError: 'model.layers.0.mlp.gate_up_proj.weight_scale'
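
For context, the FP8_DYNAMIC flow in the examples looks roughly like the sketch below (a minimal version; the model ID, ignore patterns, and output path here are illustrative, not copied verbatim from the example script):

```python
# Minimal sketch of an FP8_DYNAMIC one-shot quantization for Qwen2-VL with
# llm-compressor ~0.3.0. Model ID, ignore patterns, and output path are
# illustrative; adjust them to match the actual example.
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "Qwen/Qwen2-VL-7B-Instruct"

model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# FP8_DYNAMIC needs no calibration data: weights get static FP8 scales,
# activations are quantized dynamically at runtime.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["re:.*lm_head", "re:visual.*"],  # keep LM head and vision tower in higher precision
)

oneshot(model=model, recipe=recipe)

# save_compressed=True is what writes the compressed checkpoint
# (smaller safetensors shards plus the FP8 weight_scale tensors).
SAVE_DIR = MODEL_ID.split("/")[-1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)
```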

LugerW-A added the bug (Something isn't working) label Dec 4, 2024
@PeterWang1986

PeterWang1986 commented Dec 4, 2024

we met the same issue for Meta-Llama-3-8B-Instruct / Meta-Llama-3.1-8B-Instruct
GPU: A100-40G, llm-compressor: 0.3.0, vllm: 0.6.4.post1

@dsikka
Collaborator

dsikka commented Dec 4, 2024

Hi @LugerW-A it looks like the model isn't compressing after being quantized. Can you share what version of llm-compressor and compressed-tensors you're using?
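
For example, one quick way to print the installed versions (a small helper, not from the examples):

```python
# Print the installed versions of the relevant packages.
from importlib.metadata import version

for pkg in ("llmcompressor", "compressed-tensors", "transformers", "vllm"):
    print(pkg, version(pkg))
```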

dsikka self-assigned this Dec 4, 2024
@LugerW-A
Author

LugerW-A commented Dec 5, 2024

Hi @LugerW-A it looks like the model isn't compressing after being quantized. Can you share what version of llm-compressor and compressed-tensors you're using?

Thank you.
GPU: L20; llm-compressor: 0.3.0; compressed-tensors: 0.8.0; transformers: 4.46.3; CUDA: 12.4; Python: 3.11.2
I just used the same code as in the Qwen2-VL example.
Setting device_map="cpu" does not solve the problem.

@LugerW-A
Author

LugerW-A commented Dec 5, 2024

we met the same issue for Meta-Llama-3-8B-Instruct / Meta-Llama-3.1-8B-Instruct GPU: A100-40G, llm-compressor: 0.3.0, vllm: 0.6.4.post1

Hi, did you solve it?

@hestabit-dev

Did anyone solve this issue?

kylesayrs assigned kylesayrs and unassigned dsikka Dec 5, 2024
@kylesayrs
Collaborator

Hi all, you can quantize vision models more reliably using the kylesayrs/gptq-partition branch and running python3 examples/multimodal_vision/qwen.py. Please modify the scheme to reflect FP8 quantization. This pathway is untested for FP8, but it may prove to be more reliable.
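
For example, the recipe change would look roughly like this (a sketch only; the exact recipe and ignore patterns in that script may differ):

```python
# Hypothetical edit to the recipe in examples/multimodal_vision/qwen.py:
# swap the default scheme for FP8 dynamic quantization.
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["re:.*lm_head", "re:visual.*"],  # illustrative ignore patterns
)
```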

@PeterWang1986

we met the same issue for Meta-Llama-3-8B-Instruct / Meta-Llama-3.1-8B-Instruct GPU: A100-40G, llm-compressor: 0.3.0, vllm: 0.6.4.post1

Hi, did you solve it?

No, currently we load it with vLLM using --quantization=fp8 as a workaround.
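
That is, something along these lines (a sketch; the model path is illustrative):

```python
# Sketch of the workaround: point vLLM at the original (unquantized) checkpoint
# and let it quantize the weights to FP8 at load time, with dynamic activation
# scales. Equivalent to passing --quantization fp8 on the server command line.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # illustrative model path
    quantization="fp8",
)
```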

@LugerW-A
Author

we met the same issue for Meta-Llama-3-8B-Instruct / Meta-Llama-3.1-8B-Instruct GPU: A100-40G, llm-compressor: 0.3.0, vllm: 0.6.4.post1

Hi, did you solve it?

No, currently we load it with vLLM using --quantization=fp8 as a workaround.

Hi. Is there any difference between these two? How does the performance compare?
