
Qwen2VL FP8_DYNAMIC Failed #951

Open
LugerW-A opened this issue Dec 4, 2024 · 8 comments
Comments

@LugerW-A

LugerW-A commented Dec 4, 2024

When I use the example code to quantize Qwen2-VL, it runs successfully, but the number of safetensors files in the generated model stays the same and the size of each file does not change. The model also fails to load with vLLM.

vllm:
[rank0]: Traceback (most recent call last):
[rank0]: File "/mnt/bn/vllmlfdata/deploymultitokens_place_TEST/handlerfp8.py", line 125, in
[rank0]: handler = EndpointHandler()
[rank0]: ^^^^^^^^^^^^^^^^^
[rank0]: File "/mnt/bn/vllmlfdata/deploymultitokens_place_TEST/handlerfp8.py", line 42, in init
[rank0]: self.llm = LLM(MODEL_PATH,
[rank0]: ^^^^^^^^^^^^^^^
[rank0]: File "/home/tiger/.local/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 178, in init
[rank0]: self.llm_engine = LLMEngine.from_engine_args(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 550, in from_engine_args
[rank0]: engine = cls(
[rank0]: ^^^^
[rank0]: File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 317, in init
[rank0]: self.model_executor = executor_class(
[rank0]: ^^^^^^^^^^^^^^^
[rank0]: File "/home/tiger/.local/lib/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 26, in init
[rank0]: super().__init__(*args, **kwargs)
[rank0]: File "/home/tiger/.local/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 47, in init
[rank0]: self._init_executor()
[rank0]: File "/home/tiger/.local/lib/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 125, in _init_executor
[rank0]: self._run_workers("load_model",
[rank0]: File "/home/tiger/.local/lib/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 199, in _run_workers
[rank0]: driver_worker_output = driver_worker_method(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/worker.py", line 183, in load_model
[rank0]: self.model_runner.load_model()
[rank0]: File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 999, in load_model
[rank0]: self.model = get_model(model_config=self.model_config,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/model_loader/init.py", line 19, in get_model
[rank0]: return loader.load_model(model_config=model_config,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/model_loader/loader.py", line 361, in load_model
[rank0]: model.load_weights(
[rank0]: File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/qwen2_vl.py", line 1061, in load_weights
[rank0]: param = params_dict[name]
[rank0]: ~~~~~~~~~~~^^^^^^
[rank0]: KeyError: 'model.layers.0.mlp.gate_up_proj.weight_scale'
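
For context, the FP8_DYNAMIC flow in the examples looks roughly like the sketch below (a minimal version; the model ID, ignore patterns, and output path here are illustrative, not copied verbatim from the example script):

```python
# Minimal sketch of an FP8_DYNAMIC one-shot quantization for Qwen2-VL with
# llm-compressor ~0.3.0. Model ID, ignore patterns, and output path are
# illustrative; adjust them to match the actual example.
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "Qwen/Qwen2-VL-7B-Instruct"

model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# FP8_DYNAMIC needs no calibration data: weights get static FP8 scales,
# activations are quantized dynamically at runtime.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["re:.*lm_head", "re:visual.*"],  # keep LM head and vision tower in higher precision
)

oneshot(model=model, recipe=recipe)

# save_compressed=True is what writes the compressed checkpoint
# (smaller safetensors shards plus the FP8 weight_scale tensors).
SAVE_DIR = MODEL_ID.split("/")[-1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)
```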

LugerW-A added the bug (Something isn't working) label Dec 4, 2024
@PeterWang1986

PeterWang1986 commented Dec 4, 2024

we met the same issue for Meta-Llama-3-8B-Instruct / Meta-Llama-3.1-8B-Instruct
GPU: A100-40G, llm-compressor: 0.3.0, vllm: 0.6.4.post1

@dsikka
Collaborator

dsikka commented Dec 4, 2024

Hi @LugerW-A it looks like the model isn't compressing after being quantized. Can you share what version of llm-compressor and compressed-tensors you're using?
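
For example, one quick way to print the installed versions (a small helper, not from the examples):

```python
# Print the installed versions of the relevant packages.
from importlib.metadata import version

for pkg in ("llmcompressor", "compressed-tensors", "transformers", "vllm"):
    print(pkg, version(pkg))
```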

dsikka self-assigned this Dec 4, 2024
@LugerW-A
Author

LugerW-A commented Dec 5, 2024

Hi @LugerW-A it looks like the model isn't compressing after being quantized. Can you share what version of llm-compressor and compressed-tensors you're using?

Thank you.
GPU: L20; llm-compressor: 0.3.0; compressed-tensors: 0.8.0; transformers: 4.46.3; CUDA: 12.4; Python: 3.11.2
I just used the same code as in the Qwen2-VL example.
Setting device_map="cpu" does not solve the problem.

@LugerW-A
Author

LugerW-A commented Dec 5, 2024

we met the same issue for Meta-Llama-3-8B-Instruct / Meta-Llama-3.1-8B-Instruct GPU: A100-40G, llm-compressor: 0.3.0, vllm: 0.6.4.post1

Hi, did you solve it?

@hestabit-dev

Did anyone solve this issue?

kylesayrs assigned kylesayrs and unassigned dsikka Dec 5, 2024
@kylesayrs
Collaborator

Hi all, you can quantize vision models more reliably using the kylesayrs/gptq-partition branch and running python3 examples/multimodal_vision/qwen.py. Please modify the scheme to reflect FP8 quantization. This pathway is untested for FP8, but it may prove to be more reliable.
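
For example, the recipe change would look roughly like this (a sketch only; the exact recipe and ignore patterns in that script may differ):

```python
# Hypothetical edit to the recipe in examples/multimodal_vision/qwen.py:
# swap the default scheme for FP8 dynamic quantization.
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["re:.*lm_head", "re:visual.*"],  # illustrative ignore patterns
)
```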

@PeterWang1986

we met the same issue for Meta-Llama-3-8B-Instruct / Meta-Llama-3.1-8B-Instruct GPU: A100-40G, llm-compressor: 0.3.0, vllm: 0.6.4.post1

Hi, did you solve it?

No, currently we load it with vLLM using --quantization=fp8 as a workaround.
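
That is, something along these lines (a sketch; the model path is illustrative):

```python
# Sketch of the workaround: point vLLM at the original (unquantized) checkpoint
# and let it quantize the weights to FP8 at load time, with dynamic activation
# scales. Equivalent to passing --quantization fp8 on the server command line.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # illustrative model path
    quantization="fp8",
)
```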

@LugerW-A
Author

we met the same issue for Meta-Llama-3-8B-Instruct / Meta-Llama-3.1-8B-Instruct GPU: A100-40G, llm-compressor: 0.3.0, vllm: 0.6.4.post1

Hi, did you solve it?

No, currently we load it with vLLM using --quantization=fp8 as a workaround.

Hi. Is there any difference between these two? How does the performance compare?
