Qwen2VL FP8_DYNAMIC Failed #951
We hit the same issue with Meta-Llama-3-8B-Instruct / Meta-Llama-3.1-8B-Instruct.
Hi @LugerW-A, it looks like the model isn't being compressed after quantization. Can you share which versions of llm-compressor and compressed-tensors you're using?
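For anyone else posting here, a quick way to print the installed versions (a minimal sketch, assuming the packages were installed via pip under their PyPI names):

import importlib.metadata

# PyPI package names; llm-compressor is published as "llmcompressor".
for pkg in ("llmcompressor", "compressed-tensors", "vllm"):
    try:
        print(pkg, importlib.metadata.version(pkg))
    except importlib.metadata.PackageNotFoundError:
        print(pkg, "not installed")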
Thank you.
Hi, did you solve it?
Has anyone solved this issue?
Hi all, you can quantize vision models reliably using the
No; as a workaround we currently load the original (unquantized) model in vLLM with --quantization=fp8.
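For context, that workaround is vLLM's built-in on-the-fly FP8 quantization; a minimal sketch, assuming the stock Qwen2-VL checkpoint (MODEL_PATH here is a placeholder):

from vllm import LLM

# Placeholder for the original, unquantized checkpoint.
MODEL_PATH = "Qwen/Qwen2-VL-7B-Instruct"

# quantization="fp8" makes vLLM quantize the weights to FP8 at load
# time, so no pre-quantized safetensors are needed on disk.
llm = LLM(MODEL_PATH, quantization="fp8")

The trade-off is that the full-precision weights still have to be stored and quantized again on every startup, rather than loading an already-compressed checkpoint.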
Hi. Is there any difference between these two approaches? How is the performance?
When I use the example code to quantize Qwen2VL, the script runs successfully, but the number of safetensors files in the generated model stays the same and the size of each file does not change. The model also fails to load in vLLM.
vLLM traceback:
[rank0]: Traceback (most recent call last):
[rank0]:   File "/mnt/bn/vllmlfdata/deploymultitokens_place_TEST/handlerfp8.py", line 125, in <module>
[rank0]:     handler = EndpointHandler()
[rank0]:               ^^^^^^^^^^^^^^^^^
[rank0]:   File "/mnt/bn/vllmlfdata/deploymultitokens_place_TEST/handlerfp8.py", line 42, in __init__
[rank0]:     self.llm = LLM(MODEL_PATH,
[rank0]:                ^^^^^^^^^^^^^^^
[rank0]:   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 178, in __init__
[rank0]:     self.llm_engine = LLMEngine.from_engine_args(
[rank0]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 550, in from_engine_args
[rank0]:     engine = cls(
[rank0]:              ^^^^
[rank0]:   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 317, in __init__
[rank0]:     self.model_executor = executor_class(
[rank0]:                           ^^^^^^^^^^^^^^^
[rank0]:   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 26, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 47, in __init__
[rank0]:     self._init_executor()
[rank0]:   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 125, in _init_executor
[rank0]:     self._run_workers("load_model",
[rank0]:   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 199, in _run_workers
[rank0]:     driver_worker_output = driver_worker_method(*args, **kwargs)
[rank0]:                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/worker.py", line 183, in load_model
[rank0]:     self.model_runner.load_model()
[rank0]:   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 999, in load_model
[rank0]:     self.model = get_model(model_config=self.model_config,
[rank0]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
[rank0]:     return loader.load_model(model_config=model_config,
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/model_loader/loader.py", line 361, in load_model
[rank0]:     model.load_weights(
[rank0]:   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/qwen2_vl.py", line 1061, in load_weights
[rank0]:     param = params_dict[name]
[rank0]:             ~~~~~~~~~~~^^^^^^
[rank0]: KeyError: 'model.layers.0.mlp.gate_up_proj.weight_scale'
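For reference, the quantization script followed the llm-compressor FP8_DYNAMIC example; a minimal sketch of that flow (MODEL_ID and SAVE_DIR are placeholders, the ignore patterns for the vision tower are an assumption, and the oneshot import path varies across llm-compressor versions):

from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot  # newer versions: from llmcompressor import oneshot

MODEL_ID = "Qwen/Qwen2-VL-7B-Instruct"          # placeholder
SAVE_DIR = "Qwen2-VL-7B-Instruct-FP8-Dynamic"   # placeholder

model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# FP8_DYNAMIC: per-channel FP8 weight scales computed offline, per-token
# activation scales computed at runtime, so no calibration data is needed.
# Skip the LM head and the vision tower (the regex names are assumptions).
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["re:.*lm_head", "re:visual.*"],
)

oneshot(model=model, recipe=recipe)

# save_compressed=True is what writes compressed FP8 safetensors; if the
# saved checkpoint is the same size as the bf16 original, the compression
# step did not actually run, which matches the symptom reported above.
model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)

The KeyError itself is consistent with that symptom: vLLM reads the FP8 quantization config and expects weight_scale parameters in the checkpoint, but the saved safetensors still contain only the uncompressed weights.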