[Kernel] Add ExLlamaV2 Weight Quantization Support #11348
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
Tried loading a Qwen exl2 model and ran into an uninitialized-parameter error:
```
vllm serve cgus/Qwen2.5-0.5B-Instruct-exl2 --port 9000 --dtype float16
```

```
...
Traceback (most recent call last):
  File "/home/mgoin/.local/share/uv/python/cpython-3.12.4-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/mgoin/.local/share/uv/python/cpython-3.12.4-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/mgoin/code/alpin-vllm/vllm/engine/multiprocessing/engine.py", line 368, in run_mp_engine
    raise e
  File "/home/mgoin/code/alpin-vllm/vllm/engine/multiprocessing/engine.py", line 357, in run_mp_engine
    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/alpin-vllm/vllm/engine/multiprocessing/engine.py", line 119, in from_engine_args
    return cls(ipc_path=ipc_path,
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/alpin-vllm/vllm/engine/multiprocessing/engine.py", line 71, in __init__
    self.engine = LLMEngine(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/alpin-vllm/vllm/engine/llm_engine.py", line 273, in __init__
    self.model_executor = executor_class(vllm_config=vllm_config, )
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/alpin-vllm/vllm/executor/executor_base.py", line 36, in __init__
    self._init_executor()
  File "/home/mgoin/code/alpin-vllm/vllm/executor/gpu_executor.py", line 35, in _init_executor
    self.driver_worker.load_model()
  File "/home/mgoin/code/alpin-vllm/vllm/worker/worker.py", line 155, in load_model
    self.model_runner.load_model()
  File "/home/mgoin/code/alpin-vllm/vllm/worker/model_runner.py", line 1094, in load_model
    self.model = get_model(vllm_config=self.vllm_config)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/alpin-vllm/vllm/model_executor/model_loader/__init__.py", line 12, in get_model
    return loader.load_model(vllm_config=vllm_config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/alpin-vllm/vllm/model_executor/model_loader/loader.py", line 364, in load_model
    loaded_weights = model.load_weights(
                     ^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/alpin-vllm/vllm/model_executor/models/qwen2.py", line 506, in load_weights
    return loader.load_weights(weights)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/alpin-vllm/vllm/model_executor/models/utils.py", line 237, in load_weights
    autoloaded_weights = set(self._load_module("", self.module, weights))
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/alpin-vllm/vllm/model_executor/models/utils.py", line 198, in _load_module
    yield from self._load_module(prefix,
  File "/home/mgoin/code/alpin-vllm/vllm/model_executor/models/utils.py", line 175, in _load_module
    loaded_params = module_load_weights(weights)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/alpin-vllm/vllm/model_executor/models/qwen2.py", line 396, in load_weights
    weight_loader(param, loaded_weight)
  File "/home/mgoin/code/alpin-vllm/vllm/model_executor/model_loader/weight_utils.py", line 531, in default_weight_loader
    if param.numel() == 1 and loaded_weight.numel() == 1:
       ^^^^^^^^^^^^^
  File "/home/mgoin/venvs/alpin/lib/python3.12/site-packages/torch/nn/parameter.py", line 168, in __torch_function__
    raise ValueError(
ValueError: Attempted to use an uninitialized parameter in <method 'numel' of 'torch._C.TensorBase' objects>. This error happens when you are using a `LazyModule` or explicitly manipulating `torch.nn.parameter.UninitializedParameter` objects. When using LazyModules Call `forward` with a dummy batch to initialize the parameters before calling torch functions
```
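The root cause is visible in the last frames: `default_weight_loader` calls `param.numel()` on a parameter that was created as a `torch.nn.parameter.UninitializedParameter` and never materialized. A minimal sketch of the failure mode in plain PyTorch, independent of vLLM:

```python
from torch.nn.parameter import UninitializedParameter

# An uninitialized parameter has no shape or storage yet, so any torch
# function that inspects it (numel, shape, ...) raises ValueError.
param = UninitializedParameter()

try:
    param.numel()  # the same call default_weight_loader makes first
except ValueError as e:
    print(e)  # "Attempted to use an uninitialized parameter in ..."
```

So the exl2 path presumably needs to materialize these parameters (or special-case them in the loader) before the generic weight loader touches them.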
```diff
@@ -26,6 +26,7 @@
     "experts_int8",
     "neuron_quant",
     "ipex",
+    "exllamav2"
```
Needs to be `exl2` to match the tag in `quantization_config`; reference: https://huggingface.co/cgus/Qwen2.5-0.5B-Instruct-exl2/blob/main/config.json
"exllamav2" | |
"exl2" |
```diff
@@ -79,6 +81,7 @@ def get_quantization_config(quantization: str) -> Type[QuantizationConfig]:
     "experts_int8": ExpertsInt8Config,
     "neuron_quant": NeuronQuantConfig,
     "ipex": IPEXConfig,
+    "exllamav2": Exl2Config,
```
"exllamav2": Exl2Config, | |
"exl2": Exl2Config, |
Work in Progress.
FIX #3203