
[Kernel] Add ExLlamaV2 Weight Quantization Support #11348

Draft

AlpinDale wants to merge 4 commits into main
Conversation

AlpinDale (Contributor) commented Dec 19, 2024

Work in Progress.

FIX #3203


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run the other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

mgoin (Member) left a comment


Tried loading a Qwen exl2 model and ran into issues with uninitialized parameter usage:

vllm serve cgus/Qwen2.5-0.5B-Instruct-exl2 --port 9000 --dtype float16
...
Traceback (most recent call last):
  File "/home/mgoin/.local/share/uv/python/cpython-3.12.4-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/mgoin/.local/share/uv/python/cpython-3.12.4-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/mgoin/code/alpin-vllm/vllm/engine/multiprocessing/engine.py", line 368, in run_mp_engine
    raise e
  File "/home/mgoin/code/alpin-vllm/vllm/engine/multiprocessing/engine.py", line 357, in run_mp_engine
    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/alpin-vllm/vllm/engine/multiprocessing/engine.py", line 119, in from_engine_args
    return cls(ipc_path=ipc_path,
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/alpin-vllm/vllm/engine/multiprocessing/engine.py", line 71, in __init__
    self.engine = LLMEngine(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/alpin-vllm/vllm/engine/llm_engine.py", line 273, in __init__
    self.model_executor = executor_class(vllm_config=vllm_config, )
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/alpin-vllm/vllm/executor/executor_base.py", line 36, in __init__
    self._init_executor()
  File "/home/mgoin/code/alpin-vllm/vllm/executor/gpu_executor.py", line 35, in _init_executor
    self.driver_worker.load_model()
  File "/home/mgoin/code/alpin-vllm/vllm/worker/worker.py", line 155, in load_model
    self.model_runner.load_model()
  File "/home/mgoin/code/alpin-vllm/vllm/worker/model_runner.py", line 1094, in load_model
    self.model = get_model(vllm_config=self.vllm_config)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/alpin-vllm/vllm/model_executor/model_loader/__init__.py", line 12, in get_model
    return loader.load_model(vllm_config=vllm_config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/alpin-vllm/vllm/model_executor/model_loader/loader.py", line 364, in load_model
    loaded_weights = model.load_weights(
                     ^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/alpin-vllm/vllm/model_executor/models/qwen2.py", line 506, in load_weights
    return loader.load_weights(weights)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/alpin-vllm/vllm/model_executor/models/utils.py", line 237, in load_weights
    autoloaded_weights = set(self._load_module("", self.module, weights))
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/alpin-vllm/vllm/model_executor/models/utils.py", line 198, in _load_module
    yield from self._load_module(prefix,
  File "/home/mgoin/code/alpin-vllm/vllm/model_executor/models/utils.py", line 175, in _load_module
    loaded_params = module_load_weights(weights)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/alpin-vllm/vllm/model_executor/models/qwen2.py", line 396, in load_weights
    weight_loader(param, loaded_weight)
  File "/home/mgoin/code/alpin-vllm/vllm/model_executor/model_loader/weight_utils.py", line 531, in default_weight_loader
    if param.numel() == 1 and loaded_weight.numel() == 1:
       ^^^^^^^^^^^^^
  File "/home/mgoin/venvs/alpin/lib/python3.12/site-packages/torch/nn/parameter.py", line 168, in __torch_function__
    raise ValueError(
ValueError: Attempted to use an uninitialized parameter in <method 'numel' of 'torch._C.TensorBase' objects>. This error happens when you are using a `LazyModule` or explicitly manipulating `torch.nn.parameter.UninitializedParameter` objects. When using LazyModules Call `forward` with a dummy batch to initialize the parameters before calling torch functions
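
The traceback above comes from default_weight_loader calling param.numel() on a parameter that is still a torch.nn.parameter.UninitializedParameter: exl2 tensor shapes are only known once the checkpoint is read, so the quantized layers need a weight loader that materializes the parameter first. A minimal sketch of that idea follows; the name exl2_weight_loader and the overall approach are assumptions for illustration, not the code in this PR.

# Hedged sketch only: one way a loader could avoid the UninitializedParameter
# error above. exl2_weight_loader is a hypothetical name, not part of this PR.
import torch
from torch.nn.parameter import UninitializedParameter


def exl2_weight_loader(param: torch.nn.Parameter,
                       loaded_weight: torch.Tensor) -> None:
    """Copy a checkpoint tensor into param, materializing it first when its
    shape is only known from the checkpoint (as with exl2 tensors)."""
    if isinstance(param, UninitializedParameter):
        # Allocate real storage using the shape/dtype read from the checkpoint.
        param.materialize(loaded_weight.shape, dtype=loaded_weight.dtype)
    assert param.shape == loaded_weight.shape, (
        f"shape mismatch: {param.shape} vs {loaded_weight.shape}")
    param.data.copy_(loaded_weight)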

@@ -26,6 +26,7 @@
"experts_int8",
"neuron_quant",
"ipex",
"exllamav2"
mgoin (Member):

Needs to be exl2 to match the tag in quantization_config - reference https://huggingface.co/cgus/Qwen2.5-0.5B-Instruct-exl2/blob/main/config.json

Suggested change
- "exllamav2"
+ "exl2"

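For context on why the spelling matters: vLLM resolves the quantization method from the quant_method tag inside the checkpoint's quantization_config, so a registry key named exllamav2 would never be matched by an exl2 checkpoint. A rough sketch of that lookup, with a dict that only mimics the referenced config.json rather than reproducing it, and a simplified membership check rather than the exact vLLM code path:

# Illustrative only: the quantization_config dict mirrors the quant_method
# tag carried by the exl2 checkpoint referenced above (other fields omitted).
quantization_config = {"quant_method": "exl2"}

QUANTIZATION_METHODS = ["awq", "gptq", "exl2"]   # abbreviated registry keys

method = quantization_config["quant_method"]
if method not in QUANTIZATION_METHODS:           # "exllamav2" would not match
    raise ValueError(f"Unknown quantization method: {method}")
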
@@ -79,6 +81,7 @@ def get_quantization_config(quantization: str) -> Type[QuantizationConfig]:
"experts_int8": ExpertsInt8Config,
"neuron_quant": NeuronQuantConfig,
"ipex": IPEXConfig,
"exllamav2": Exl2Config,
mgoin (Member):

Suggested change
- "exllamav2": Exl2Config,
+ "exl2": Exl2Config,

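Similarly, a simplified sketch of how the second mapping is consumed by get_quantization_config, reconstructed from the hunk header above; the real code in vllm/model_executor/layers/quantization/__init__.py may differ in its details:

# Simplified sketch, not the exact vLLM source: the dispatch that turns the
# registered string into a config class, which is why the dict key must use
# the same "exl2" spelling as the checkpoint tag.
from typing import Dict, Type


class QuantizationConfig:
    """Stand-in for vLLM's QuantizationConfig base class."""


class Exl2Config(QuantizationConfig):
    """Stand-in for the Exl2Config added by this PR."""


QUANTIZATION_METHODS: Dict[str, Type[QuantizationConfig]] = {
    "exl2": Exl2Config,              # key matches quant_method in config.json
}


def get_quantization_config(quantization: str) -> Type[QuantizationConfig]:
    if quantization not in QUANTIZATION_METHODS:
        raise ValueError(f"Invalid quantization method: {quantization}")
    return QUANTIZATION_METHODS[quantization]
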
Development

Successfully merging this pull request may close these issues.

ExLlamaV2: exl2 support
2 participants