
[Kernel] Add ExLlamaV2 Weight Quantization Support #11348

Draft

AlpinDale wants to merge 4 commits into main
Conversation

AlpinDale (Contributor) commented Dec 19, 2024

Work in Progress.

FIX #3203


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run the other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

mgoin (Member) left a comment


Tried loading a Qwen exl2 model and ran into issues with uninitialized parameter usage:

vllm serve cgus/Qwen2.5-0.5B-Instruct-exl2 --port 9000 --dtype float16
...
Traceback (most recent call last):
  File "/home/mgoin/.local/share/uv/python/cpython-3.12.4-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/mgoin/.local/share/uv/python/cpython-3.12.4-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/mgoin/code/alpin-vllm/vllm/engine/multiprocessing/engine.py", line 368, in run_mp_engine
    raise e
  File "/home/mgoin/code/alpin-vllm/vllm/engine/multiprocessing/engine.py", line 357, in run_mp_engine
    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/alpin-vllm/vllm/engine/multiprocessing/engine.py", line 119, in from_engine_args
    return cls(ipc_path=ipc_path,
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/alpin-vllm/vllm/engine/multiprocessing/engine.py", line 71, in __init__
    self.engine = LLMEngine(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/alpin-vllm/vllm/engine/llm_engine.py", line 273, in __init__
    self.model_executor = executor_class(vllm_config=vllm_config, )
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/alpin-vllm/vllm/executor/executor_base.py", line 36, in __init__
    self._init_executor()
  File "/home/mgoin/code/alpin-vllm/vllm/executor/gpu_executor.py", line 35, in _init_executor
    self.driver_worker.load_model()
  File "/home/mgoin/code/alpin-vllm/vllm/worker/worker.py", line 155, in load_model
    self.model_runner.load_model()
  File "/home/mgoin/code/alpin-vllm/vllm/worker/model_runner.py", line 1094, in load_model
    self.model = get_model(vllm_config=self.vllm_config)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/alpin-vllm/vllm/model_executor/model_loader/__init__.py", line 12, in get_model
    return loader.load_model(vllm_config=vllm_config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/alpin-vllm/vllm/model_executor/model_loader/loader.py", line 364, in load_model
    loaded_weights = model.load_weights(
                     ^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/alpin-vllm/vllm/model_executor/models/qwen2.py", line 506, in load_weights
    return loader.load_weights(weights)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/alpin-vllm/vllm/model_executor/models/utils.py", line 237, in load_weights
    autoloaded_weights = set(self._load_module("", self.module, weights))
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/alpin-vllm/vllm/model_executor/models/utils.py", line 198, in _load_module
    yield from self._load_module(prefix,
  File "/home/mgoin/code/alpin-vllm/vllm/model_executor/models/utils.py", line 175, in _load_module
    loaded_params = module_load_weights(weights)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/alpin-vllm/vllm/model_executor/models/qwen2.py", line 396, in load_weights
    weight_loader(param, loaded_weight)
  File "/home/mgoin/code/alpin-vllm/vllm/model_executor/model_loader/weight_utils.py", line 531, in default_weight_loader
    if param.numel() == 1 and loaded_weight.numel() == 1:
       ^^^^^^^^^^^^^
  File "/home/mgoin/venvs/alpin/lib/python3.12/site-packages/torch/nn/parameter.py", line 168, in __torch_function__
    raise ValueError(
ValueError: Attempted to use an uninitialized parameter in <method 'numel' of 'torch._C.TensorBase' objects>. This error happens when you are using a `LazyModule` or explicitly manipulating `torch.nn.parameter.UninitializedParameter` objects. When using LazyModules Call `forward` with a dummy batch to initialize the parameters before calling torch functions
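
The traceback above comes from default_weight_loader calling param.numel() on a parameter that is still a torch.nn.parameter.UninitializedParameter: exl2 tensor shapes are only known once the checkpoint is read, so the quantized layers need a weight loader that materializes the parameter first. A minimal sketch of that idea follows; the name exl2_weight_loader and the overall approach are assumptions for illustration, not the code in this PR.

# Hedged sketch only: one way a loader could avoid the UninitializedParameter
# error above. exl2_weight_loader is a hypothetical name, not part of this PR.
import torch
from torch.nn.parameter import UninitializedParameter


def exl2_weight_loader(param: torch.nn.Parameter,
                       loaded_weight: torch.Tensor) -> None:
    """Copy a checkpoint tensor into param, materializing it first when its
    shape is only known from the checkpoint (as with exl2 tensors)."""
    if isinstance(param, UninitializedParameter):
        # Allocate real storage using the shape/dtype read from the checkpoint.
        param.materialize(loaded_weight.shape, dtype=loaded_weight.dtype)
    assert param.shape == loaded_weight.shape, (
        f"shape mismatch: {param.shape} vs {loaded_weight.shape}")
    param.data.copy_(loaded_weight)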

@@ -26,6 +26,7 @@
"experts_int8",
"neuron_quant",
"ipex",
"exllamav2"
mgoin (Member):

Needs to be exl2 to match the tag in quantization_config - reference https://huggingface.co/cgus/Qwen2.5-0.5B-Instruct-exl2/blob/main/config.json

Suggested change
- "exllamav2"
+ "exl2"

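For context on why the spelling matters: vLLM resolves the quantization method from the quant_method tag inside the checkpoint's quantization_config, so a registry key named exllamav2 would never be matched by an exl2 checkpoint. A rough sketch of that lookup, with a dict that only mimics the referenced config.json rather than reproducing it, and a simplified membership check rather than the exact vLLM code path:

# Illustrative only: the quantization_config dict mirrors the quant_method
# tag carried by the exl2 checkpoint referenced above (other fields omitted).
quantization_config = {"quant_method": "exl2"}

QUANTIZATION_METHODS = ["awq", "gptq", "exl2"]   # abbreviated registry keys

method = quantization_config["quant_method"]
if method not in QUANTIZATION_METHODS:           # "exllamav2" would not match
    raise ValueError(f"Unknown quantization method: {method}")
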
@@ -79,6 +81,7 @@ def get_quantization_config(quantization: str) -> Type[QuantizationConfig]:
"experts_int8": ExpertsInt8Config,
"neuron_quant": NeuronQuantConfig,
"ipex": IPEXConfig,
"exllamav2": Exl2Config,
mgoin (Member):

Suggested change
- "exllamav2": Exl2Config,
+ "exl2": Exl2Config,

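Similarly, a simplified sketch of how the second mapping is consumed by get_quantization_config, reconstructed from the hunk header above; the real code in vllm/model_executor/layers/quantization/__init__.py may differ in its details:

# Simplified sketch, not the exact vLLM source: the dispatch that turns the
# registered string into a config class, which is why the dict key must use
# the same "exl2" spelling as the checkpoint tag.
from typing import Dict, Type


class QuantizationConfig:
    """Stand-in for vLLM's QuantizationConfig base class."""


class Exl2Config(QuantizationConfig):
    """Stand-in for the Exl2Config added by this PR."""


QUANTIZATION_METHODS: Dict[str, Type[QuantizationConfig]] = {
    "exl2": Exl2Config,              # key matches quant_method in config.json
}


def get_quantization_config(quantization: str) -> Type[QuantizationConfig]:
    if quantization not in QUANTIZATION_METHODS:
        raise ValueError(f"Invalid quantization method: {quantization}")
    return QUANTIZATION_METHODS[quantization]
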
Development

Successfully merging this pull request may close these issues.

ExLlamaV2: exl2 support
2 participants