GPTQ / Quantization support? #174
Thanks for the feature request! Quantization is not currently supported, but it's definitely on our roadmap. Please stay tuned.
How do I best go about tracking this? Is there a Discord or public roadmap somewhere I can look at?
See the roadmap here: #244
I looked into this a bit today and it seems straightforward to integrate AutoGPTQ into vLLM, so I implemented a preliminary version for LLaMA (see this commit) and ran a few benchmarks on a single A100-80G. I don't know why, but it's slower than expected.
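For readers who want a picture of what such an integration involves, here is a minimal, illustrative sketch (not the linked commit) of a GPTQ-style 4-bit QuantLinear and a helper that swaps it in for a model's nn.Linear layers. The packing layout, group size, and helper names are assumptions for illustration only.

```python
import torch
import torch.nn as nn


class QuantLinear(nn.Module):
    """Toy GPTQ-style 4-bit linear: packed int4 weights plus per-group
    scales/zeros, dequantized on the fly in forward(). The layout is a
    simplified assumption, not the AutoGPTQ format."""

    def __init__(self, in_features: int, out_features: int, group_size: int = 128):
        super().__init__()
        self.in_features, self.out_features, self.group_size = in_features, out_features, group_size
        # 8 int4 values are packed into each int32 along the input dimension.
        self.register_buffer("qweight", torch.zeros(in_features // 8, out_features, dtype=torch.int32))
        self.register_buffer("scales", torch.ones(in_features // group_size, out_features, dtype=torch.float16))
        # Zero points are packed 8-per-int32 along the output dimension.
        self.register_buffer("qzeros", torch.zeros(in_features // group_size, out_features // 8, dtype=torch.int32))

    def dequantize(self) -> torch.Tensor:
        shifts = torch.arange(0, 32, 4, device=self.qweight.device)
        # Unpack int4 weights: [in/8, out] -> [in, out]
        w = ((self.qweight.unsqueeze(1) >> shifts.view(1, -1, 1)) & 0xF).reshape(-1, self.out_features)
        # Unpack int4 zero points: [in/group, out/8] -> [in/group, out]
        z = ((self.qzeros.unsqueeze(2) >> shifts.view(1, 1, -1)) & 0xF).reshape(self.qzeros.shape[0], -1)
        # Broadcast group-wise scales/zeros over their rows, then dequantize.
        scales = self.scales.repeat_interleave(self.group_size, dim=0)
        zeros = z.to(self.scales.dtype).repeat_interleave(self.group_size, dim=0)
        return (w.to(self.scales.dtype) - zeros) * scales

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.dequantize().to(x.dtype)


def replace_linear(module: nn.Module, group_size: int = 128) -> None:
    """Recursively swap nn.Linear layers (e.g. LLaMA's qkv/o/mlp projections)
    for the quantized version above."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, QuantLinear(child.in_features, child.out_features, group_size))
        else:
            replace_linear(child, group_size)
```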
What's the baseline with the normal (unquantized) version?
If you mean the throughput, it's in the table above. I dug into the kernel code of the quant linear layer and found that it falls back to dequantization followed by an fp16 matrix multiplication when the batch size is bigger than 8, so the performance degradation is understandable.
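In other words, the layer effectively dispatches on the number of input tokens. A rough sketch of the assumed behavior (the threshold constant and kernel stand-ins are placeholders, not the actual CUDA code):

```python
import torch

# Assumed threshold from the quant linear kernel described above;
# the real value lives in the CUDA code.
FUSED_KERNEL_MAX_TOKENS = 8


def quant_linear_forward(x, qweight, scales, qzeros, fused_gemv, dequantize):
    """x: [tokens, in_features]. `fused_gemv` and `dequantize` are stand-ins
    for the real quantized kernels."""
    if x.shape[0] <= FUSED_KERNEL_MAX_TOKENS:
        # Memory-bound regime: reading the int4 weights once per token is a win.
        return fused_gemv(x, qweight, scales, qzeros)
    # Compute-bound regime: dequantize the whole weight once, then run a plain
    # fp16 GEMM, so throughput cannot exceed the unquantized baseline here.
    return x @ dequantize(qweight, scales, qzeros).to(x.dtype)
```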
As an update, I added a tensor-parallel QuantLinear layer and supported most AutoGPTQ-compatible models in this branch. The code has not been thoroughly tested yet because there are far too many combinations of model architectures and GPTQ settings.
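For context, a column-parallel split of a GPTQ layer mainly means sharding the packed weight, scales, and zero points along the output dimension so each rank dequantizes only its slice. A small sketch under an assumed 4-bit packing (not the branch's actual code):

```python
import torch


def shard_gptq_column_parallel(qweight, scales, qzeros, tp_rank, tp_size):
    """Assumed shapes: qweight [in/8, out] int32, scales [in/groups, out] fp16,
    qzeros [in/groups, out/8] int32 (8 int4 values packed per int32)."""
    out_features = qweight.shape[1]
    assert out_features % (tp_size * 8) == 0, "output dim must split evenly across ranks"
    shard = out_features // tp_size
    cols = slice(tp_rank * shard, (tp_rank + 1) * shard)
    # qzeros packs 8 output columns per int32, so its slice is 8x narrower.
    zcols = slice(tp_rank * shard // 8, (tp_rank + 1) * shard // 8)
    return qweight[:, cols], scales[:, cols], qzeros[:, zcols]
```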
@chu-tianxiang I tried forking your
Hi, is baichuan-gptq supported?
Is acceleration for Qwen-72B-Chat-Int4 supported?
Something is off with this QLLM GPTQ quantization @wejoncy ... not all the dependencies are specified in the requirements file. Also, I tried quantizing three times and every time it breaks when it tries to save the file or when the quantization finishes. Tried Llama2-70b and Mistral 7B.
Hi, is there any update on 8-bit support? That would help Mixtral generate usable outputs on a single (non-overpriced) GPU.
I have successfully used both GPTQ and AWQ models with vLLM. Should this issue be considered solved @WoosukKwon? |
@hmellor it currently works with 4-bit, but not 8-bit. For now you have to use chu-tianxiang/vllm-gptq to get 8-bit support.
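For anyone landing here later, loading a 4-bit GPTQ checkpoint with a recent vLLM release looks roughly like this (the model name is just an example GPTQ checkpoint from the Hugging Face Hub):

```python
from vllm import LLM, SamplingParams

# Example GPTQ checkpoint; any 4-bit GPTQ model from the Hub should work similarly.
llm = LLM(model="TheBloke/Llama-2-7B-Chat-GPTQ", quantization="gptq")

outputs = llm.generate(
    ["Will vLLM support 4-bit GPTQ models?"],
    SamplingParams(temperature=0.8, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```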
Closing as this was resolved by #2330 |
Does vLLM support int-2 GPTQ models well? Thank you very much!
Sanity check done: Server mode; BS1 perf; Llama405b FP8
Will vLLM support 4-bit GPTQ models?