GPTQ 4bit Inference

FastChat supports GPTQ 4-bit inference through GPTQ-for-LLaMa.

Windows users: use the old-cuda branch.
Linux users: the fastest-inference-4bit branch is recommended.

Install

Setup environment:

# cd /path/to/FastChat
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git repositories/GPTQ-for-LLaMa
cd repositories/GPTQ-for-LLaMa
# Windows users should use the `old-cuda` branch instead
git switch fastest-inference-4bit
# Install `quant-cuda` package in FastChat's virtualenv
python3 setup_cuda.py install
pip3 install texttable
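
After the install steps above, a quick import check can confirm that the build landed in the active virtualenv. This is a minimal sketch: `check_import` is a hypothetical helper, and `quant_cuda` is assumed to be the module name that `setup_cuda.py` builds.

```shell
# Hypothetical helper: report whether a Python import statement succeeds.
check_import() {
    if python3 -c "import $1" >/dev/null 2>&1; then
        echo "$1: ok"
    else
        echo "$1: missing"
    fi
}

# torch goes first so its shared libraries are loaded before the extension.
# `quant_cuda` is an assumed module name, not confirmed by this guide.
check_import "torch, quant_cuda"
check_import texttable
```

If either line reports `missing`, re-run the corresponding install command from inside FastChat's virtualenv.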

Chat with the CLI (the model must already be downloaded; see the download commands under "Start model worker"):

python3 -m fastchat.serve.cli \
    --model-path models/vicuna-7B-1.1-GPTQ-4bit-128g \
    --gptq-wbits 4 \
    --gptq-groupsize 128

Start model worker:

# Download the quantized model from Hugging Face
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/TheBloke/vicuna-7B-1.1-GPTQ-4bit-128g models/vicuna-7B-1.1-GPTQ-4bit-128g

python3 -m fastchat.serve.model_worker \
    --model-path models/vicuna-7B-1.1-GPTQ-4bit-128g \
    --gptq-wbits 4 \
    --gptq-groupsize 128

# Optionally point --gptq-ckpt at a specific quantized checkpoint file
python3 -m fastchat.serve.model_worker \
    --model-path models/vicuna-7B-1.1-GPTQ-4bit-128g \
    --gptq-ckpt models/vicuna-7B-1.1-GPTQ-4bit-128g/vicuna-7B-1.1-GPTQ-4bit-128g.safetensors \
    --gptq-wbits 4 \
    --gptq-groupsize 128 \
    --gptq-act-order
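
When the checkpoint file name is not known in advance, it can be looked up in the model directory before passing it to `--gptq-ckpt`. The sketch below assumes the checkpoint is a single `.safetensors` file sitting directly inside the model directory, as in the example above; `find_ckpt` is a hypothetical helper, not part of FastChat.

```shell
# Hypothetical helper: print the first *.safetensors file directly inside
# a model directory (empty output if the directory is missing or has none).
find_ckpt() {
    find "$1" -maxdepth 1 -type f -name '*.safetensors' 2>/dev/null | sort | head -n 1
}

model_dir=models/vicuna-7B-1.1-GPTQ-4bit-128g
ckpt=$(find_ckpt "$model_dir")
if [ -n "$ckpt" ]; then
    echo "checkpoint: $ckpt"   # pass this path to --gptq-ckpt
else
    echo "no .safetensors file found in $model_dir"
fi
```

If a directory holds several checkpoints, this picks the alphabetically first one, so it is safer to name the file explicitly in that case.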

Benchmark

LLaMA-13B	Branch	Bits	Group size	Memory (MiB)	PPL (c4)	Median (s/token)	Act-order	Speed-up
FP16	fastest-inference-4bit	16	-	26634	6.96	0.0383	-	1x
GPTQ	triton	4	128	8590	6.97	0.0551	-	0.69x
GPTQ	fastest-inference-4bit	4	128	8699	6.97	0.0429	true	0.89x
GPTQ	fastest-inference-4bit	4	128	8699	7.03	0.0287	false	1.33x
GPTQ	fastest-inference-4bit	4	-1	8448	7.12	0.0284	false	1.44x
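
The memory column can be read as a ratio against the FP16 row. The sketch below only does arithmetic on the figures listed in the table; `ratio` is a hypothetical helper, and the result is a rough indication for this one LLaMA-13B setup rather than a general claim.

```shell
# Hypothetical helper: print a/b rounded to two decimals.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

fp16_mib=26634   # memory of the FP16 row, in MiB
for gptq_mib in 8590 8699 8448; do
    echo "FP16 / GPTQ ($gptq_mib MiB) = $(ratio "$fp16_mib" "$gptq_mib")"
done
```

Each 4-bit row comes out at roughly one third of the FP16 footprint.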
