gptq_benchmark_update #1420

Merged: 25 commits, Nov 1, 2023

Commits:
- 5db2541 add_exllamav2 (SunMarc, Sep 27, 2023)
- 03441b8 style (SunMarc, Sep 27, 2023)
- 80d085e fix doc (SunMarc, Sep 27, 2023)
- 0c53c2f simplify script (SunMarc, Sep 27, 2023)
- 216213e style (SunMarc, Sep 27, 2023)
- f2dbdc2 Merge branch 'add_exllamav2' into update-benchmark-gptq (SunMarc, Sep 27, 2023)
- dadc6dc update perplexity measure (SunMarc, Sep 27, 2023)
- cf4019d Revert "Merge branch 'add_exllamav2' into update-benchmark-gptq" (SunMarc, Sep 27, 2023)
- 97a7c62 Merge branch 'add_exllamav2' into update-benchmark-gptq (SunMarc, Sep 27, 2023)
- 62b89d9 fix arg in llama attention (SunMarc, Sep 28, 2023)
- 1ef6ce5 flash_attention arg (SunMarc, Sep 29, 2023)
- f727313 Revert "Merge branch 'add_exllamav2' into update-benchmark-gptq" (SunMarc, Sep 29, 2023)
- 1eaedeb update benchmark prefill and generate (SunMarc, Sep 29, 2023)
- 27a1ca1 Merge branch 'main' into update-benchmark-gptq (SunMarc, Oct 24, 2023)
- a623556 replace by use_exllama_v2 (SunMarc, Oct 24, 2023)
- 26d87e4 update benchmark arg (SunMarc, Oct 27, 2023)
- 4f797b1 switch to a config_dict instead of disable_exllamav2 (SunMarc, Nov 1, 2023)
- a89f0a3 Merge remote-tracking branch 'upstream/main' into better_config_for_e… (SunMarc, Nov 1, 2023)
- 1d845c7 Apply suggestions from code review (SunMarc, Nov 1, 2023)
- d5b298f better tests (SunMarc, Nov 1, 2023)
- 0694835 Merge remote-tracking branch 'origin/better_config_for_exllama' into … (SunMarc, Nov 1, 2023)
- c21601d style (SunMarc, Nov 1, 2023)
- 1c8c9e1 Merge branch 'better_config_for_exllama' into update-benchmark-gptq (SunMarc, Nov 1, 2023)
- 11e71e2 style (SunMarc, Nov 1, 2023)
- 71c2235 Merge remote-tracking branch 'upstream/main' into update-benchmark-gptq (SunMarc, Nov 1, 2023)
80 changes: 60 additions & 20 deletions tests/benchmark/README.md
@@ -4,29 +4,27 @@ Please refer to https://medium.com/pytorch/bettertransformer-out-of-the-box-perf

# GPTQ benchmark

The results below are for AutoGPTQ 0.4.2, PyTorch 2.0.1, bitsandbytes 0.41.1, transformers 4.32.
The results below are for AutoGPTQ 0.5.0, PyTorch 2.0.1, bitsandbytes 0.41.1, transformers 4.35.

## Generation benchmark results

Run

```shell
git clone --branch main https://huggingface.co/TheBloke/Llama-2-13B-chat-GPTQ
cd Llama-2-13B-chat-GPTQ
mv gptq_model-4bit-128g.safetensors model.safetensors
mv quantize_config.json quantization_config.json

# pytorch fp16
CUDA_VISIBLE_DEVICES=0 python benchmark_gptq.py --model daryl149/llama-2-13b-chat-hf --sweep --num-batches 4 --task text-generation
CUDA_VISIBLE_DEVICES=0 python benchmark_gptq.py --model meta-llama/Llama-2-13b-chat-hf --sweep --num-batches 4 --task text-generation --generate

# GPTQ with exllamav2 kernel (int4/fp16)
CUDA_VISIBLE_DEVICES=0 python benchmark_gptq.py --model TheBloke/Llama-2-13B-chat-GPTQ --sweep --num-batches 4 --gptq --task text-generation --use-exllama --exllama-version 2 --generate

# GPTQ with exllama kernel (int4/fp16)
CUDA_VISIBLE_DEVICES=0 python benchmark_gptq.py --model daryl149/llama-2-13b-chat-hf --gptq-model /path/to/Llama-2-13B-chat-GPTQ/ --sweep --num-batches 4 --gptq --task text-generation
CUDA_VISIBLE_DEVICES=0 python benchmark_gptq.py --model TheBloke/Llama-2-13B-chat-GPTQ --sweep --num-batches 4 --gptq --task text-generation --use-exllama --generate

# GPTQ without exllama kernel (int4/fp16)
CUDA_VISIBLE_DEVICES=0 python benchmark_gptq.py --model daryl149/llama-2-13b-chat-hf --gptq-model /path/to/Llama-2-13B-chat-GPTQ/ --sweep --num-batches 4 --gptq --task text-generation --disable-exllama
CUDA_VISIBLE_DEVICES=0 python benchmark_gptq.py --model TheBloke/Llama-2-13B-chat-GPTQ --sweep --num-batches 4 --gptq --task text-generation --generate

# using bitsandbytes fp4/fp16 scheme
CUDA_VISIBLE_DEVICES=0 python benchmark_gptq.py --model daryl149/llama-2-13b-chat-hf --sweep --num-batches 4 --task text-generation --bitsandbytes
CUDA_VISIBLE_DEVICES=0 python benchmark_gptq.py --model meta-llama/Llama-2-13b-chat-hf --sweep --num-batches 4 --task text-generation --bitsandbytes --generate
```
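For context, the GPTQ runs above load the pre-quantized checkpoint through `transformers`. Below is a minimal sketch of what enabling the exllamav2 kernel looks like, assuming the `GPTQConfig` arguments introduced around transformers 4.35 (`use_exllama`, `exllama_config`); the benchmark script may wire this up differently.

```python
# Minimal sketch: loading a pre-quantized GPTQ checkpoint with the exllamav2 kernel.
# The use_exllama/exllama_config arguments are assumptions based on transformers 4.35;
# benchmark_gptq.py may configure the kernel differently.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "TheBloke/Llama-2-13B-chat-GPTQ"
quantization_config = GPTQConfig(bits=4, use_exllama=True, exllama_config={"version": 2})

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    quantization_config=quantization_config,
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```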

Here are results obtained on a single NVIDIA A100-SXM4-80GB GPU. We use a prompt length of 512, and generate exactly 512 new tokens. Each generation is repeated for 4 batches, and metrics are averaged over the number of batches and generation length.
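As a rough sketch of how the reported metrics relate to raw timings (an assumption about the aggregation, not the exact code in `benchmark_gptq.py`):

```python
# Hypothetical helper showing how per-token latency and throughput are assumed to be
# derived from raw wall-clock timings; benchmark_gptq.py may aggregate differently.
def summarize(timings_s, batch_size, new_tokens=512):
    """timings_s holds one wall-clock measurement (in seconds) per benchmarked batch."""
    total_time_s = sum(timings_s)
    num_batches = len(timings_s)
    per_token_latency_ms = 1000 * total_time_s / (num_batches * new_tokens)
    throughput_tok_s = batch_size * new_tokens * num_batches / total_time_s
    return per_token_latency_ms, throughput_tok_s

# Illustrative timings only (roughly consistent with the fp16, batch size 1 row).
latency_ms, tok_s = summarize([18.9, 18.9, 19.0, 18.9], batch_size=1)
print(f"{latency_ms:.2f} ms/token, {tok_s:.2f} tok/s")
```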
@@ -42,6 +40,7 @@ Bitsandbytes uses the fp4 scheme, with the compute in fp16.
|quantization |act_order|bits|group_size|kernel|Load time (s)|Per-token latency (ms)|Throughput (tok/s)|Peak memory (MB)|
|-----|---------|----|----------|------|-------------|----------------------|------------------|----------------|
|None|None |None|None |None |26.0 |36.958 |27.058 |29152.98 |
| gptq | False | 4 | 128 | exllamav2 | 36.07 | 32.25 | 31.01 | 11313.75 |
|gptq |False |4 |128 |exllama|36.2 |33.711 |29.663 |10484.34 |
|gptq |False |4 |128 |autogptq-cuda-old|36.2 |46.44 |21.53 |10344.62 |
|bitsandbytes|None |None|None |None |37.64 |52.00 |19.23 |11018.36 |
@@ -51,6 +50,7 @@ Bitsandbytes uses the fp4 scheme, with the compute in fp16.
|quantization |act_order|bits|group_size|kernel|Load time (s)|Per-token latency (ms)|Throughput (tok/s)|Peak memory (MB)|
|-----|---------|----|----------|------|-------------|----------------------|------------------|----------------|
|None|None |None|None |None |26.0 |37.35 |53.53 |30831.09 |
| gptq | False | 4 | 128 | exllamav2 | 36.07 | 35.81 | 55.85 | 12112.42 |
|gptq |False |4 |128 |exllama|36.2 |37.25 |53.68 |12162.43 |
|gptq |False |4 |128 |autogptq-cuda-old|36.2 |47.41 |42.18 |12020.34 |
|bitsandbytes|None |None|None |None |37.64 |74.62 |26.80 |12834.84 |
@@ -60,6 +60,7 @@ Bitsandbytes uses the fp4 scheme, with the compute in fp16.
|quantization |act_order|bits|group_size|kernel |Load time (s)|Per-token latency (ms)|Throughput (tok/s)|Peak memory (MB)|
|-----|---------|----|----------|-----------------|-------------|----------------------|------------------|----------------|
|None|None |None|None |None |26.0 |37.89 |105.55 |34187.22 |
| gptq | False | 4 | 128 | exllamav2 | 36.07 | 36.04 | 110.98 | 16387.19 |
|gptq |False |4 |128 |exllama |36.2 |54.14 |73.87 |15518.55 |
|gptq |False |4 |128 |autogptq-cuda-old|36.2 |60.98 |65.59 |15374.67 |
|bitsandbytes|None |None|None |None |37.64 |80.24 |49.85 |16187.69 |
@@ -69,6 +70,7 @@ Bitsandbytes uses the fp4 scheme, with the compute in fp16.
|quantization |act_order|bits|group_size|kernel|Load time (s)|Per-token latency (ms)|Throughput (tok/s)|Peak memory (MB)|
|-----|---------|----|----------|------|-------------|----------------------|------------------|----------------|
|None|None |None|None |None |26.0 |47.37 |168.86 |40327.62 |
| gptq | False | 4 | 128 | exllamav2 | 36.07 | 47.31 | 169.11 | 22463.02 |
|gptq |False |4 |128 |exllama|36.2 |73.57 |108.73 |21864.56 |
|gptq |False |4 |128 |autogptq-cuda-old|36.2 |104.44 |76.59 |20987.68 |
|bitsandbytes|None |None|None |None |37.64 |91.29 |87.63 |22894.02 |
@@ -78,6 +80,7 @@ Bitsandbytes uses the fp4 scheme, with the compute in fp16.
|quantization |act_order|bits|group_size|kernel|Load time (s)|Per-token latency (ms)|Throughput (tok/s)|Peak memory (MB)|
|-----|---------|----|----------|------|-------------|----------------------|------------------|----------------|
|None|None |None|None |None |26.0 |69.94 |228.76 |53986.51 |
| gptq | False | 4 | 128 | exllamav2 | 36.07 | 83.09 | 192.55 | 35740.95 |
|gptq |False |4 |128 |exllama|36.2 |95.41 |167.68 |34777.04 |
|gptq |False |4 |128 |autogptq-cuda-old|36.2 |192.48 |83.12 |35497.62 |
|bitsandbytes|None |None|None |None |37.64 |113.98 |140.38 |35532.37 |
@@ -88,16 +91,19 @@ Run

```shell
# pytorch fp16
CUDA_VISIBLE_DEVICES=0 python benchmark_gptq.py --model daryl149/llama-2-13b-chat-hf --sweep --num-batches 10 --task text-generation --prefill
CUDA_VISIBLE_DEVICES=0 python benchmark_gptq.py --model meta-llama/Llama-2-13b-chat-hf --sweep --num-batches 10 --task text-generation --prefill --generate

# GPTQ with exllama kernel (int4/fp16)
CUDA_VISIBLE_DEVICES=0 python benchmark_gptq.py --model daryl149/llama-2-13b-chat-hf --gptq-model ../../../Llama-2-13B-chat-GPTQ/ --sweep --num-batches 10 --gptq --task text-generation --prefill
# GPTQ with exllamav2 kernel (int4/fp16)
CUDA_VISIBLE_DEVICES=0 python benchmark_gptq.py --model TheBloke/Llama-2-13B-chat-GPTQ --sweep --num-batches 10 --gptq --task text-generation --prefill --use-exllama --exllama-version 2 --generate

# GPTQ with exllama kernel (int4/fp16)
CUDA_VISIBLE_DEVICES=0 python benchmark_gptq.py --model TheBloke/Llama-2-13B-chat-GPTQ --sweep --num-batches 10 --gptq --task text-generation --prefill --use-exllama --generate

# GPTQ without exllama kernel (int4/fp16)
CUDA_VISIBLE_DEVICES=0 python benchmark_gptq.py --model daryl149/llama-2-13b-chat-hf --gptq-model ../../../Llama-2-13B-chat-GPTQ/ --sweep --num-batches 10 --gptq --task text-generation --prefill --disable-exllama
CUDA_VISIBLE_DEVICES=0 python benchmark_gptq.py --model TheBloke/Llama-2-13B-chat-GPTQ --sweep --num-batches 10 --gptq --task text-generation --prefill --generate

# using bitsandbytes fp4/fp16 scheme
CUDA_VISIBLE_DEVICES=0 python benchmark_gptq.py --model daryl149/llama-2-13b-chat-hf --sweep --num-batches 10 --task text-generation --prefill --bitsandbytes
CUDA_VISIBLE_DEVICES=0 python benchmark_gptq.py --model meta-llama/Llama-2-13b-chat-hf --sweep --num-batches 10 --task text-generation --prefill --bitsandbytes --generate
```

The benchmark below is for a prompt length of 512, measuring only the prefill step on a single NVIDIA A100-SXM4-80GB GPU. The forward is repeated 10 times. This benchmark typically corresponds to the forward during training (with the difference that here `generate` is called, which has some overhead).
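A minimal sketch of such a prefill measurement, assuming it is approximated by timing `generate` with a single new token (the benchmark script may measure this differently):

```python
# Hypothetical sketch of the prefill measurement: time generate() on a 512-token
# prompt with max_new_tokens=1, so essentially only the prefill forward is timed.
import time
import torch

@torch.no_grad()
def time_prefill_ms(model, input_ids, num_batches=10):
    timings = []
    for _ in range(num_batches):
        torch.cuda.synchronize()
        start = time.perf_counter()
        model.generate(input_ids, min_new_tokens=1, max_new_tokens=1)
        torch.cuda.synchronize()
        timings.append(time.perf_counter() - start)
    return 1000 * sum(timings) / len(timings)  # mean latency in ms

# Illustrative usage with a random 512-token prompt at batch size 1:
# input_ids = torch.randint(0, model.config.vocab_size, (1, 512), device=model.device)
# print(f"prefill latency: {time_prefill_ms(model, input_ids):.2f} ms")
```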
@@ -107,6 +113,7 @@ The benchmark below is for a prompt length of 512, measuring only the prefill st
|quantization |act_order|bits|group_size|kernel |prompt_length|new_tokens|Load time (s)|Per-token latency (ms)|Throughput (tok/s)|Max memory (MB)|
|-----|---------|----|----------|-----------------|-------------|----------|-------------|----------------------|------------------|---------------|
|None|None |None|None |None |512 |1 |27.22 |96.38 |10.38 |27999.54 |
| gptq | False | 4 | 128 | exllamav2 | 512 | 1 | 6.63 | 116.07 | 8.62 | 10260.35 |
|gptq |False |4 |128 |exllama |512 |1 |38.35 |112.54 |8.89 |9330.89 |
|gptq |False |4 |128 |autogptq-cuda-old|512 |1 |43.94 |368.13 |2.72 |9474.19 |
|bitsandbytes|None|None|None|None|512|1 |37.46|139.17 |7.19 |9952.65 |
@@ -116,6 +123,7 @@ The benchmark below is for a prompt length of 512, measuring only the prefill st
|quantization |act_order|bits|group_size|kernel |prompt_length|new_tokens|Load time (s)|Per-token latency (ms)|Throughput (tok/s)|Max memory (MB)|
|-----|---------|----|----------|-----------------|-------------|----------|-------------|----------------------|------------------|---------------|
|None|None |None|None |None |512 |1 |27.22 |169.95 |11.77 |28524.37 |
| gptq | False | 4 | 128 | exllamav2 | 512 | 1 | 6.63 | 212.07 | 9.43 | 10783.60 |
|gptq |False |4 |128 |exllama |512 |1 |38.35 |190.44 |10.50 |9855.71 |
|gptq |False |4 |128 |autogptq-cuda-old|512 |1 |43.94 |443.80 |4.51 |9928.23 |
|bitsandbytes|None|None|None|None|512|1 |37.46|212.76 |9.40 |10421.89|
@@ -125,6 +133,7 @@ The benchmark below is for a prompt length of 512, measuring only the prefill st
|quantization |act_order|bits|group_size|kernel |prompt_length|new_tokens|Load time (s)|Per-token latency (ms)|Throughput (tok/s)|Max memory (MB)|
|-----|---------|----|----------|-----------------|-------------|----------|-------------|----------------------|------------------|---------------|
|None|None |None|None |None |512 |1 |27.22 |305.99 |13.07 |29574.01 |
| gptq | False | 4 | 128 | exllamav2 | 512 | 1 | 6.63 | 385.58 | 10.37 | 11829.59 |
|gptq |False |4 |128 |exllama |512 |1 |38.35 |345.54 |11.58 |10905.35 |
|gptq |False |4 |128 |autogptq-cuda-old|512 |1 |43.94 |597.24 |6.70 |10838.42 |
|bitsandbytes|None|None|None|None|512|1 |37.46|349.18 |11.46|11440.08|
@@ -134,15 +143,46 @@ The benchmark below is for a prompt length of 512, measuring only the prefill st
|quantization |act_order|bits|group_size|kernel |prompt_length|new_tokens|Load time (s)|Per-token latency (ms)|Throughput (tok/s)|Max memory (MB)|
|-----|---------|----|----------|-----------------|-------------|----------|-------------|----------------------|------------------|---------------|
|None|None |None|None |None |512 |1 |27.22 |600.47 |13.32 |31673.30 |
| gptq | False | 4 | 128 | exllamav2 | 512 | 1 | 6.63 | 753.06 | 10.62 | 13920.50 |
|gptq |False |4 |128 |exllama |512 |1 |38.35 |659.61 |12.13 |13004.64 |
|gptq |False |4 |128 |autogptq-cuda-old|512 |1 |43.94 |909.09 |8.80 |12862.18 |
|bitsandbytes|None|None|None|None|512|1 |37.46|643.42 |12.43|13539.37|

### Batch size = 16

|quantization |act_order|bits|group_size|kernel |num_batches|batch_size|prompt_length|new_tokens|Load time (s)|Per-token latency (ms)|Throughput (tok/s)|Max memory (MB)|
|-----|---------|----|----------|-----------------|-----------|----------|-------------|----------|-------------|----------------------|------------------|---------------|
|None|None |None|None |None |10 |16 |512 |1 |27.22 |1209.07 |13.23 |35871.88 |
|gptq |False |4 |128 |exllama |10 |16 |512 |1 |38.35 |1280.25 |12.50 |17203.22 |
|gptq |False |4 |128 |autogptq-cuda-old|10 |16 |512 |1 |43.94 |1533.54 |10.43 |17060.76 |
|quantization |act_order|bits|group_size|kernel |prompt_length|new_tokens|Load time (s)|Per-token latency (ms)|Throughput (tok/s)|Max memory (MB)|
|-----|---------|----|-----------|----------|-------------|----------|-------------|----------------------|------------------|---------------|
|None|None |None|None |None |512 |1 |27.22 |1209.07 |13.23 |35871.88 |
| gptq | False | 4 | 128 | exllamav2 | 512 | 1 | 6.63 | 1467.36 | 10.90 | 18104.44 |
|gptq |False |4 |128 |exllama |512 |1 |38.35 |1280.25 |12.50 |17203.22 |
|gptq |False |4 |128 |autogptq-cuda-old |512 |1 |43.94 |1533.54 |10.43 |17060.76 |
|bitsandbytes|None|None|None|None|512|1 |37.46|1256.88|12.73|17737.95|

## Perplexity benchmark results

Run

```shell
# pytorch fp16
CUDA_VISIBLE_DEVICES=0 python benchmark_gptq.py --model meta-llama/Llama-2-13b-chat-hf --task text-generation --ppl

# GPTQ with exllamav2 kernel (int4/fp16)
CUDA_VISIBLE_DEVICES=0 python benchmark_gptq.py --model TheBloke/Llama-2-13B-chat-GPTQ --revision gptq-4bit-128g-actorder_True --gptq --task text-generation --use-exllama --exllama-version 2 --ppl

# GPTQ with exllama kernel (int4/fp16)
CUDA_VISIBLE_DEVICES=0 python benchmark_gptq.py --model TheBloke/Llama-2-13B-chat-GPTQ --revision gptq-4bit-128g-actorder_True --gptq --task text-generation --use-exllama --ppl

# GPTQ without exllama kernel (int4/fp16)
CUDA_VISIBLE_DEVICES=0 python benchmark_gptq.py --model TheBloke/Llama-2-13B-chat-GPTQ --revision gptq-4bit-128g-actorder_True --gptq --task text-generation --ppl

# using bitsandbytes fp4/fp16 scheme
CUDA_VISIBLE_DEVICES=0 python benchmark_gptq.py --model meta-llama/Llama-2-13b-chat-hf --task text-generation --bitsandbytes --ppl
```
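A self-contained sketch of a strided perplexity evaluation is shown below; the dataset (wikitext-2), window length, and stride are assumptions and may differ from what `--ppl` actually uses.

```python
# Hedged sketch of a strided perplexity evaluation; the dataset and stride that
# benchmark_gptq.py uses for --ppl may differ.
import torch
from datasets import load_dataset

@torch.no_grad()
def perplexity(model, tokenizer, max_length=2048, stride=512):
    text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
    input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)

    nlls, prev_end = [], 0
    for begin in range(0, input_ids.size(1), stride):
        end = min(begin + max_length, input_ids.size(1))
        target_len = end - prev_end  # only score tokens not scored in a previous window
        ids = input_ids[:, begin:end]
        labels = ids.clone()
        labels[:, :-target_len] = -100  # mask the overlapping prefix
        loss = model(ids, labels=labels).loss
        nlls.append(loss * target_len)
        prev_end = end
        if end == input_ids.size(1):
            break
    return torch.exp(torch.stack(nlls).sum() / end).item()
```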

| quantization | act_order | bits | group_size | kernel | perplexity |
|--------------|-----------|------|------------|------------------|------------|
| None | None | None | None | None | 6.61 |
| gptq | True | 4 | 128 | exllamav2 | 6.77 |
| gptq | True | 4 | 128 | exllama | 6.77 |
| gptq | True | 4 | 128 | autogptq-cuda-old| 6.77 |
| bitsandbytes | None | 4 | None | None | 6.78 |