We extend gpt-fast to support SparQ attention, a bandwidth-efficient attention algorithm that speeds up generation for existing LLMs with no fine-tuning. For details of SparQ, see the paper.
The `main` branch tracks the gpt-fast repo. The `with-sparq` branch contains our modifications. You can compare `main` and `with-sparq` to see what we added.
You might also be interested in sparq-llama.cpp, our implementation of SparQ in llama.cpp.
We obtain the following speedups on an H100 PCIe, using BF16 for the model parameters and KV cache, and compressing the memory transfers 8x with SparQ:
"estimated theoretical max" shows an estimate of the best-cast speedup that could be achieved by SparQ if the attention operation was purely memory-bound, and all compute and communication was overlapped. See theoretical_speedups.py
for how this is calculated.
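To give a feel for this kind of estimate, here is a simplified sketch, not the actual logic in `theoretical_speedups.py`; the weight and KV cache sizes below are made-up examples. A memory-bound bound on the speedup can be computed from the bytes transferred per generated token with and without SparQ:

```python
# Simplified sketch of a memory-bound speedup estimate -- the real
# calculation lives in theoretical_speedups.py and may differ.

def estimated_max_speedup(param_bytes: float, kv_bytes: float,
                          compression: float = 8.0) -> float:
    """Best-case speedup assuming generation is purely memory-bound and all
    compute/communication overlaps with the memory transfers.

    param_bytes: bytes of model weights read per generated token
    kv_bytes:    bytes of KV cache read per generated token (dense attention)
    compression: assumed reduction in attention transfers from SparQ
    """
    dense = param_bytes + kv_bytes
    sparq = param_bytes + kv_bytes / compression
    return dense / sparq

# Hypothetical example: ~13.5 GB of BF16 weights, 4 GB of KV cache read per token
print(estimated_max_speedup(13.5e9, 4e9))  # ~1.25x in this made-up setting
```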
- Install Python >= 3.10
- Install the requirements:
  ```sh
  pip install -r requirements.txt
  ```
- Run
  ```sh
  huggingface-cli login
  ```
  or set the `HF_TOKEN` environment variable. The associated account must have access to `meta-llama/Llama-2-7b-chat-hf`.
- Download Llama 2 7b from Hugging Face, and prepare it for gpt-fast:
  ```sh
  ./scripts/prepare.sh "meta-llama/Llama-2-7b-chat-hf"
  ```
- Update `expected_gpu` in `run_speedup_benchmark.py` to the expected model of GPU (this avoids accidentally comparing results from different GPUs)
- Run the benchmark:
  ```sh
  python run_speedup_benchmark.py
  ```
SparQ is implemented in PyTorch, not as a custom kernel. However, we found that `torch.compile()` was able to generate a performant implementation.
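As a rough illustration of what the algorithm does, here is a minimal single-query (decode-step) sketch. It is not the code on the `with-sparq` branch; the tensor shapes, default `r`/`k` values, and simplified score scaling are assumptions made for the example. SparQ first ranks cache positions with a cheap approximation built from the largest-magnitude query components, then runs exact attention over only the top-scoring positions:

```python
# Illustrative single-query SparQ-style attention sketch (not the repo's
# implementation). q: (d,) query; K, V: (seq, d) cached keys/values.
import torch

def sparq_attention_sketch(q, K, V, r=32, k=128):
    d = q.shape[-1]

    # 1. Approximate scores from the r largest-magnitude query components,
    #    so only r of the d key components are read for every cached token.
    comp = torch.topk(q.abs(), r).indices
    s_hat = torch.softmax(q[comp] @ K[:, comp].T / d**0.5, dim=-1)

    # 2. Keep the k cache positions with the largest approximate scores.
    top = torch.topk(s_hat, min(k, K.shape[0])).indices

    # 3. Exact attention over the selected keys/values only.
    w = torch.softmax(q @ K[top].T / d**0.5, dim=-1)
    y_top = w @ V[top]

    # 4. Blend with the mean value vector to account for unselected positions,
    #    weighted by the approximate probability mass captured by the top-k.
    alpha = s_hat[top].sum()
    return alpha * y_top + (1 - alpha) * V.mean(dim=0)
```

The memory saving comes from steps 1 and 3: only a slice of each key is read to score the cache, and full keys/values are fetched only for the selected positions.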
This repo is based on gpt-fast, which is released under the BSD 3-Clause license. We also release our modifications under the BSD 3-Clause license. See LICENSE.