Deploying long-context LLMs is costly due to the linear growth of the key-value (KV) cache in transformer models. For example, handling 1M tokens with Llama 3.1-70B in float16 requires up to 330GB of memory. kvpress implements multiple KV cache compression methods and benchmarks using 🤗 transformers, aiming to simplify the development of new methods for researchers and developers in this field.
```bash
pip install kvpress
```
If possible, install flash attention:
```bash
pip install flash-attn --no-build-isolation
```
kvpress provides a set of "presses" that compress the KV cache during the prefilling phase. Each press is associated with a `compression_ratio` attribute that measures the compression of the cache. The easiest way to use a press is through our custom `KVPressTextGenerationPipeline`. It is automatically registered as a transformers pipeline with the name "kv-press-text-generation" when kvpress is imported and handles chat templates and tokenization for you:
```python
from transformers import pipeline
from kvpress import ExpectedAttentionPress

device = "cuda:0"
model = "meta-llama/Llama-3.1-8B-Instruct"
model_kwargs = {"attn_implementation": "flash_attention_2"}
pipe = pipeline("kv-press-text-generation", model=model, device=device, model_kwargs=model_kwargs)

context = "A very long text you want to compress once and for all"
question = "\nA question about the compressed context"  # optional

press = ExpectedAttentionPress(compression_ratio=0.5)
answer = pipe(context, question=question, press=press)["answer"]
```
In the snippet above, the compression is only applied to the context tokens, so that you can evaluate the compression for different questions. Check the Wikipedia notebook demo for a more detailed example (also available on Colab here).
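For instance, here is a quick sketch that reuses `pipe`, `context`, and `question` from the snippet above to compare answers across compression ratios:

```python
from kvpress import ExpectedAttentionPress

# Compare answers as the cache is compressed more and more aggressively
for ratio in (0.25, 0.5, 0.75):
    press = ExpectedAttentionPress(compression_ratio=ratio)
    answer = pipe(context, question=question, press=press)["answer"]
    print(f"compression_ratio={ratio}: {answer}")
```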
Important
We focus on compression during the pre-filling phase as the KV cache becomes a bottleneck for long-context sequences (100k - 1M tokens), which are essentially long context prompts. This would typically apply to improving prompt caching systems.
Note
Use `model_kwargs={"attn_implementation":"flash_attention_2"}` to enable flash attention. To use the press `ObservedAttentionPress`, you need to specify `model_kwargs={"attn_implementation":"eager"}` instead, as this press requires materializing the attention weights.
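For example, a minimal sketch of the eager configuration needed by `ObservedAttentionPress` (same pipeline call as above, only `model_kwargs` changes):

```python
from transformers import pipeline
from kvpress import ObservedAttentionPress

# ObservedAttentionPress averages the attention weights observed during pre-filling,
# so the attention matrix must be materialized: use the eager implementation
pipe = pipeline(
    "kv-press-text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",
    device="cuda:0",
    model_kwargs={"attn_implementation": "eager"},
)

context = "A very long text you want to compress once and for all"
question = "\nA question about the compressed context"
answer = pipe(context, question=question, press=ObservedAttentionPress(compression_ratio=0.5))["answer"]
```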
We welcome contributions! To add a new press, simply open an issue or submit a pull request. Check the new_press.ipynb notebook for a step-by-step guide.
All current presses are training-free and inherit from `BasePress` (source).
Several presses inherit from `ScorerPress` (source) and rely on a score to prune the KV pairs with the lowest importance (see the custom-scorer sketch after this list):
- `RandomPress` (source): random score
- `KnormPress` (source, paper): inverse norm of the key
- `SnapKVPress` (source, paper): average attention weight of the last queries
- `ExpectedAttentionPress` (source, notebook): expected attention weight during the generation phase
- `StreamingLLMPress` (source, paper): keep only the initial and recent tokens
- `TOVAPress` (source, paper): attention weight of the last query averaged across heads
- `ObservedAttentionPress` (source, paper): average attention weight observed during the pre-filling phase
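As referenced above, a new scorer press only needs to produce a per-token importance score. The sketch below follows this `ScorerPress` pattern; the class name is hypothetical and the exact `score` signature is an assumption, so check new_press.ipynb for the authoritative interface:

```python
import torch
from dataclasses import dataclass

from kvpress import ScorerPress


@dataclass
class ValueNormPress(ScorerPress):
    """Hypothetical press scoring each KV pair by the norm of its value vector."""

    def score(self, module, hidden_states, keys, values, attentions, kwargs) -> torch.Tensor:
        # values has shape (batch, num_kv_heads, seq_len, head_dim);
        # return one score per KV pair, higher = more important
        return values.norm(dim=-1)
```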
Some presses rely on a different logic:
- `ThinKPress` (source, paper): compress the dimensions of the keys based on the channel attention score on the last queries
- `SimLayerKVPress` (source, paper): identify "lazy" layers, and apply the StreamingLLM approach to them
Finally, we provide wrapper presses that can be combined with other presses (a usage sketch follows this list):
- `AdaKVPress` (source, paper): prune the lowest scores of any `ScorerPress`, but across all heads, achieving head-wise compression
- `PerLayerCompressionPress` (source): compress each layer with a different compression ratio (experimental)
- `ComposedPress` (source): compose multiple presses together by chaining their forward hooks
- `KeyRerotationPress` (source): rerotate pruned keys to have continuous RoPE embeddings
- `ChunkPress` (source, paper): compress the KV cache on each sequence chunk separately. This can yield more uniform compression across long sequences
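Here is a short sketch of how these wrappers are typically combined with other presses; the constructor arguments shown are assumptions and may differ from the actual signatures:

```python
from kvpress import AdaKVPress, ComposedPress, ExpectedAttentionPress, SnapKVPress

# Head-wise compression: wrap a ScorerPress so its pruning budget is shared across heads
# (passing the ScorerPress positionally is an assumption about the constructor)
adakv_press = AdaKVPress(ExpectedAttentionPress(compression_ratio=0.5))

# Chain the forward hooks of several presses (purely illustrative combination)
composed_press = ComposedPress([SnapKVPress(compression_ratio=0.25), ExpectedAttentionPress(compression_ratio=0.25)])

# Either wrapper is then used exactly like a plain press:
# pipe(context, question=question, press=adakv_press)
```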
For a detailed list of existing KV cache compression methods, check Awesome-KV-Cache-Compression or Awesome-LLM-Compression.
The speed_and_memory.ipynb notebook can help you to measure peak memory usage and total time gain.
We provide a simple CLI to evaluate the performance of the different presses on several long-context datasets. Below we report the average performance on the RULER dataset with 4k context length for different presses.
Please refer to the evaluation directory for more details and results.
We support KV cache quantization through the transformers `QuantizedCache` class (see the HF blog post). To use it, simply pass a cache object to your pipeline:
```python
from transformers import QuantizedCacheConfig, QuantoQuantizedCache

config = QuantizedCacheConfig(nbits=4)
cache = QuantoQuantizedCache(config)

pipe(..., cache=cache)
```
By default, the `DynamicCache` is used (no quantization).
Important
To use the `QuantizedCache`, you need to install additional dependencies (e.g. `pip install optimum-quanto`).
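As a quick sketch combining the two snippets above (reusing `pipe`, `context`, and `question` from the first example; passing both `press` and `cache` in a single call is an assumption based on the arguments shown so far):

```python
from transformers import QuantizedCacheConfig, QuantoQuantizedCache
from kvpress import ExpectedAttentionPress

# Prune 50% of the KV pairs with a press and quantize the remaining entries to 4 bits
press = ExpectedAttentionPress(compression_ratio=0.5)
cache = QuantoQuantizedCache(QuantizedCacheConfig(nbits=4))

answer = pipe(context, question=question, press=press, cache=cache)["answer"]
```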
Some presses depend on the model architecture (e.g. `ExpectedAttentionPress` or `SnapKVPress`), hence they might not work with all models. We tested support for `LlamaForCausalLM`, `MistralForCausalLM`, `Phi3ForCausalLM`, and `Qwen2ForCausalLM`, but many other models might be supported out of the box because their implementation is often similar in transformers.
Memory usage should be reduced by around `compression_ratio * kv_cache_size`. As the KV cache is smaller, decoding should also be faster. You can measure peak memory usage gain and total time gain using this notebook.
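As a back-of-the-envelope check of the expected savings (the Llama 3.1-70B configuration values below are assumptions taken from its public config, consistent with the ~330GB figure quoted at the top):

```python
# Rough KV cache size for Llama 3.1-70B in float16
# (assumed config: 80 layers, 8 KV heads per layer, head_dim 128)
num_layers, num_kv_heads, head_dim = 80, 8, 128
bytes_per_scalar = 2  # float16
num_tokens = 1_000_000

# 2x for keys and values
kv_cache_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_scalar * num_tokens
print(f"KV cache: {kv_cache_bytes / 1e9:.0f} GB")  # ~328 GB, matching the ~330GB quoted above

compression_ratio = 0.5
print(f"Expected saving: {compression_ratio * kv_cache_bytes / 1e9:.0f} GB")  # ~164 GB
```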
A press registers a forward hook (`press.forward_hook` method) to each attention layer during the pre-filling phase. Registration can be applied using the press as a context manager (`press.__call__` method):
```python
import torch
from transformers import AutoModelForCausalLM
from kvpress import KnormPress

device = "cuda:0"
ckpt = "meta-llama/Meta-Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(ckpt).to(device)
press = KnormPress(compression_ratio=0.4)

inputs = model.dummy_inputs["input_ids"].to(device)

# Without the press, the KV cache keeps every token
with torch.no_grad():
    print(model(inputs).past_key_values[0][0].shape)
    # torch.Size([3, 8, 5, 128])  (batch, num_kv_heads, seq_len, head_dim)

# With the press, 40% of the KV pairs are pruned during pre-filling
with torch.no_grad(), press(model):
    print(model(inputs).past_key_values[0][0].shape)
    # torch.Size([3, 8, 3, 128])
```
In fact, you can use `model.generate` with a press by using the press as a context manager:
```python
with press(model):
    outputs = model.generate(inputs)
```
However, the `generate` method does not allow excluding the question from the compression, which would artificially favor methods such as SnapKV. Ideally, we want a compression method that works whatever comes after the context (e.g. for use cases such as chat or document question answering). Finally, the `generate` method does not allow generating answers for multiple questions at once.