This repository has been archived by the owner on Oct 25, 2024. It is now read-only.

add doc for h2o
Signed-off-by: n1ck-guo <[email protected]>
n1ck-guo committed Jul 15, 2024
1 parent d241c25 commit 3723158
Showing 3 changed files with 49 additions and 0 deletions.
49 changes: 49 additions & 0 deletions docs/h2o.md
@@ -0,0 +1,49 @@
# H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
1. [Introduction](#introduction)
2. [Usage](#usage)

## Introduction
**Heavy-Hitter Oracle (H2O)** is a novel approach to implementing the KV cache that significantly reduces its memory footprint.

This method is based on the observation that the accumulated attention scores of all tokens in attention blocks follow a power-law distribution. This suggests that a small set of influential tokens, named heavy-hitters (H2), is critical during generation. H2 provides an opportunity to step away from the combinatorial search problem and identify an eviction policy that maintains accuracy.
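As an illustration of what "accumulated attention score" means, here is a minimal sketch in plain Python. It is not the library's implementation: the function names `accumulated_attention` and `top_heavy_hitters` and the toy attention matrix are hypothetical, chosen only to show how a few tokens can collect most of the attention mass.

```python
def accumulated_attention(attn):
    """Sum each column of a (query x key) attention matrix.

    attn[i][j] is the weight query i places on key j, so the column sum is
    the total attention that token j has received so far.
    """
    num_keys = len(attn[0])
    return [sum(row[j] for row in attn) for j in range(num_keys)]


def top_heavy_hitters(scores, k):
    """Return the indices of the k largest accumulated scores, in position order."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Toy causal attention matrix (hypothetical values): 4 queries over 4 keys,
# each row sums to 1 and attends only to current or earlier positions.
attn = [
    [1.0, 0.0, 0.0, 0.0],
    [0.7, 0.3, 0.0, 0.0],
    [0.6, 0.1, 0.3, 0.0],
    [0.5, 0.1, 0.1, 0.3],
]
scores = accumulated_attention(attn)
heavy = top_heavy_hitters(scores, k=1)
```

In this toy example the first token accumulates far more attention than the others, so it is the one a heavy-hitter policy would want to keep.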

H2O dynamically maintains a balance between recent tokens and H2 tokens in the cache, which significantly increases model throughput while preserving accuracy.
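The balance between recent and H2 tokens can be sketched as a selection over cached positions. The following is a simplified, one-shot illustration under stated assumptions, not the library's code: `h2o_keep_indices`, `heavy_budget`, and `recent_budget` are hypothetical names, and the real method updates scores and evicts incrementally during generation.

```python
def h2o_keep_indices(scores, heavy_budget, recent_budget):
    """Choose which cached positions to keep.

    scores: accumulated attention score per cached token, oldest first.
    heavy_budget: number of heavy-hitter tokens kept from the older part.
    recent_budget: number of most recent tokens that are always kept.
    """
    n = len(scores)
    # Nothing to evict while the cache still fits within the total budget.
    if n <= heavy_budget + recent_budget:
        return list(range(n))
    recent_start = n - recent_budget
    # Rank only the older tokens; the recent window is kept unconditionally.
    older = range(recent_start)
    heavy = sorted(older, key=lambda j: scores[j], reverse=True)[:heavy_budget]
    return sorted(heavy) + list(range(recent_start, n))


# Hypothetical accumulated scores for six cached tokens.
example_scores = [2.8, 0.5, 0.4, 0.9, 0.2, 0.3]
kept = h2o_keep_indices(example_scores, heavy_budget=2, recent_budget=2)
```

Here positions 0 and 3 are kept as heavy-hitters and positions 4 and 5 are kept as the recent window; positions 1 and 2 would be evicted.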


For more info, please refer to the paper [H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models](https://arxiv.org/pdf/2306.14048).


![](./imgs/h2o.png)


## Usage
To run in simulation mode (`real_drop=False`):
```python
from intel_extension_for_transformers.transformers.kv_cache_compression import H2OConfig, LlamaForCausalLM

# heavy_ratio, recent_ratio, h2o_min_seqlen and args are placeholders supplied by the caller.
h2o_config = H2OConfig(
    heavy_ratio=heavy_ratio,
    recent_ratio=recent_ratio,
    h2o_min_seqlen=h2o_min_seqlen,
    real_drop=False,  # simulation mode
)
user_model = LlamaForCausalLM.from_pretrained(
    args.model,
    prune_config=h2o_config,
    trust_remote_code=args.trust_remote_code,
)
```
To run in real_drop mode (`real_drop=True`):
```python
from intel_extension_for_transformers.transformers.kv_cache_compression import H2OConfig, LlamaForCausalLM

# heavy_ratio, recent_ratio, h2o_min_seqlen and args are placeholders supplied by the caller.
h2o_config = H2OConfig(
    heavy_ratio=heavy_ratio,
    recent_ratio=recent_ratio,
    h2o_min_seqlen=h2o_min_seqlen,
    real_drop=True,  # real_drop mode
)
user_model = LlamaForCausalLM.from_pretrained(
    args.model,
    prune_config=h2o_config,
    trust_remote_code=args.trust_remote_code,
)
```

Please refer to [h2o example](../examples/huggingface/pytorch/text-generation/h2o/run_generation.py) for the details.
Binary file added docs/imgs/h2o.png