add doc for h2o

Signed-off-by: n1ck-guo <[email protected]>
intel · Jul 15, 2024 · 3723158 · 3723158
1 parent d241c25
commit 3723158

Showing 3 changed files with 49 additions and 0 deletions.
diff --git a/docs/h2o.md b/docs/h2o.md
@@ -0,0 +1,49 @@
+# H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
+1. [Introduction](#introduction)
+2. [Usage](#usage)
+
+## Introduction
+**Heavy-Hitter Oracle (H2O)** is a novel approach to managing the KV cache that significantly reduces its memory footprint.
+
+This method is based on the observation that the accumulated attention scores of all tokens in attention blocks follow a power-law distribution. This suggests that a small set of influential tokens, named heavy-hitters (H2), is critical during generation. H2 provides an opportunity to step away from the combinatorial search problem and identify an eviction policy that maintains accuracy.
+
+H2O dynamically retains a balance of recent tokens and H2 tokens, which significantly increases model throughput while preserving accuracy.
+
+
+For more information, please refer to the paper [H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models](https://arxiv.org/pdf/2306.14048).
+
+
+![](./imgs/h2o.png)
+
+
+## Usage
+To use simulation mode (`real_drop=False`):
+```python
+from intel_extension_for_transformers.transformers.kv_cache_compression import H2OConfig, LlamaForCausalLM
+h2o_config = H2OConfig(
+    heavy_ratio=heavy_ratio,
+    recent_ratio=recent_ratio,
+    h2o_min_seqlen=h2o_min_seqlen,
+    real_drop=False,
+)
+user_model = LlamaForCausalLM.from_pretrained(
+    args.model,
+    prune_config=h2o_config,
+    trust_remote_code=args.trust_remote_code)
+```
+To use real-drop mode (`real_drop=True`):
+```python
+from intel_extension_for_transformers.transformers.kv_cache_compression import H2OConfig, LlamaForCausalLM
+h2o_config = H2OConfig(
+    heavy_ratio=heavy_ratio,
+    recent_ratio=recent_ratio,
+    h2o_min_seqlen=h2o_min_seqlen,
+    real_drop=True,
+)
+user_model = LlamaForCausalLM.from_pretrained(
+    args.model,
+    prune_config=h2o_config,
+    trust_remote_code=args.trust_remote_code)
+```
+
+Please refer to the [H2O example](../examples/huggingface/pytorch/text-generation/h2o/run_generation.py) for details.
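
The eviction policy described in the Introduction of `docs/h2o.md` (keep the most recent tokens plus the tokens with the largest accumulated attention scores) can be illustrated with a small, framework-free sketch. This is not the code added by this commit: the function names `accumulate_scores` and `h2o_keep_indices` are hypothetical, and the sketch assumes that `heavy_ratio` and `recent_ratio` are converted to budgets by multiplying with the current sequence length. A real implementation would work per attention head on tensors and update the scores incrementally; `h2o_min_seqlen` is not modelled here.

```python
from typing import List, Sequence


def accumulate_scores(attention_rows: Sequence[Sequence[float]]) -> List[float]:
    """Sum the attention weight each key position received over all query steps.

    With causal attention, row t holds t + 1 weights, so rows may differ in length.
    """
    seq_len = len(attention_rows[-1])
    totals = [0.0] * seq_len
    for row in attention_rows:
        for pos, weight in enumerate(row):
            totals[pos] += weight
    return totals


def h2o_keep_indices(
    accumulated_scores: Sequence[float],
    heavy_ratio: float,
    recent_ratio: float,
) -> List[int]:
    """Return the sorted cache positions kept by an H2O-style eviction policy."""
    seq_len = len(accumulated_scores)
    # Assumption: each budget is its ratio times the current sequence length.
    heavy_budget = int(heavy_ratio * seq_len)
    recent_budget = int(recent_ratio * seq_len)

    # The most recent tokens are always kept.
    recent_start = max(seq_len - recent_budget, 0)
    recent = list(range(recent_start, seq_len))

    # Among the older tokens, keep those with the largest accumulated scores.
    older = range(recent_start)
    heavy = sorted(older, key=lambda i: accumulated_scores[i], reverse=True)[:heavy_budget]

    return sorted(heavy + recent)
```

For example, with eight cached tokens and both ratios set to 0.25, the two highest-scoring older tokens and the two newest tokens survive:

```python
scores = [5.0, 0.1, 3.0, 0.2, 0.3, 0.4, 0.5, 0.6]
print(h2o_keep_indices(scores, heavy_ratio=0.25, recent_ratio=0.25))  # [0, 2, 6, 7]
```
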
diff --git a/docs/imgs/h2o.png b/docs/imgs/h2o.png
diff --git a/examples/huggingface/pytorch/text-generation/h2o/imgs/1.png b/examples/huggingface/pytorch/text-generation/h2o/imgs/1.png