From ca285fc9b4c94dd7b375ee1ce008726085dab9fa Mon Sep 17 00:00:00 2001 From: atgctg <105969161+atgctg@users.noreply.github.com> Date: Wed, 22 Jan 2025 14:37:35 +0100 Subject: [PATCH] Fix typo --- blog/2024-01-17-sglang.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/blog/2024-01-17-sglang.md b/blog/2024-01-17-sglang.md index 34bdbcc1..c5f5a15d 100644 --- a/blog/2024-01-17-sglang.md +++ b/blog/2024-01-17-sglang.md @@ -37,7 +37,7 @@ While some systems are capable of handling KV cache reuse in certain scenarios, To systematically exploit these reuse opportunities, we introduce RadixAttention, a novel technique for automatic KV cache reuse during runtime. Instead of discarding the KV cache after finishing a generation request, our approach retains the KV cache for both prompts and generation results in a radix tree. This data structure enables efficient prefix search, insertion, and eviction. We implement a Least Recently Used (LRU) eviction policy, complemented by a cache-aware scheduling policy, to enhance the cache hit rate. -A radix tree is a data structure that serves as a space-efficient alternative to a trie (prefix tree). Unlike typical trees, the edges of a radix tree can be labeled with not just single elements, but also with sequences of elements of varying lengths. This feature boosts the efficiency of radix trees. In our system, we utilize a radix tree to manage a mapping. This mapping is between sequences of tokens, which act as the keys, and their corresponding KV cache tensors, which serve as the values. These KV cache tensors are stored on the GPU in a paged layout, where the size of each page is equivalent to one token. Considering the limited capacity of GPU memory, we cannot retrain infinite KV cache tensors, which necessitates an eviction policy. To tackle this, we implement an LRU eviction policy that recursively evicts leaf nodes. +A radix tree is a data structure that serves as a space-efficient alternative to a trie (prefix tree). Unlike typical trees, the edges of a radix tree can be labeled with not just single elements, but also with sequences of elements of varying lengths. This feature boosts the efficiency of radix trees. In our system, we utilize a radix tree to manage a mapping. This mapping is between sequences of tokens, which act as the keys, and their corresponding KV cache tensors, which serve as the values. These KV cache tensors are stored on the GPU in a paged layout, where the size of each page is equivalent to one token. Considering the limited capacity of GPU memory, we cannot retain infinite KV cache tensors, which necessitates an eviction policy. To tackle this, we implement an LRU eviction policy that recursively evicts leaf nodes. Furthermore, RadixAttention is compatible with existing techniques like continuous batching and paged attention. For multi-modal models, the RadixAttention can be easily extended to handle image tokens.