---
title: "Efficient Augmented LLM Serving With InferCept"
date: 2024-02-10
draft: false
hideToc: false
tags: ["LLM", "Serving", "Agents"]
---

tags: ["LLM", "Serving", "Augmented LLM"]
summary: "
Today's large language models (LLMs) are being paired with various tools and environments to satisfy increasingly complex user queries. Augmenting models with these capabilities means LLM <ins>**infer**</ins>ence can be inter<ins>**cept**</ins>ed by external actions. We designed [InferCept [ICML '24]](https://arxiv.org/pdf/2402.01869), the first serving framework designed for augmented LLMs. InferCept minimizes resource waste and sustains a **1.6x-2x higher serving load**, completing twice as many requests compared to [state-of-the-art serving systems](https://github.com/vllm-project/vllm). Try InferCept [here](https://github.com/WukLab/InferCept).
"

---
Author: Reyna Abhyankar and Yiying Zhang

**TLDR**: Today's large language models (LLMs) are being paired with various tools and environments to satisfy increasingly complex user queries. Augmenting models with these capabilities means LLM <ins>**infer**</ins>ence can be inter<ins>**cept**</ins>ed by external actions. We designed [InferCept [ICML '24]](https://arxiv.org/pdf/2402.01869), the first serving framework designed for augmented LLMs. InferCept minimizes resource waste and sustains a **1.6x-2x higher serving load**, completing twice as many requests compared to [state-of-the-art serving systems](https://github.com/vllm-project/vllm). Try InferCept [here](https://github.com/WukLab/InferCept).

## LLMs Today Are Augmented with External Tools and Environments

To broaden the range of tasks LLMs can handle, there is a growing trend of augmenting them with external tools and real-time interactions, such as [ChatGPT plugins](https://openai.com/index/chatgpt-plugins/), [non-language models](https://openai.com/index/dall-e-3/), [math tools](https://writings.stephenwolfram.com/2023/03/chatgpt-gets-its-wolfram-superpowers/), and [virtual environments](https://alfworld.github.io/). With fine-tuning or prompt demonstrations, LLMs can generate output that triggers an appropriate augmentation. When that happens, LLM output generation is paused. We refer to all such non-LLM usages as “**interceptions**,” as they essentially intercept normal LLM generation.

![aug-llm-infer](/images/infercept/aug-llm-inference-xl.gif)

The workflow of LLMs with interception, as shown in the figure above, is as follows:
1. The LLM generates tokens that trigger a particular tool, environment, or another model (an “**augmentation**”).
There are three potential techniques for dealing with interceptions; a short code sketch after the figures below summarizes what each one does to a paused request's KV cache.

1. **Discard**. This is the approach today's inference systems take, as described above.

![discard](/images/infercept/discard-xl.gif)

2. **Preserve**. The token states are kept in GPU memory while the interception runs, so the request can resume immediately, without any recomputation, once it finishes. However, the preserved memory is unusable by other requests for the duration of the interception.

![preserve](/images/infercept/preserve-xl.gif)

3. **Swap**. We can swap token states out to CPU memory, such as in [offloading-based systems](https://github.com/FMInference/FlexGen). This alleviates the need for recomputation and frees up memory on the GPU, but those token states must be swapped in when the interception finishes. Swapping can stall other running requests because the amount of data being swapped often greatly exceeds the limited CPU-GPU bandwidth.

![swap](/images/infercept/swap-xl.gif)
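To make the trade-offs concrete, here is a minimal PyTorch-flavored sketch of what each technique does to a paused request's KV cache. The tensor layout and function name are illustrative assumptions, not InferCept's or vLLM's actual interfaces.

```python
import torch

def handle_interception(kv_cache: torch.Tensor, technique: str):
    """Apply one of the three interception-handling techniques to a paused
    request's KV cache (a hypothetical per-request tensor)."""
    if technique == "discard":
        # Drop the reference so the GPU memory can be reclaimed now; the whole
        # context must be recomputed (a fresh prefill) when the interception ends.
        return None
    if technique == "preserve":
        # Keep the tensors resident on the GPU: resumption is instant, but this
        # memory is unavailable to other requests until the interception ends.
        return kv_cache
    if technique == "swap":
        # Move the tensors to CPU memory, freeing the GPU copy; they must be
        # copied back over the CPU-GPU bus before the request can resume.
        return kv_cache.to("cpu")
    raise ValueError(f"unknown technique: {technique}")
```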


## Introducing InferCept
The key idea behind InferCept is to minimize the memory waste caused by interceptions.
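One plausible way to build intuition for the waste terms is as GPU memory-time products: the bytes a paused request ties up (or forces the GPU to rebuild) multiplied by how long. The expressions below are illustrative assumptions only; the paper gives the exact definitions InferCept uses.

```python
def waste_preserve(kv_bytes: int, interception_seconds: float) -> float:
    # KV memory sits idle on the GPU for the entire interception.
    return kv_bytes * interception_seconds

def waste_discard(kv_bytes: int, recompute_seconds: float) -> float:
    # Memory is freed immediately, but GPU time is later spent rebuilding it.
    return kv_bytes * recompute_seconds

def waste_swap(kv_bytes: int, swap_out_seconds: float, swap_in_seconds: float) -> float:
    # Only transfer time that cannot be overlapped with computation counts.
    return kv_bytes * (swap_out_seconds + swap_in_seconds)
```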

For **WasteDiscard** and **WasteSwap**, we pipeline the recomputation and the swapping out and in to increase throughput. For the former, we [chunk](https://arxiv.org/abs/2308.16369) the context sequence into multiple segments and recompute one segment per iteration, so that each iteration's GPU compute is fully utilized but never exceeded. As a result, no other running requests are stalled by recomputation.

![min-waste-discard](../../static/images/infercept/min-waste-discard-xl.gif)
![min-waste-discard](/images/infercept/min-waste-discard-xl.gif)
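As a rough illustration, here is a sketch of chunked recomputation under an assumed per-iteration token budget; the function and the numbers are hypothetical, not InferCept's scheduler code.

```python
def recompute_in_chunks(discarded_tokens: list[int], tokens_per_iteration: int):
    """Yield one segment of the discarded context per scheduler iteration,
    so recomputation fills spare compute without stalling running requests."""
    for start in range(0, len(discarded_tokens), tokens_per_iteration):
        yield discarded_tokens[start:start + tokens_per_iteration]

# Example: a 7,000-token context recomputed 512 tokens at a time is spread
# over 14 iterations instead of one long, blocking prefill.
chunks = list(recompute_in_chunks(list(range(7000)), 512))
assert len(chunks) == 14
```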

For swapping, we overlap all data communication with computation. By profiling the model and the CPU-GPU bus bandwidth, we identify a _swapping budget_: the number of tokens that can be swapped without incurring extra latency. As long as we stay within this budget, we completely eliminate **all waste from swapping**. Because this profiling happens offline, computing **WasteDiscard** and **WasteSwap** is cheap and adds no overhead during scheduling.

![min-waste-swap](../../static/images/infercept/min-waste-swap-xl.gif)
![min-waste-swap](/images/infercept/min-waste-swap-xl.gif)
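The budget itself can be estimated from profiled numbers. Below is a back-of-the-envelope sketch, assuming the budget is expressed per iteration and using hypothetical hardware figures.

```python
def swap_budget_tokens(iteration_seconds: float,
                       bus_bandwidth_bytes_per_s: float,
                       kv_bytes_per_token: int) -> int:
    """Tokens whose KV cache can cross the CPU-GPU bus during one iteration,
    so the transfer hides entirely behind GPU computation."""
    return int(iteration_seconds * bus_bandwidth_bytes_per_s // kv_bytes_per_token)

# Example: a 25 ms iteration, 16 GB/s effective bus bandwidth, and ~800 KB of
# KV cache per token (roughly a 13B model in fp16) allow about 500 tokens.
print(swap_budget_tokens(0.025, 16e9, 800_000))  # -> 500
```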


### Scheduling requests to minimize GPU memory waste

In each iteration, we sort all intercepted requests in descending order of their memory waste, computed as the minimum of **WasteDiscard** and **WastePreserve**. We swap out the KV context of these requests in this order until the swap-out budget is exhausted. For each remaining paused request, we discard its KV context if its **WasteDiscard** is smaller than its **WastePreserve**, and preserve it otherwise.

![scheduling](/images/infercept/scheduling-xl.gif)

We maintain three queues: a running queue for all active requests, a waiting queue for all unserved and discarded requests, and a swapped queue for all requests whose context resides in CPU memory. We follow FCFS scheduling based on each request's original arrival time to ensure fairness.
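The decision logic can be summarized in a short sketch. The data structures and field names below are assumptions for illustration, not InferCept's actual implementation.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class PausedRequest:
    arrival_time: float
    context_tokens: int      # size of the KV context
    waste_discard: float     # estimated waste if we discard and recompute
    waste_preserve: float    # estimated waste if we keep the KV cache on GPU

def plan_interceptions(paused: list[PausedRequest], swap_budget_tokens: int) -> dict:
    """Swap out the most wasteful requests up to the budget; for the rest,
    choose the cheaper of discarding or preserving their KV context."""
    plan, leftover = {}, []
    # Most wasteful first, where waste = min(WasteDiscard, WastePreserve).
    for req in sorted(paused,
                      key=lambda r: min(r.waste_discard, r.waste_preserve),
                      reverse=True):
        if req.context_tokens <= swap_budget_tokens:
            plan[id(req)] = "swap"
            swap_budget_tokens -= req.context_tokens
        else:
            leftover.append(req)
    for req in leftover:
        plan[id(req)] = "discard" if req.waste_discard < req.waste_preserve else "preserve"
    return plan

# Requests then move between three FCFS queues ordered by original arrival time.
running, waiting, swapped = deque(), deque(), deque()
```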

We compare against four baselines:

For our evaluation, we compose a dataset covering the six augmentations we studied.

![results](/images/infercept/results.jpg)

InferCept sustains **1.6x-2x** higher request arrival rates at the same low latency as vLLM, while completing **2x** more requests per second. It also has **1.9x-5.7x** lower normalized latency per output token. These findings hold for larger models and for distributed inference, where we see up to **12x** lower latency.
