Update introduction.md
reyna-abhyankar authored May 2, 2024
1 parent 70ffde5 commit 7a49675
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion content/posts/introduction.md
@@ -12,4 +12,4 @@ Second, an LLM can often be called multiple times in a sequence to carry out a c
However, today's LLM serving systems are designed for standalone LLMs. They treat any interceptions to an LLM as starting a new request, causing unnecessary recomputation of already computed contexts.
This recomputation accounts for 37-40% of total model forwarding time in typical settings.

-We built APIServe, the first LLM serving framework targeting augmented LLMs. APIServe minimizes the GPU resource waste caused by any interception to an LLM and dedicates saved memory for serving more requests. APIServe improves the overall serving throughput by 1.6x and completes 2x more requests per second compared to the state-of-the-art LLM serving systems.
+We built APIServe (ICML '24), the first LLM serving framework targeting augmented LLMs. APIServe minimizes the GPU resource waste caused by any interception to an LLM and dedicates saved memory for serving more requests. APIServe improves the overall serving throughput by 1.6x and completes 2x more requests per second compared to the state-of-the-art LLM serving systems.
