
Commit

Update Tags for scheduling
vikranth22446 committed Sep 13, 2024
1 parent 86b6e6d commit 718bdcc
Showing 2 changed files with 2 additions and 2 deletions.
content/posts/preble.md (1 addition, 1 deletion)
@@ -3,7 +3,7 @@ title: "Preble: Efficient Prompt Scheduling for Augmented Large Language Models"
date: 2024-05-07
draft: false
hideToc: false
tags: ["LLM", "Serving", "Load Balancing", "Prompt Oriented Scheduling"]
tags: ["LLM", "Serving", "Load Balancing", "Scheduling"]
truncated: false
summary: "
LLM prompts are growing more complex and longer with [agents](https://arxiv.org/abs/2308.11432), [tool use](https://platform.openai.com/docs/guides/function-calling), [large documents](https://arxiv.org/html/2404.07143v1), [video clips](https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024/#context-window), and detailed [few-shot examples](https://arxiv.org/pdf/2210.03629). These prompts often have content that is shared across many requests. The computed intermediate state (KV cache) from one prompt can be reused by another for their shared parts to improve request handling performance and save GPU computation resources. However, current distributed LLM serving systems treat each request as independent and miss the opportunity to reuse the computed intermediate state.
content/posts/scheduling_overhead.md (1 addition, 1 deletion)
@@ -3,7 +3,7 @@ title: "Can Scheduling Overhead Dominate LLM Inference Performance? A Study of C
date: 2024-09-10
draft: false
hideToc: false
tags: ["LLM", "Serving", "Iterative Scheduling"]
tags: ["LLM", "Serving", "Scheduling"]
truncated: false
summary: "Today’s LLM serving systems like [vLLM](https://github.com/vllm-project/vllm) and [TGI](https://huggingface.co/docs/text-generation-inference/en/index) primarily use a scheduling approach called iterative scheduling (or continuous batching), which decides the batch composition at every round (or every few rounds) of model forwarding. Different from prior serving systems that schedule the next batch after the entire current batch finishes, iterative scheduling promises to improve GPU utilization and LLM serving rate, but with a key assumption: the scheduling overhead can be ignored. While this assumption generally held in the past, it is worth reexamination as today’s LLM [inference kernels](https://flashinfer.ai/) run much faster than before and as more scheduling tasks and considerations get added.
