Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Misc] Usage Doc for Prefill-decoding Disaggregation #71

Merged
merged 6 commits into from
Dec 5, 2024
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
25 changes: 25 additions & 0 deletions docs/Prefill-decoding_Disaggregation.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
# Prefill-decoding Disaggregation
zhypku marked this conversation as resolved.
Show resolved Hide resolved
Existing large language model (LLM) serving systems that support prefill-decoding disaggregation typically compute the prefill and decoding phases on separate instances. For each request, following the prefill phase, the system migrates the generated key-value (KV) cache to the decoding instance and continues the computation.


Llumnix can manage the scheduling of runtime requests across instances. It can perceive the execution process of requests and the state of the KV cache. Llumnix provides a series of scheduling semantics to support the initial dispatch and runtime rescheduling (request migration across instances).


Therefore, Llumnix is inherently well-suited for the "special" cross-instance scheduling requirements in prefill-decoding disaggregation. Specifically, it limits the initial **dispatch only to the prefill instances** and the **migration between prefill and decoding instances**.

## Supported Features
1. Requests can be **automatically migrated** from prefill instance to decoding instances.

2. Users can specify the number of prefill instances.

3. Llumnix supports both **one-to-many and many-to-one migrations** from prefill to decoding instances.
zhypku marked this conversation as resolved.
Show resolved Hide resolved

4. Decoding instances can migrate requests among themselves based on different scheduling strategies (e.g. load-balance).

zhypku marked this conversation as resolved.
Show resolved Hide resolved

## How to use
Llumnix only uses two arguments to enable prefill-decoding disaggregation for simplicity.
- `--enable-pd-disagg True` is used to enable prefill-decoding disaggregation.
- `--num-available-dispatch-instances` is used to configure the number of prefill instances.

Note that `num-available-dispatch-instances` < `initial_instance-num` especially when `--enable-scaling` is not set, as it determines the number of decoding instances.
KuilongCui marked this conversation as resolved.
Show resolved Hide resolved