|
1 | 1 | # Prefill-decoding Disaggregation (Experimental)
|
2 | 2 |
|
3 |
| -Prefill-decoding disaggregation is a technique that computes the prefill and decoding phases on separate instances, designed for reducing the interference between the two phases and better utilizing heterogeneous hardware. For each request, following the prefill phase, the system migrates the generated key-value (KV) cache to the decoding instance and continues the computation. |
| 3 | +Prefill-decoding disaggregation is a technique that computes the prefill and decoding phases on separate instances, designed mainly for reducing the interference between the two phases and better utilizing heterogeneous hardware. For each request, following the prefill phase, the system migrates the generated key-value (KV) cache to the decoding instance and continues the computation. |
4 | 4 |
|
5 |
| -We find Llumnix well-suited for implementing P-D disaggregation, because this technique is inherently a special request scheduling policy and fits well in Llumnix's modeling for request scheduling. Specifically, P-D disaggregation can be decomposed into two rules (shown as below): (1) a special dispatching rule, i.e., P-instances-only; and (2) a special migration rule, i.e., migrate to D instances after one step. Llumnix provides an implementation of P-D disaggregation following this principle. |
| 5 | +We find Llumnix well-suited for implementing P-D disaggregation, because this technique is inherently a special request scheduling policy and fits well in Llumnix's modeling for request scheduling. Specifically, P-D disaggregation can be decomposed into two rules (shown below): (1) a special dispatching rule, i.e., P-instances-only; and (2) a special migration rule, i.e., migrate to D instances after one step. Llumnix provides an implementation of P-D disaggregation following this principle. |
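The two rules can be illustrated with a minimal Python sketch. This is not Llumnix code: the `Instance` class, the `dispatch` and `migration_target` helpers, and the least-loaded selection are all hypothetical, chosen only to make the decomposition concrete.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Instance:
    """Hypothetical instance record; not a Llumnix type."""
    instance_id: str
    role: str      # "prefill" or "decode"
    load: int = 0  # assumed load metric, lower is better


def dispatch(instances: List[Instance]) -> Instance:
    """Rule 1 (dispatching): new requests may only go to prefill instances."""
    candidates = [i for i in instances if i.role == "prefill"]
    if not candidates:
        raise RuntimeError("no prefill instance available")
    # Least-loaded selection is an assumption made for this sketch.
    return min(candidates, key=lambda i: i.load)


def migration_target(instances: List[Instance], steps_done: int) -> Optional[Instance]:
    """Rule 2 (migration): after one step (the prefill), move to a decode instance."""
    if steps_done < 1:
        return None  # prefill has not run yet, so nothing to migrate
    candidates = [i for i in instances if i.role == "decode"]
    if not candidates:
        raise RuntimeError("no decode instance available")
    return min(candidates, key=lambda i: i.load)
```

In this sketch a request is first placed with `dispatch`, and once `steps_done` reaches 1 the scheduler asks `migration_target` where the request and its KV cache should go next.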
6 | 6 |
|
7 | 7 | <div align=center>
|
8 | 8 | <img src="./pdd_1.png" align="center" width=80%/>
|
@@ -35,14 +35,15 @@ Currently P-D disaggregation is an experimental feature, mainly to demonstrate t
|
35 | 35 |
|
36 | 36 | 1. Per-layer KV cache transfer (currently we use a simple blocking transfer);
|
37 | 37 | 2. Explicit or automatic assignment of P/D instances (currently we only allow users to specify the instance numbers, with simple assignment rules);
|
38 |
| -3. Heterogeneous instances, e.g., different device types, sizes, or parallelisms; |
39 |
| -4. Fine tuning of the scheduling policies. |
| 38 | +3. Smarter fault tolerance (currently, due to the simple P/D assignment, if one of the instance types has all of its instances gone, the service will hang; we will implement better P/D assignment and fault tolerance strategies to ensure high availability); |
| 39 | +4. Heterogeneous instances, e.g., different device types, sizes, or parallelisms; |
| 40 | +5. Fine tuning of the scheduling policies. |
40 | 41 |
|
41 | 42 | We are actively working on these items. Stay tuned :)
|
42 | 43 |
|
43 | 44 | ## How to use
|
44 |
| -Llumnix only uses two arguments to enable prefill-decoding disaggregation for simplicity. |
| 45 | +Llumnix uses two simple arguments to enable prefill-decoding disaggregation in the current version. |
45 | 46 | - `--enable-pd-disagg True` is used to enable prefill-decoding disaggregation.
|
46 |
| -- `--num-available-dispatch-instances` is used to configure the number of prefill instances. |
| 47 | +- `--num-available-dispatch-instances` is used to configure the initial number of prefill instances. |
47 | 48 |
|
48 |
| -Note that `num-available-dispatch-instances` < `initial_instance-num` especially when `--enable-scaling` is not set, as it determines the number of decoding instances. |
| 49 | +Note that one should make sure that `num-available-dispatch-instances` is smaller than `initial_instance-num` (especially when `--enable-scaling` is not set), otherwise there would be no instances for decoding. |
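The constraint in the note can be checked before launching. The sketch below is a hypothetical helper, not part of Llumnix; it assumes that every initial instance not counted as a prefill instance serves decoding, and that at least one prefill instance is needed.

```python
from typing import Tuple


def split_instances(initial_instances: int, num_prefill: int) -> Tuple[int, int]:
    """Return (prefill, decode) counts, or raise if the split is unusable.

    Hypothetical helper: `num_prefill` stands for the value passed to
    `--num-available-dispatch-instances`, and `initial_instances` for the
    initial instance count.
    """
    if num_prefill < 1:
        # Assumption: with no prefill instance, no request can be dispatched.
        raise ValueError("at least one prefill instance is required")
    if num_prefill >= initial_instances:
        # Matches the note above: no instances would be left for decoding.
        raise ValueError("num_prefill must be smaller than initial_instances")
    return num_prefill, initial_instances - num_prefill
```

For example, `split_instances(4, 1)` yields one prefill and three decoding instances, while `split_instances(2, 2)` raises because no instance would be left for decoding.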