Commit 1e71ba2

[None][doc] update feature_combination_matrix of disaggregated and chunked prefill
Signed-off-by: leslie-fang25 <[email protected]>
1 parent bff5fdf commit 1e71ba2

3 files changed: 42 additions, 4 deletions
docs/source/torch/features/feature_combination_matrix.md

Lines changed: 4 additions & 4 deletions
@@ -6,13 +6,13 @@
 | CUDA Graph | Yes | --- | | | | | | | | | | | | |
 | Attention Data Parallelism | Yes | Yes | --- | | | | | | | | | | | |
 | Disaggregated Serving | Yes | Yes | Yes | --- | | | | | | | | | | |
-| Chunked Prefill | Yes | Yes | Yes | Untested | --- | | | | | | | | | |
+| Chunked Prefill | Yes | Yes | Yes | Yes | --- | | | | | | | | | |
 | MTP | Yes | Yes | Yes | Yes | Yes | --- | | | | | | | | |
 | EAGLE-3(One Model Engine) | Yes | Yes | Yes | Yes | Yes | No | --- | | | | | | | |
-| EAGLE-3(Two Model Engine) | NO | Yes | Yes | Yes | Yes | No | No | --- | | | | | | |
+| EAGLE-3(Two Model Engine) | No | Yes | Yes | Yes | Yes | No | No | --- | | | | | | |
 | Torch Sampler | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | --- | | | | | |
 | TLLM C++ Sampler | Yes | Yes | Yes | Yes | Yes | No | No | No | No | --- | | | | |
-| KV Cache Reuse | Yes | Yes | Yes | Untested | Yes | Untested | Yes | No | Yes | Yes | --- | | | |
-| Slide Window Attention | Yes | Yes | Yes | Untested | No | Untested | Untested | Untested | Yes | Yes | WIP | --- | | |
+| KV Cache Reuse | Yes | Yes | Yes | Yes | Yes | Untested | Yes | No | Yes | Yes | --- | | | |
+| Slide Window Attention | Yes | Yes | Yes | Yes | No | Untested | Untested | Untested | Yes | Yes | WIP | --- | | |
 | Logits Post Processor | No | Yes | Yes | No | Yes | No | No | No | Yes | Yes | Yes | Yes | --- | |
 | Guided Decoding | Yes | Yes | Yes | Yes | Yes | No | No | Yes | Yes | Yes | Yes | Yes | Yes | --- |
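
The matrix is lower-triangular, so the status of a feature pair lives in the row of whichever feature appears later. The sketch below shows one way to look up a pair from the updated rows. It assumes the column order mirrors the row order (the hunk does not show the header, but the position of the `---` diagonal suggests it); `parse_rows` and `lookup` are hypothetical helper names, not part of the repository.

```python
# Three rows copied from the updated table in this hunk.
ROWS = """\
| Disaggregated Serving | Yes | Yes | Yes | --- | | | | | | | | | | |
| Chunked Prefill | Yes | Yes | Yes | Yes | --- | | | | | | | | | |
| KV Cache Reuse | Yes | Yes | Yes | Yes | Yes | Untested | Yes | No | Yes | Yes | --- | | | |
"""


def parse_rows(text):
    """Map feature name -> (diagonal index, statuses left of the diagonal)."""
    table = {}
    for line in text.strip().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        name, values = cells[0], cells[1:]
        diag = values.index("---")  # assumption: diagonal marks the row's own column
        table[name] = (diag, values[:diag])
    return table


def lookup(table, a, b):
    """Return the status cell for the pair (a, b), in either order."""
    (ia, va), (ib, vb) = table[a], table[b]
    if ia == ib:
        raise ValueError("a feature is not paired with itself")
    # The cell sits in the row of the feature with the larger index.
    return va[ib] if ia > ib else vb[ia]


table = parse_rows(ROWS)
print(lookup(table, "Chunked Prefill", "Disaggregated Serving"))
print(lookup(table, "Disaggregated Serving", "KV Cache Reuse"))
```

Both lookups read cells that this commit changes from `Untested` to `Yes`.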

tests/integration/defs/accuracy/test_disaggregated_serving.py

Lines changed: 37 additions & 0 deletions
@@ -797,6 +797,43 @@ def test_auto_dtype(self, overlap_scheduler):
             task = MMLU(self.MODEL_NAME)
             task.evaluate(llm)
 
+    def test_chunked_prefill(self):
+        ctx_server_config = {
+            "disable_overlap_scheduler": True,
+            "cuda_graph_config": None,
+            "cache_transceiver_config": {
+                "backend": "DEFAULT"
+            },
+            "enable_chunked_prefill": True,
+            "max_num_tokens": 256,
+        }
+        gen_server_config = {
+            "cuda_graph_config": None,
+            "cache_transceiver_config": {
+                "backend": "DEFAULT"
+            }
+        }
+        disaggregated_server_config = {
+            "hostname": "localhost",
+            "port": 8000,
+            "backend": "pytorch",
+            "context_servers": {
+                "num_instances": 1,
+                "urls": ["localhost:8001"]
+            },
+            "generation_servers": {
+                "num_instances": 1,
+                "urls": ["localhost:8002"]
+            }
+        }
+        with launch_disaggregated_llm(disaggregated_server_config,
+                                      ctx_server_config, gen_server_config,
+                                      self.MODEL_PATH) as llm:
+            task = GSM8K(self.MODEL_NAME)
+            task.evaluate(llm)
+            task = MMLU(self.MODEL_NAME)
+            task.evaluate(llm)
+
 
 @skip_pre_blackwell
 @pytest.mark.timeout(3600)
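
The context server in the new test sets `enable_chunked_prefill` with a small `max_num_tokens` of 256, presumably so that evaluation prompts longer than that are actually split across several prefill iterations. The sketch below illustrates the arithmetic only. It assumes a single request that receives the full token budget on every iteration; a real scheduler may share the budget among requests or align chunks to KV-cache block boundaries. `split_into_chunks` is a hypothetical helper, not a TensorRT-LLM API.

```python
def split_into_chunks(prompt_len, max_num_tokens):
    """Return the per-iteration prefill chunk sizes for one prompt.

    Simplified model: every iteration processes at most
    `max_num_tokens` prompt tokens until the prompt is consumed.
    """
    if max_num_tokens <= 0:
        raise ValueError("max_num_tokens must be positive")
    chunks = []
    remaining = prompt_len
    while remaining > 0:
        step = min(remaining, max_num_tokens)
        chunks.append(step)
        remaining -= step
    return chunks


# A 1000-token prompt under the test's budget of 256 tokens.
print(split_into_chunks(1000, 256))  # [256, 256, 256, 232]
# A prompt that fits in one iteration is not split.
print(split_into_chunks(100, 256))   # [100]
```

Under this model, the KV cache handed from the context server to the generation server has been built up over several chunks, which is the interaction the new accuracy test covers.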

tests/integration/test_lists/test-db/l0_dgx_h100.yml

Lines changed: 1 addition & 0 deletions
@@ -41,6 +41,7 @@ l0_dgx_h100:
 - accuracy/test_disaggregated_serving.py::TestLlama3_1_8BInstruct::test_ngram
 - accuracy/test_disaggregated_serving.py::TestQwen3_8B::test_auto_dtype[False]
 - accuracy/test_disaggregated_serving.py::TestQwen3_8B::test_auto_dtype[True]
+- accuracy/test_disaggregated_serving.py::TestQwen3_8B::test_chunked_prefill
 - accuracy/test_disaggregated_serving.py::TestLlama3_1_8BInstruct::test_eagle3[eagle3_one_model=False-overlap_scheduler=False]
 - accuracy/test_disaggregated_serving.py::TestLlama3_1_8BInstruct::test_eagle3[eagle3_one_model=True-overlap_scheduler=True]
 - accuracy/test_disaggregated_serving.py::TestLlama3_1_8BInstruct::test_guided_decoding[xgrammar]
