Commit 49edb97

update eagle3 related document for qwen3

Signed-off-by: bhsueh <[email protected]>
1 parent b6a3e0e commit 49edb97

File tree

1 file changed: +33 -0 lines changed

examples/models/core/qwen/README.md

Lines changed: 33 additions & 0 deletions
@@ -26,6 +26,7 @@ This document shows how to build and run a [Qwen](https://huggingface.co/Qwen) m
- [Serving](#serving)
  - [trtllm-serve](#trtllm-serve)
  - [Disaggregated Serving](#disaggregated-serving)
    - [Eagle3](#eagle3)
  - [Dynamo](#dynamo)
- [Notes and Troubleshooting](#notes-and-troubleshooting)
- [Credits](#credits)
@@ -891,6 +892,38 @@ Note that the optimal disaggregated serving configuration (i.e. tp/pp/ep mapping
on the request parameters, the number of concurrent requests and the GPU type. It is recommended to experiment to identify optimal
settings for your specific use case.

#### Eagle3

Qwen3 now supports Eagle3 speculative decoding. To enable Eagle3 on Qwen3, set the following arguments when running `trtllm-bench` or `trtllm-serve`:

- `speculative_config.decoding_type: Eagle`
  Sets the decoding type to `Eagle`, enabling Eagle3 speculative decoding.
- `speculative_config.max_draft_len: 3`
  Sets the maximum number of draft tokens generated per step (adjust this value as needed).
- `speculative_config.speculative_model_dir: <EAGLE3_DRAFT_MODEL_PATH>`
  Specifies the path to the Eagle3 draft model (make sure the corresponding draft model weights are prepared).

Currently, there are some limitations when enabling Eagle3:

1. `attention_dp` is not supported. Disable it or leave the related flag unset (it is disabled by default).
2. To use `enable_block_reuse`, the KV cache type of the target model and the draft model must match. Since the draft model only supports fp16/bf16, you need to disable `enable_block_reuse` when using an fp8 KV cache.
Example `extra-llm-api-config.yml` snippet for Eagle3:

```bash
echo "
enable_attention_dp: false
speculative_config:
  decoding_type: Eagle
  max_draft_len: 3
  speculative_model_dir: <EAGLE3_DRAFT_MODEL_PATH>
kv_cache_config:
  enable_block_reuse: false
" >> ${path_config}
```
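
With the config file in place, you can pass it to the serve command. A minimal launch sketch, assuming `${path_config}` points to the file written above and `<QWEN3_MODEL_PATH>` is a placeholder for your Qwen3 checkpoint directory:

```bash
# Launch an OpenAI-compatible endpoint with Eagle3 speculative decoding
# enabled via the extra LLM API options file written above.
trtllm-serve <QWEN3_MODEL_PATH> \
  --extra_llm_api_options ${path_config}
```

`trtllm-bench` accepts the same `--extra_llm_api_options` flag, so the identical file should be reusable when benchmarking Eagle3.
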
For further details, please refer to [speculative-decoding.md](../../../../docs/source/advanced/speculative-decoding.md).

### Dynamo

NVIDIA Dynamo is a high-throughput low-latency inference framework designed for serving generative AI and reasoning models in multi-node distributed environments.
