- [Serving](#serving)
  - [trtllm-serve](#trtllm-serve)
  - [Disaggregated Serving](#disaggregated-serving)
    - [Eagle3](#eagle3)
  - [Dynamo](#dynamo)
- [Notes and Troubleshooting](#notes-and-troubleshooting)
- [Credits](#credits)
Note that the optimal disaggregated serving configuration (i.e. tp/pp/ep mapping) depends on the request parameters, the number of concurrent requests and the GPU type. It is recommended to experiment to identify optimal settings for your specific use case.
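One way to run those experiments is to sweep a few candidate mappings and concurrency levels and compare the measured throughput. A minimal sketch, assuming an aggregated `trtllm-bench` run is an acceptable proxy for sizing each server role, and that `<QWEN3_MODEL_PATH>` and `<DATASET_PATH>` are placeholders for your model checkpoint and benchmark dataset (exact `trtllm-bench` flags may vary by version):

```bash
# Hypothetical sweep: measure throughput for a few tensor-parallel sizes
# and concurrency levels before settling on a disaggregated layout.
for tp in 2 4 8; do
  for concurrency in 32 64 128; do
    trtllm-bench --model <QWEN3_MODEL_PATH> throughput \
        --dataset <DATASET_PATH> \
        --tp ${tp} \
        --concurrency ${concurrency}
  done
done
```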
#### Eagle3
Qwen3 now supports Eagle3 speculative decoding. To enable Eagle3 on Qwen3, set the following arguments when running `trtllm-bench` or `trtllm-serve`:
- `speculative_config.decoding_type: Eagle`
  Set the decoding type to "Eagle" to enable Eagle3 speculative decoding.
- `speculative_config.max_draft_len: 3`
  Set the maximum number of draft tokens generated per step (this value can be adjusted as needed).
- `speculative_config.speculative_model_dir: <EAGLE3_DRAFT_MODEL_PATH>`
  Specify the path to the Eagle3 draft model (ensure the corresponding draft model weights are prepared).
Currently, there are some limitations when enabling Eagle3:

1. `attention_dp` is not supported. Please disable it or do not set the related flag (it is disabled by default).
2. If you want to use `enable_block_reuse`, the KV cache type of the target model and the draft model must be the same. Since the draft model only supports fp16/bf16, you need to disable `enable_block_reuse` when using an fp8 KV cache.
Example `extra-llm-api-config.yml` snippet for Eagle3:
```bash
# Append the Eagle3 options to the extra LLM API config file.
echo "
enable_attention_dp: false
speculative_config:
  decoding_type: Eagle
  max_draft_len: 3
  speculative_model_dir: <EAGLE3_DRAFT_MODEL_PATH>
kv_cache_config:
  enable_block_reuse: false
" >> ${path_config}
```
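With the config file written, you can point either tool at it. A minimal launch sketch, assuming `${path_config}` holds the path used above and `<QWEN3_MODEL_PATH>` and `<DATASET_PATH>` are placeholders for your Qwen3 checkpoint and benchmark dataset; `--extra_llm_api_options` is the standard way to pass these settings to `trtllm-serve` and `trtllm-bench`:

```bash
# Serve the target model with Eagle3 speculative decoding enabled.
trtllm-serve <QWEN3_MODEL_PATH> \
    --backend pytorch \
    --extra_llm_api_options ${path_config}

# Or benchmark with the same settings.
trtllm-bench --model <QWEN3_MODEL_PATH> throughput \
    --dataset <DATASET_PATH> \
    --backend pytorch \
    --extra_llm_api_options ${path_config}
```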
For further details, please refer to [speculative-decoding.md](../../../../docs/source/advanced/speculative-decoding.md).
### Dynamo
NVIDIA Dynamo is a high-throughput, low-latency inference framework designed for serving generative AI and reasoning models in multi-node distributed environments.