support TRTLLM_DEEP_EP_TOKEN_LIMIT to allow run deep-ep on memory-constrained GPUs #5684
Conversation
/bot run
PR_Github #10696 [ run ] triggered by Bot
PR_Github #10696 [ run ] completed with state
Force-pushed from f6a95d6 to ac839cb
/bot run
PR_Github #10859 [ run ] triggered by Bot
PR_Github #10859 [ run ] completed with state
/bot run
PR_Github #10865 [ run ] triggered by Bot
Force-pushed from ac839cb to 4233e7a
/bot run
PR_Github #10868 [ run ] triggered by Bot
PR_Github #10865 [ run ] completed with state
PR_Github #10868 [ run ] completed with state
Force-pushed from 4233e7a to d4df9b2
/bot run
PR_Github #11186 [ run ] triggered by Bot
PR_Github #11186 [ run ] completed with state
Force-pushed from d4df9b2 to 7024f73
/bot run
PR_Github #11325 [ run ] triggered by Bot
PR_Github #11325 [ run ] completed with state
The code related to self.use_postquant_alltoall should include the use_all_to_all check as well. Otherwise, there is a scenario where the AG/RS path is taken but the post-quant all2all logic is still triggered, as in the following code snippet.
if not disable_fp4_allgather() or self.use_postquant_alltoall:
    if isinstance(x, Fp4QuantizedTensor):
        x, x_sf = x.fp4_tensor, x.scaling_factor
        x_row = x.shape[0]
        # note: we use uint8 to store 2 fp4 values
        x_col = x.shape[1] * 2
    else:
        sf_swizzle = not self.use_postquant_alltoall
        x_row = x.shape[0]
        x_col = x.shape[1]
        x, x_sf = torch.ops.trtllm.fp4_quantize(
            x, self.fc31_input_scale, self.scaling_vector_size,
            False, sf_swizzle)
    if self.use_postquant_alltoall:
        x_sf = x_sf.view((x_row, -1))
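As a rough illustration of the reviewer's suggestion (not the actual patch), the two flags could be combined before branching, so the AG/RS path never takes the post-quant all2all handling. The helper below is a simplified, self-contained sketch; the names only mirror the snippet above and do not exist in the codebase.

# Hypothetical, simplified sketch of the suggested guard: post-quant all2all
# handling applies only when the all-to-all path is actually selected.
def select_quant_comm_path(use_all_to_all: bool,
                           use_postquant_alltoall: bool,
                           fp4_allgather_disabled: bool) -> str:
    # Combine the flags so AG/RS never triggers post-quant all2all logic.
    postquant_alltoall = use_postquant_alltoall and use_all_to_all

    if not fp4_allgather_disabled or postquant_alltoall:
        if postquant_alltoall:
            return "quantize, then all-to-all"    # post-quant all2all path
        return "quantize, then allgather/RS"      # AG/RS path
    return "no pre-communication quantization"

# Without combining the flags, use_all_to_all=False together with
# use_postquant_alltoall=True would wrongly pick the all2all handling.
assert select_quant_comm_path(False, True, False) == "quantize, then allgather/RS"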
There are a lot of variables related to communication method selection: enable_alltoall, use_all_to_all (from this PR), and use_allgather (to be introduced by #5684), which makes the logic hard to follow.
This comment is not meant to block this PR, but to suggest a future refactor of the communication method selection.
(Discussed in the following comment.)
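One possible shape for that refactor, purely as a hypothetical sketch (none of these names exist in the codebase): collapse the separate booleans into a single explicit selection that each call site can switch on.

# Hypothetical sketch only: fold enable_alltoall / use_all_to_all /
# use_allgather style flags into one explicit communication-method value.
from enum import Enum, auto

class MoeCommMethod(Enum):
    ALLGATHER_REDUCESCATTER = auto()   # AG/RS path
    ALLTOALL = auto()                  # e.g. DeepEP / post-quant all2all
    NONE = auto()                      # no cross-rank token exchange

def select_comm_method(alltoall_enabled: bool,
                       num_tokens: int,
                       token_limit: int,
                       ep_size: int) -> MoeCommMethod:
    if ep_size <= 1:
        return MoeCommMethod.NONE
    # All-to-all only when it is enabled and the batch stays under the
    # token limit (mirroring the TRTLLM_DEEP_EP_TOKEN_LIMIT idea in this PR).
    if alltoall_enabled and num_tokens <= token_limit:
        return MoeCommMethod.ALLTOALL
    return MoeCommMethod.ALLGATHER_REDUCESCATTER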
Force-pushed from cf24bf7 to c4e969e
/bot run
PR_Github #11682 [ run ] triggered by Bot
PR_Github #11682 [ run ] completed with state
Force-pushed from c4e969e to c64b707
/bot reuse-pipeline
PR_Github #11825 [ reuse-pipeline ] triggered by Bot
PR_Github #11825 [ reuse-pipeline ] completed with state
/bot run
PR_Github #11828 [ run ] triggered by Bot
PR_Github #11828 [ run ] completed with state
/bot run
PR_Github #11850 [ run ] triggered by Bot
PR_Github #11850 [ run ] completed with state
support TRTLLM_DEEP_EP_TOKEN_LIMIT to allow run deep-ep on memory-constrained GPUs. DeepEP requires additional RDMA memory for communication, and on memory-constrained GPUs, we may not have enough memory to enable DeepEP for both the context and decoding phases. In disaggregated serving scenarios, it's straightforward to enable DeepEP only on the decoding server. However, for inflight batching, we need to apply a token limit so that DeepEP is only used during decoding. Signed-off-by: Vincent Huang <[email protected]>
Force-pushed from c64b707 to 91fffb7
/bot run
PR_Github #11859 [ run ] triggered by Bot
PR_Github #11859 [ run ] completed with state
support TRTLLM_DEEP_EP_TOKEN_LIMIT to allow run deep-ep on memory-constrained GPUs (NVIDIA#5684) Signed-off-by: Vincent Huang <[email protected]>
Hi @ttyio,
Hi @yuantailing, I have not tested MTP; for non-MTP, I used max_local_batch_size.
Thank you! @ttyio |
none: support TRTLLM_DEEP_EP_TOKEN_LIMIT to allow run deep-ep on memory-constrained GPUs.
Description
DeepEP requires additional RDMA memory for communication, and on memory-constrained GPUs, we may not have enough memory to enable DeepEP for both the context and decoding phases. In disaggregated serving scenarios, it's straightforward to enable DeepEP only on the decoding server. However, for inflight batching, we need to apply a token limit so that DeepEP is only used during decoding.
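A minimal sketch of how such a token-limit gate could look (the environment variable name comes from this PR; the default behaviour and the exact integration point are assumptions for illustration):

# Illustrative only; the real check lives inside TensorRT-LLM's MoE path.
import os

def deep_ep_allowed(num_tokens: int) -> bool:
    """Allow DeepEP only when the current batch is small enough, e.g.
    decode-only iterations under inflight batching."""
    limit = int(os.environ.get("TRTLLM_DEEP_EP_TOKEN_LIMIT", "0"))
    if limit <= 0:
        return True          # assumed default: no limit configured
    return num_tokens <= limit

# Example: with TRTLLM_DEEP_EP_TOKEN_LIMIT=256, a prefill batch of a few
# thousand tokens would fall back to allgather/reduce-scatter, while decode
# steps (a handful of tokens per request) keep using DeepEP.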