
Conversation


@ttyio ttyio commented Jul 2, 2025

none: support TRTLLM_DEEP_EP_TOKEN_LIMIT to allow running DeepEP on memory-constrained GPUs.

Description

DeepEP requires additional RDMA memory for communication, and on memory-constrained GPUs, we may not have enough memory to enable DeepEP for both the context and decoding phases. In disaggregated serving scenarios, it's straightforward to enable DeepEP only on the decoding server. However, for inflight batching, we need to apply a token limit so that DeepEP is only used during decoding.
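As a minimal sketch of the gating described above (the helper name and env-var handling here are illustrative, not the actual TensorRT-LLM implementation), the idea is to fall back to the allgather/reduce-scatter path whenever the iteration carries more tokens than the limit, so DeepEP effectively runs only on small (decode-heavy) batches:

```python
import os

def use_deep_ep(num_tokens, token_limit=None):
    """Return True if DeepEP should be used for a batch of num_tokens.

    Hypothetical sketch: token_limit defaults to the
    TRTLLM_DEEP_EP_TOKEN_LIMIT env var; 0 or unset means no limit
    (always allow DeepEP).
    """
    if token_limit is None:
        token_limit = int(os.environ.get("TRTLLM_DEEP_EP_TOKEN_LIMIT", "0"))
    if token_limit <= 0:
        return True  # no limit configured: always allow DeepEP
    # Decode iterations contribute few tokens (roughly one per sequence),
    # while context/prefill iterations contribute many, so a limit near
    # the batch size enables DeepEP only during decoding under
    # inflight batching.
    return num_tokens <= token_limit
```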

@ttyio ttyio requested a review from a team as a code owner July 2, 2025 18:43
@ttyio ttyio requested review from lucaslie and juney-nvidia July 2, 2025 18:43

ttyio commented Jul 2, 2025

/bot run

@tensorrt-cicd

PR_Github #10696 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #10696 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #7906 completed with status: 'ABORTED'

@ttyio ttyio force-pushed the vincenth/wide-ep-token-limit2 branch 2 times, most recently from f6a95d6 to ac839cb Compare July 3, 2025 18:12

ttyio commented Jul 3, 2025

/bot run

@tensorrt-cicd

PR_Github #10859 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #10859 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #8025 completed with status: 'FAILURE'


ttyio commented Jul 3, 2025

/bot run

@tensorrt-cicd

PR_Github #10865 [ run ] triggered by Bot

@ttyio ttyio force-pushed the vincenth/wide-ep-token-limit2 branch from ac839cb to 4233e7a Compare July 3, 2025 21:12

ttyio commented Jul 3, 2025

/bot run

@tensorrt-cicd

PR_Github #10868 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #10865 [ run ] completed with state ABORTED

@tensorrt-cicd

PR_Github #10868 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #8034 completed with status: 'FAILURE'

@ttyio ttyio force-pushed the vincenth/wide-ep-token-limit2 branch from 4233e7a to d4df9b2 Compare July 7, 2025 22:38

ttyio commented Jul 7, 2025

/bot run

@tensorrt-cicd

PR_Github #11186 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #11186 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #8275 completed with status: 'FAILURE'

@ttyio ttyio force-pushed the vincenth/wide-ep-token-limit2 branch from d4df9b2 to 7024f73 Compare July 8, 2025 15:59

ttyio commented Jul 8, 2025

/bot run

@tensorrt-cicd

PR_Github #11325 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #11325 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #8378 completed with status: 'SUCCESS'

@ttyio ttyio requested a review from yuantailing July 8, 2025 22:41

@yilin-void yilin-void left a comment


The code related to self.use_postquant_alltoall should include the use_all_to_all check as well. Otherwise, there is a scenario where the AG/RS path is taken but the post-quant all2all logic is still triggered, as in the following code snippet.

if not disable_fp4_allgather() or self.use_postquant_alltoall:
    if isinstance(x, Fp4QuantizedTensor):
        x, x_sf = x.fp4_tensor, x.scaling_factor
        x_row = x.shape[0]
        # note: we use uint8 to store 2 fp4 values
        x_col = x.shape[1] * 2
    else:
        sf_swizzle = not self.use_postquant_alltoall
        x_row = x.shape[0]
        x_col = x.shape[1]
        x, x_sf = torch.ops.trtllm.fp4_quantize(
            x, self.fc31_input_scale, self.scaling_vector_size,
            False, sf_swizzle)
        if self.use_postquant_alltoall:
            x_sf = x_sf.view((x_row, -1))
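The suggested guard can be sketched as a pure predicate (hypothetical helper name; the real code would combine the two flags inline where the snippet above branches):

```python
def should_postquant_alltoall(use_postquant_alltoall, use_all_to_all):
    """Suggested guard from this review: take the post-quant all2all
    path only when all-to-all is actually selected for this iteration,
    so the AG/RS (allgather/reduce-scatter) path can never trigger the
    post-quant all2all logic by accident."""
    return use_postquant_alltoall and use_all_to_all
```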


@yuantailing yuantailing left a comment


There are a lot of variables related to communication-method selection: enable_alltoall, use_all_to_all (from this PR), and use_allgather (to be introduced by #5684), which makes the logic hard to follow.

This comment is not meant to block this PR, but to suggest a future refactor of communication-method selection.

(Discuss in the following comment)

@ttyio ttyio force-pushed the vincenth/wide-ep-token-limit2 branch from cf24bf7 to c4e969e Compare July 11, 2025 23:07

ttyio commented Jul 11, 2025

/bot run

@tensorrt-cicd

PR_Github #11682 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #11682 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #8649 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@ttyio ttyio force-pushed the vincenth/wide-ep-token-limit2 branch from c4e969e to c64b707 Compare July 14, 2025 15:50

ttyio commented Jul 14, 2025

/bot reuse-pipeline

@tensorrt-cicd

PR_Github #11825 [ reuse-pipeline ] triggered by Bot

@tensorrt-cicd

PR_Github #11825 [ reuse-pipeline ] completed with state SUCCESS
Reusing PR_Github #11682 for commit c64b707


ttyio commented Jul 14, 2025

/bot run

@tensorrt-cicd

PR_Github #11828 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #11828 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #8764 completed with status: 'FAILURE'


ttyio commented Jul 14, 2025

/bot run

@tensorrt-cicd

PR_Github #11850 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #11850 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #8781 completed with status: 'FAILURE'

ttyio added 4 commits July 14, 2025 18:09
…strained GPUs.

DeepEP requires additional RDMA memory for communication, and on memory-constrained GPUs, we may not have enough memory to enable DeepEP for both the context and decoding phases. In disaggregated serving scenarios, it's straightforward to enable DeepEP only on the decoding server. However, for inflight batching, we need to apply a token limit so that DeepEP is only used during decoding.

Signed-off-by: Vincent Huang <[email protected]>
Signed-off-by: Vincent Huang <[email protected]>
@ttyio ttyio force-pushed the vincenth/wide-ep-token-limit2 branch from c64b707 to 91fffb7 Compare July 15, 2025 01:09

ttyio commented Jul 15, 2025

/bot run

@tensorrt-cicd

PR_Github #11859 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #11859 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #8788 completed with status: 'SUCCESS'

@Naveassaf Naveassaf merged commit 0523f77 into NVIDIA:main Jul 15, 2025
3 checks passed
evezhier pushed a commit to evezhier/TensorRT-LLM that referenced this pull request Jul 16, 2025
@yuantailing

Hi @ttyio ,
Do you have a suggested value for TRTLLM_DEEP_EP_TOKEN_LIMIT? Should it be equal to max_batch_size * (1 + MTP) or something else?


ttyio commented Jul 25, 2025

> Hi @ttyio , Do you have a suggested value for TRTLLM_DEEP_EP_TOKEN_LIMIT? Should it be equal to max_batch_size * (1 + MTP) or something else?

Hi @yuantailing , I have not tested MTP; for non-MTP, I used max_local_batch_size.
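The heuristic discussed here can be written out as a small helper (hypothetical name; the MTP scaling follows @yuantailing's suggested formula and is untested per the reply above):

```python
def suggested_deep_ep_token_limit(max_local_batch_size, mtp_draft_tokens=0):
    """Candidate value for TRTLLM_DEEP_EP_TOKEN_LIMIT.

    Non-MTP (mtp_draft_tokens=0): max_local_batch_size worked in
    practice, per this thread. With MTP, each sequence contributes
    (1 + draft tokens) per decode step, so the limit would scale
    accordingly -- an untested assumption, per the reply above.
    """
    return max_local_batch_size * (1 + mtp_draft_tokens)
```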

@yuantailing

Thank you! @ttyio
