support TRTLLM_DEEP_EP_TOKEN_LIMIT to allow run deep-ep on memory-constrained GPUs #5684
Conversation
/bot run
PR_Github #10696 [ run ] triggered by Bot
PR_Github #10696 [ run ] completed with state
Force-pushed from f6a95d6 to ac839cb
/bot run
PR_Github #10859 [ run ] triggered by Bot
PR_Github #10859 [ run ] completed with state
/bot run
PR_Github #10865 [ run ] triggered by Bot
Force-pushed from ac839cb to 4233e7a
/bot run
PR_Github #10868 [ run ] triggered by Bot
PR_Github #10865 [ run ] completed with state
PR_Github #10868 [ run ] completed with state
Force-pushed from 4233e7a to d4df9b2
/bot run
PR_Github #11186 [ run ] triggered by Bot
PR_Github #11186 [ run ] completed with state
Force-pushed from d4df9b2 to 7024f73
/bot run
PR_Github #11325 [ run ] triggered by Bot
PR_Github #11325 [ run ] completed with state
The code related to self.use_postquant_alltoall should include the use_all_to_all check as well. Otherwise, there is a scenario where the AG/RS path is taken but the post-quant all2all logic is still triggered, as in the following code snippet.
if not disable_fp4_allgather() or self.use_postquant_alltoall:
    if isinstance(x, Fp4QuantizedTensor):
        x, x_sf = x.fp4_tensor, x.scaling_factor
        x_row = x.shape[0]
        # note: we use uint8 to store 2 fp4 values
        x_col = x.shape[1] * 2
    else:
        sf_swizzle = not self.use_postquant_alltoall
        x_row = x.shape[0]
        x_col = x.shape[1]
        x, x_sf = torch.ops.trtllm.fp4_quantize(
            x, self.fc31_input_scale, self.scaling_vector_size,
            False, sf_swizzle)
    if self.use_postquant_alltoall:
        x_sf = x_sf.view((x_row, -1))
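As a rough illustration of the reviewer's suggestion (not the actual patch), the two flags could be combined before branching, so the AG/RS path never takes the post-quant all2all handling. The helper below is a simplified, self-contained sketch; the names only mirror the snippet above and do not exist in the codebase.

# Hypothetical, simplified sketch of the suggested guard: post-quant all2all
# handling applies only when the all-to-all path is actually selected.
def select_quant_comm_path(use_all_to_all: bool,
                           use_postquant_alltoall: bool,
                           fp4_allgather_disabled: bool) -> str:
    # Combine the flags so AG/RS never triggers post-quant all2all logic.
    postquant_alltoall = use_postquant_alltoall and use_all_to_all

    if not fp4_allgather_disabled or postquant_alltoall:
        if postquant_alltoall:
            return "quantize, then all-to-all"    # post-quant all2all path
        return "quantize, then allgather/RS"      # AG/RS path
    return "no pre-communication quantization"

# Without combining the flags, use_all_to_all=False together with
# use_postquant_alltoall=True would wrongly pick the all2all handling.
assert select_quant_comm_path(False, True, False) == "quantize, then allgather/RS"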
There are a lot of variables related to communication method selection: enable_alltoall, use_all_to_all (from this PR), and use_allgather (to be introduced by #5684), which makes the logic hard to follow.
This comment is not meant to block this PR, but to suggest a future refactor of the communication method selection.
(Discussed in the following comment.)
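One possible shape for that refactor, purely as a hypothetical sketch (none of these names exist in the codebase): collapse the separate booleans into a single explicit selection that each call site can switch on.

# Hypothetical sketch only: fold enable_alltoall / use_all_to_all /
# use_allgather style flags into one explicit communication-method value.
from enum import Enum, auto

class MoeCommMethod(Enum):
    ALLGATHER_REDUCESCATTER = auto()   # AG/RS path
    ALLTOALL = auto()                  # e.g. DeepEP / post-quant all2all
    NONE = auto()                      # no cross-rank token exchange

def select_comm_method(alltoall_enabled: bool,
                       num_tokens: int,
                       token_limit: int,
                       ep_size: int) -> MoeCommMethod:
    if ep_size <= 1:
        return MoeCommMethod.NONE
    # All-to-all only when it is enabled and the batch stays under the
    # token limit (mirroring the TRTLLM_DEEP_EP_TOKEN_LIMIT idea in this PR).
    if alltoall_enabled and num_tokens <= token_limit:
        return MoeCommMethod.ALLTOALL
    return MoeCommMethod.ALLGATHER_REDUCESCATTER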
Force-pushed from cf24bf7 to c4e969e
/bot run
PR_Github #11682 [ run ] triggered by Bot
PR_Github #11682 [ run ] completed with state
Force-pushed from c4e969e to c64b707
/bot reuse-pipeline
PR_Github #11825 [ reuse-pipeline ] triggered by Bot
PR_Github #11825 [ reuse-pipeline ] completed with state
/bot run
PR_Github #11828 [ run ] triggered by Bot
PR_Github #11828 [ run ] completed with state
/bot run
PR_Github #11850 [ run ] triggered by Bot
PR_Github #11850 [ run ] completed with state
support TRTLLM_DEEP_EP_TOKEN_LIMIT to allow run deep-ep on memory-constrained GPUs. DeepEP requires additional RDMA memory for communication, and on memory-constrained GPUs, we may not have enough memory to enable DeepEP for both the context and decoding phases. In disaggregated serving scenarios, it's straightforward to enable DeepEP only on the decoding server. However, for inflight batching, we need to apply a token limit so that DeepEP is only used during decoding. Signed-off-by: Vincent Huang <[email protected]>
Force-pushed from c64b707 to 91fffb7
/bot run
PR_Github #11859 [ run ] triggered by Bot
PR_Github #11859 [ run ] completed with state
support TRTLLM_DEEP_EP_TOKEN_LIMIT to allow run deep-ep on memory-constrained GPUs (NVIDIA#5684) Signed-off-by: Vincent Huang <[email protected]>
Hi @ttyio,
Hi @yuantailing, I have not tested MTP; for non-MTP, I used max_local_batch_size.
Thank you! @ttyio |
none: support TRTLLM_DEEP_EP_TOKEN_LIMIT to allow run deep-ep on memory-constrained GPUs.
Description
DeepEP requires additional RDMA memory for communication, and on memory-constrained GPUs, we may not have enough memory to enable DeepEP for both the context and decoding phases. In disaggregated serving scenarios, it's straightforward to enable DeepEP only on the decoding server. However, for inflight batching, we need to apply a token limit so that DeepEP is only used during decoding.
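A minimal sketch of how such a token-limit gate could look (the environment variable name comes from this PR; the default behaviour and the exact integration point are assumptions for illustration):

# Illustrative only; the real check lives inside TensorRT-LLM's MoE path.
import os

def deep_ep_allowed(num_tokens: int) -> bool:
    """Allow DeepEP only when the current batch is small enough, e.g.
    decode-only iterations under inflight batching."""
    limit = int(os.environ.get("TRTLLM_DEEP_EP_TOKEN_LIMIT", "0"))
    if limit <= 0:
        return True          # assumed default: no limit configured
    return num_tokens <= limit

# Example: with TRTLLM_DEEP_EP_TOKEN_LIMIT=256, a prefill batch of a few
# thousand tokens would fall back to allgather/reduce-scatter, while decode
# steps (a handful of tokens per request) keep using DeepEP.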