fix bug: dp+tp warmup #3991
base: main
Conversation
48b2093 to bbb0774 (compare)
Note that we might change the behaviour of DP+TP in the future.
Could you tell me what it's specifically about?
The current implementation pads inputs to the same batch size (input_meta), so for each layer the pipeline would be [...]. I want to decouple DP and TP: DP2 + TP2 would use 4 GPUs, and each DP rank would be a single engine with TP2. This is good for DP+TP+EP (less padding and fewer collective ops). I have not finished planning yet; any advice is welcome.
For models using GQA or MHA, this is indeed a superior solution. However, for MLA or MQA architectures, TP cannot partition the KV cache by the number of heads, so each GPU ends up storing a full replica of the KV cache. In such cases, DP can be applied specifically to the attention component to reduce the per-GPU memory footprint of the KV cache. Therefore, I believe retaining the current implementation approach in future versions remains a viable strategy.
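A rough back-of-the-envelope sketch of this point (hypothetical shapes and a simplified cache layout, not LMDeploy's actual implementation): sharding KV heads across TP ranks shrinks the per-GPU cache for GQA, but a single-head MQA/MLA-style cache cannot be split further, so every rank keeps a full copy.

```python
# Hypothetical sizes, for illustration only -- not LMDeploy's real cache layout.

def kv_cache_bytes_per_gpu(num_kv_heads: int, head_dim: int, num_layers: int,
                           tokens: int, dtype_bytes: int, tp: int) -> int:
    """Per-GPU KV cache size when KV heads are sharded across TP ranks."""
    # Each rank must hold at least one KV head; a 1-head (MQA/MLA-style) cache
    # cannot be partitioned, so every rank stores a full replica.
    heads_per_rank = max(num_kv_heads // tp, 1)
    # Factor of 2 covers K and V; a real MLA cache stores a compressed latent
    # instead, but the TP-sharding argument is the same.
    return 2 * num_layers * tokens * heads_per_rank * head_dim * dtype_bytes

# GQA example (8 KV heads): TP4 divides the per-GPU cache by 4.
gqa = kv_cache_bytes_per_gpu(num_kv_heads=8, head_dim=128, num_layers=32,
                             tokens=64_000, dtype_bytes=2, tp=4)

# MQA/MLA-style example (a single KV "head"): TP4 changes nothing, each GPU
# keeps a full copy -- which is why DP attention helps here.
mla = kv_cache_bytes_per_gpu(num_kv_heads=1, head_dim=512, num_layers=32,
                             tokens=64_000, dtype_bytes=2, tp=4)

print(f"GQA per-GPU cache: {gqa / 2**30:.2f} GiB")
print(f"MQA/MLA per-GPU cache: {mla / 2**30:.2f} GiB")
```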
LGTM
Hi, @Tsundoku958
#4004 is the DP-TP refactor with both TP implementations. Feel free to review the PR.
Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and easier to review. If you do not understand some items, don't worry: just open the pull request and ask the maintainers for help.
Motivation
During the warmup phase of LMDeploy when using Data Parallelism (DP) + Tensor Parallelism (TP), the build_dp_meta() function is not invoked.
Reproduction:
GPU: H20
Command:
Traceback:

Modification
Call inputs.build_dp_meta() before calling _forward_impl.
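A minimal sketch of the idea (the names warmup and _forward_impl follow the PR description; the surrounding engine code is simplified and hypothetical):

```python
# Simplified, hypothetical warmup path -- only meant to show where the missing
# call goes, not LMDeploy's actual engine code.

def warmup(model_agent, inputs, dist_ctx):
    """Run a dummy forward pass so kernels/graphs are prepared before serving."""
    if dist_ctx.dp > 1:
        # Fix: the warmup path previously skipped this step, so dp_meta was
        # never attached to the inputs and the DP+TP forward pass failed.
        inputs.build_dp_meta()
    return model_agent._forward_impl(inputs)
```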
BC-breaking (Optional)
Does the modification introduce changes that break backward compatibility for downstream repositories?
If so, please describe how it breaks compatibility and how downstream projects should modify their code to remain compatible with this PR.
Use cases (Optional)
If this PR introduces a new feature, please list some use cases here and update the documentation.
Checklist