Tensor Parallel performance is not better than eager mode. #36222
Comments
Do you mean that multi-GPU performance is not better than single-GPU? From your report, I see that your system uses PCIe instead of NVLink.
With a slow interconnect, I would recommend Pipeline Parallel (PP) instead of Tensor Parallel, because PP is better at hiding communication latency. In the ideal case it also increases system throughput by the number of GPUs.
Yeah, the above benchmark is from an 8x H100 machine with fully connected NVLink. Your benchmark script may not capture the actual time needed, though: CUDA kernels launch asynchronously, so the interval between your two timestamps may not cover the GPU work. You'd need to call `torch.cuda.synchronize()` before recording the end timestamp.
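A minimal sketch of that timing fix, assuming a benchmark loop built around `model.generate` (the `model` and `inputs` names are placeholders, not from the thread):

```python
import time
import torch

torch.cuda.synchronize()           # make sure previously launched work is done
start = time.perf_counter()

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=128)

torch.cuda.synchronize()           # wait for all generation kernels to finish
end = time.perf_counter()
print(f"latency: {(end - start) * 1e3:.1f} ms")
```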
Hi @kwen2501. Thanks for the reminder. I have updated the script and the performance table, but I still can't get the expected acceleration. It would be great if you could give me more details about your benchmark.
Thanks @kwen2501 for helping! 🤗 I believe we can close this now? 🤗
Hi @kwen2501. I found a new issue: the model weight shape stays the same regardless of the TP size. I added `shape = model.model.layers[0].self_attn.q_proj.weight.shape` and `print(f"weight shape is {shape}")` after loading the model in my script, and the output is always the full, unsharded shape. I thought each card would only hold a part of the model when running TP. Is this expected?
You are seeing the same shape as before because the weight is stored as a `DTensor`, and its `.shape` reports the global (unsharded) shape rather than the size of the local shard on each rank.
Hi @kwen2501. Do you know how to check the sharded shape? I want to know which part of the model lives on each GPU, to make sure the split is not unbalanced.
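Not an answer from the thread, but one way to inspect the per-rank shard, assuming the TP implementation stores parameters as torch `DTensor`s (exposed as `torch.distributed.tensor.DTensor` in recent PyTorch):

```python
import torch.distributed as dist
from torch.distributed.tensor import DTensor

weight = model.model.layers[0].self_attn.q_proj.weight
if isinstance(weight, DTensor):
    # .shape is the global (logical) shape; .to_local() gives this rank's shard
    print(
        f"rank {dist.get_rank()}: global {tuple(weight.shape)}, "
        f"local {tuple(weight.to_local().shape)}, "
        f"placements {weight.placements}"
    )
```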
System Info
Docker image: nvcr.io/nvidia/pytorch:25.01-py3
Hardware: NVIDIA A100
Who can help?
@SunMarc @ArthurZucker @kwen2501
Reproduction
CMD:
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc-per-node 4 run_tp_hf.py
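run_tp_hf.py itself is not shown in the issue; the following is only a minimal sketch of a tensor-parallel benchmark along the lines described, assuming the `tp_plan="auto"` loading path from the related PR. The checkpoint name, prompt, and generation length are placeholders.

```python
# run_tp_hf.py (sketch) -- launch with torchrun as in the command above
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder checkpoint

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    tp_plan="auto",               # shard the weights across the torchrun ranks
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)

# warm-up, then a timed run with explicit synchronization
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=16)

torch.cuda.synchronize()
start = time.perf_counter()
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=128)
torch.cuda.synchronize()
print(f"latency: {(time.perf_counter() - start) * 1e3:.1f} ms")
```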
Expected behavior
Latency Performance (ms):
tp_size equals the world size.
The speed-up is not what the documentation claims.
Related PR: 34184