Hi, do you know why the synchronization time from 4pi to 8pi suddenly increases? #20
Comments
Hello, what router are you using?
Thank you for your reply. It's just based on the results in your first table. For example, Llama 2 7B: from 34.06 ms to 289.75 ms, an increase of nearly 10 times. What factors do you think restrict the communication between devices?
Could you post a link to these results? Normally, the synchronization time is very similar during inference, like here.
I was wondering about it as well; I suppose in this case the problem may be a weak router/switch. I used a cheap TP-Link LS1008G switch, which may slow down under heavy load. The other thing is that the amount of data required to synchronize doesn't grow linearly with the amount of parameters (7B, 13B, 70B). What matters most are model parameters like the number of blocks (7B: 32, 70B: 80), the length of the "dim" vector, etc. For example, Llama 2 70B on 4 devices with the Q80 buffer requires 14917 kB to synchronize the state. Grok-1 314B requires only 21013 kB, yet it is 4.4x larger (!). Also, if you look at this report, where the link between nodes was highly efficient, the transfer time on 2 and 4 devices is similar for the 70B model, but the amount of bytes is almost 3x larger for 4 devices (28.50 ms, 14917 kB) than for 2 devices (25.00 ms, 5525 kB).
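To make the scaling argument concrete, here is a rough, hypothetical estimator. It is not the exact formula used by distributed-llama; the Q80 constant and the assumption of a fixed number of dim-sized syncs per block are mine. It only illustrates why the per-token traffic tracks the number of blocks and the "dim" vector rather than the total weight count:

```cpp
// Rough, hypothetical estimator of the data synchronized per token. This is
// NOT the exact distributed-llama formula; it only illustrates why the traffic
// scales with nBlocks * dim (and the number of nodes), not with the weight count.
#include <cstdio>

// Q80 stores 32 int8 values plus one 16-bit scale per group: ~34 bytes / 32 values.
constexpr double Q80_BYTES_PER_VALUE = 34.0 / 32.0;

// Assumption: each block synchronizes a dim-sized activation vector with every
// worker node a fixed number of times (syncsPerBlock is a made-up parameter).
double estimateSyncKb(int nBlocks, int dim, int nNodes, int syncsPerBlock) {
    double bytesPerSync = dim * Q80_BYTES_PER_VALUE * (nNodes - 1);
    return nBlocks * syncsPerBlock * bytesPerSync / 1000.0;
}

int main() {
    // Llama 2 7B: 32 blocks, dim = 4096; Llama 2 70B: 80 blocks, dim = 8192.
    double kb7  = estimateSyncKb(32, 4096, 4, 2);
    double kb70 = estimateSyncKb(80, 8192, 4, 2);
    // ~10x more parameters, but only ~5x more synchronized data,
    // because the traffic follows nBlocks * dim.
    printf("70B / 7B sync traffic ratio: %.1fx\n", kb70 / kb7);
    return 0;
}
```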
Thanks for your explanation.
Thank you! Have you run any experiments on a high-end switch, e.g. a Google Cloud service? If the result supports the poor-switch hypothesis, then the bottleneck of this repo is not the communication overhead.
@zhengpeirong please check this report (4 x c3d-highcpu-30 / Google Cloud). For Llama 7B / Q40 Weights Q80 Buffer I got:
The data needed for the synchronization per 1 token (Q80 Buffer):
So yes, if you have a fast enough link between nodes, the communication is not the bottleneck. Btw: a USB4 link may achieve 10 Gbps; Google Cloud is much, much slower than that.
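For intuition, a quick back-of-the-envelope check (ideal link, ignoring per-transfer latency, protocol overhead, and the fact that transfers to several workers may not fully overlap), using the 14917 kB/token figure quoted above:

```cpp
// Back-of-the-envelope transfer-time check for an ideal link.
#include <cstdio>

int main() {
    double kbPerToken = 14917.0;    // Llama 2 70B, 4 devices, Q80 buffer (figure from this thread)
    double gbps[] = { 1.0, 10.0 };  // 1 Gbit Ethernet vs. a ~10 Gbps USB4-class link
    for (double g : gbps) {
        double ms = (kbPerToken * 1000.0 * 8.0) / (g * 1e9) * 1000.0;
        printf("%.0f Gbps link: ~%.1f ms/token\n", g, ms); // ~119 ms vs. ~12 ms
    }
    return 0;
}
```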
Thank you!! I have checked the specifications of the switch:
@b4rtaz I recommend synchronizing more often, with smaller chunks, instead of sending the whole QKV or FFN result at once. This should improve the transfer time by reducing the communication overhead to a level that even a low-end switch can handle.
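A minimal sketch of one way to read this suggestion, assuming the result can be produced slice by slice; `computeSlice` and `sendToWorkers` are placeholders, not the project's actual API. The point is that each slice can start travelling over the network while the next one is still being computed:

```cpp
// Hypothetical sketch of "sync smaller chunks more often": instead of computing
// the whole Q/K/V or FFN result and then synchronizing it in one big transfer,
// send each slice to the other nodes as soon as it is ready, so communication
// can overlap with the remaining computation.
#include <cstddef>
#include <vector>

void computeSlice(std::vector<float>& out, size_t sliceIndex);              // placeholder
void sendToWorkers(const float* data, size_t count, size_t sliceIndex);     // placeholder (ideally non-blocking)

void syncInChunks(std::vector<float>& result, size_t nSlices, size_t sliceSize) {
    for (size_t s = 0; s < nSlices; s++) {
        computeSlice(result, s);                                     // produce one slice of the output
        sendToWorkers(result.data() + s * sliceSize, sliceSize, s);  // start its transfer immediately
    }
    // By the time the last slice is computed, the earlier slices are already on
    // the wire, so less time is spent waiting on the network at the end.
}
```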
Today, I tried a minor adjustment to the order of synchronization:
Setup: 4 x Raspberry Pi 5 8GB, Llama 3 8B Q40, Q80 Buffer, TP-Link LS1008G Switch.
Results (Edit: I'm hiding these results because they contain an error, check the discussion below):
current
new (commit: d5b8354)
Conclusions: it seems this change reduced the synchronization time by 12 ms / token, which is a very good improvement. It looks like there is more to improve if this works.
I found that the llamaSyncAttQ, llamaSyncAttK, and llamaSyncAttV tasks are all set to TASK_TYPE_INFERENCE. That may affect the transfer time statistics.
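A hedged illustration of why the tag matters, assuming the engine accumulates wall-clock time per task type when it reports inference vs. transfer time. The enum and table below only mirror the names quoted in this thread; the real definitions in the repo may differ:

```cpp
// Sketch only: if time is accumulated per task type, a sync task tagged as
// TASK_TYPE_INFERENCE is counted as inference time instead of transfer time,
// which would skew the comparison between versions.
enum TaskType { TASK_TYPE_INFERENCE, TASK_TYPE_TRANSFER };

struct TaskDef {
    const char* name;
    TaskType type;
};

static TaskDef llamaTasks[] = {
    { "llamaSyncAttQ", TASK_TYPE_TRANSFER },  // was TASK_TYPE_INFERENCE
    { "llamaSyncAttK", TASK_TYPE_TRANSFER },  // was TASK_TYPE_INFERENCE
    { "llamaSyncAttV", TASK_TYPE_TRANSFER },  // was TASK_TYPE_INFERENCE
};
```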
Setup: the same as before.
0.3.0 (commit: ad10e18)
0.3.1 (commit: 7f63f9e)
So we have for 0.3.0 = … My setup looks very non-deterministic. Yesterday I observed an average inference time close to … Yesterday I achieved a similar average for the transfer time with the old version as today with the new one. So I think my tests cannot prove or disprove whether this approach is better.
So for now I have reverted to the previous implementation; it is easier to maintain.