Question about the performance of tensor parallelism #4525
-
Hi, thanks for this awesome work! I am currently trying to benchmark the inference performance of LLaMA-2 with the tensor parallelism provided by ShardFormer. To benchmark LLaMA-2, I modified the example script in the repository. The experiment settings are as follows:
My expectation was that tensor parallelism adds the cost of communicating activations between GPUs, so it should be somewhat slower than running without tensor parallelism, while providing a significant reduction in per-GPU memory for only a small communication overhead.
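To put rough numbers on that trade-off, here is a back-of-the-envelope sketch. It assumes a LLaMA-2-7B-like configuration and Megatron-style tensor parallelism (two activation all-reduces per transformer layer in the forward pass); the figures are illustrative, not measured.

```python
# Back-of-the-envelope estimate of the tensor-parallel trade-off described above.
# Assumptions: LLaMA-2-7B-like sizes, fp16 weights/activations, Megatron-style TP
# with two activation all-reduces per transformer layer in the forward pass.
hidden, layers, params = 4096, 32, 7e9
tp_size, batch, seq_len = 2, 1, 2048
bytes_fp16 = 2

# Per-GPU weight memory: full replica vs. sharded across tp_size GPUs.
weights_full_gb = params * bytes_fp16 / 2**30
weights_tp_gb = weights_full_gb / tp_size

# Activation volume all-reduced per forward pass (2 all-reduces per layer).
activation_bytes = batch * seq_len * hidden * bytes_fp16
comm_gb = 2 * layers * activation_bytes / 2**30

print(f"weights per GPU: {weights_full_gb:.1f} GB -> {weights_tp_gb:.1f} GB at tp_size={tp_size}")
print(f"activation all-reduce volume per forward pass: ~{comm_gb:.2f} GB")
```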
In the experiment, however, I found that inference with tensor parallelism is almost 3 times faster than without it. Is there anything wrong with this experiment result? In each experiment, I only changed the tensor parallelism configuration. Besides, I found that flash attention does not provide the performance improvement the original paper claimed. Is this a common phenomenon?
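For reference, a minimal sketch of the kind of benchmark described above, assuming the `HybridParallelPlugin`/`Booster` API from ColossalAI. The constructor arguments used here (`tp_size`, `pp_size`, `precision`, `enable_flash_attention`) and the `launch_from_torch` signature should be checked against your ColossalAI version, and the randomly initialized model merely stands in for a real LLaMA-2 checkpoint.

```python
# Minimal benchmark sketch (assumption: HybridParallelPlugin/Booster API as of the
# feature/shardformer branch; verify argument names against your ColossalAI version).
# Run with: torchrun --nproc_per_node=2 bench.py
import time

import torch
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import HybridParallelPlugin
from transformers import LlamaConfig, LlamaForCausalLM

colossalai.launch_from_torch(config={})

# Only this configuration changes between runs; tp_size=1 gives the
# no-tensor-parallel baseline.
plugin = HybridParallelPlugin(tp_size=2, pp_size=1, precision="fp16",
                              enable_flash_attention=True)
booster = Booster(plugin=plugin)

model = LlamaForCausalLM(LlamaConfig())   # random weights; use a real checkpoint in practice
model, *_ = booster.boost(model)          # shards the model according to the plugin
model.eval()

input_ids = torch.randint(0, 32000, (1, 2048), device="cuda")

with torch.no_grad():
    for _ in range(3):                    # warm-up iterations
        model(input_ids)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(10):                   # timed forward passes
        model(input_ids)
    torch.cuda.synchronize()

print(f"average forward latency: {(time.time() - start) / 10 * 1000:.1f} ms")
```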
-
This is normal. ShardFormer shards the model across GPUs, and because LLaMA-2 runs with a long sequence length, ShardFormer can effectively increase the inference speed.
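To make this concrete, here is a rough (and heavily simplified) model of why tensor parallelism can pay off at long sequence lengths: per-GPU compute scales down with `tp_size`, while the all-reduce traffic only scales with the activation size. The hardware numbers (312 TFLOPS fp16 peak, ~200 GB/s effective all-reduce bandwidth) are assumptions for an A100-class node, not measurements.

```python
# Rough compute-vs-communication model; all hardware numbers are assumptions.
hidden, layers = 4096, 32
flops_per_token_per_layer = 24 * hidden ** 2   # attention + MLP matmuls, roughly
peak_flops, link_bw = 312e12, 200e9            # assumed fp16 peak and all-reduce bandwidth

for seq_len in (128, 2048):
    for tp in (1, 2):
        compute_s = seq_len * layers * flops_per_token_per_layer / (tp * peak_flops)
        # Two fp16 activation all-reduces per layer when tensor parallelism is on.
        comm_s = 0.0 if tp == 1 else 2 * layers * seq_len * hidden * 2 / link_bw
        print(f"seq={seq_len:5d} tp={tp}: compute ~{compute_s*1e3:.2f} ms, "
              f"comm ~{comm_s*1e3:.2f} ms")
```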
-
Hi @flybird11111, I have a few further questions. Can you recommend a good configuration combination? In my case, flash attention only provides about a 15% speed-up in the forward and backward passes. Looking forward to your benchmark results!!!
ShardFormer implements a tensor parallelism strategy specifically for transformer models, which effectively reduces communication costs. You can leverage the features provided by the HybridParallelPlugin: https://github.com/hpcaitech/ColossalAI/blob/feature/shardformer/colossalai/booster/plugin/hybrid_parallel_plugin.py
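As one possible starting point (not an official recommendation), the plugin exposes several feature flags that can be combined. The flag names below (`enable_flash_attention`, `enable_fused_normalization`, `enable_jit_fused`) come from the plugin linked above, but you should verify them against the version of ColossalAI you are using.

```python
# One possible configuration combination; flag names are taken from the
# HybridParallelPlugin linked above, double-check them in your version.
from colossalai.booster import Booster
from colossalai.booster.plugin import HybridParallelPlugin

plugin = HybridParallelPlugin(
    tp_size=2,                        # tensor parallel degree
    pp_size=1,                        # no pipeline parallelism
    precision="fp16",
    enable_flash_attention=True,      # fused attention kernels
    enable_fused_normalization=True,  # fused LayerNorm/RMSNorm kernels
    enable_jit_fused=True,            # JIT-fused elementwise ops
)
booster = Booster(plugin=plugin)
# model, *_ = booster.boost(model)    # then benchmark as in the script above
```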