Question for the performance of tensor parallelism #4525

eric8607242 · 2023-08-28T06:33:21Z

eric8607242
Aug 28, 2023

Hi, thanks for this awesome work!

Currently, I am trying to benchmark the inference performance for LLaMAv2 with the tensor parallelism ShardFormer provided.
After benchmarking, I get an unexpected result.
Hope experts here can help me understand the results.

To benchmark the LLaMAv2, I modify the example script in ColossalAI/examples/language/gpt/gemini/train_gpt_demo.py for LLaMAv2 and ShardFormer.

The experiment settings are shown as follows:

GPU num: 4 (A100)
Batch size (per device): 4
Seq length: 1024
Displan: CAI_ZeRO2

In my imagination, with tensor parallelism, we have to pay more cost to communicate the activation between each GPU, which leads to slower than without tensor parallelism. But the tensor parallelism can provide significant GPU memory reduction for us with slight communication cost.
However, the following are the benchmark results of my experiment:

zero2:
   - Fwd Time: 0.383 sec
   - Bwd Time: 1.19 sec
   - Optim Time: 0.37 sec
   - GPU memory (per device): 61.3 GB
zero2 + tensor parallel:
   - Fwd Time: 0.15 sec
   - Bwd Time: 0.484 sec
   - Optim Time: 0.09 sec
   - GPU memory (per device): 17 GB
zero2 + w. flash-attn2:
   - Fwd Time: 0.33 sec
   - Bwd Time: 1.04 sec
   - Optim Time: 0.26 sec
   - GPU memory (per device): 58.75 GB
zero2 + tensor parallel  + w. flash-attn2:
   - Fwd Time: 0.143 sec
   - Bwd Time: 0.44 sec
   - Optim Time: 0.092 sec
   - GPU memory (per device): 16 GB

In the experiment, I found the inference speed of tensor parallel is almost 3 times faster than without tensor parallel, and even faster than ZeRO3 configuration.

Is there any wrong with this experiment result? In each experiment, I only modify the configuration of ShardConfig only.
And are there any recommended combinations if I want to fine-tune LLaMA?

Besides, I found that flash attn does not provide the performance improvement the original paper claimed. Is this a common phenomenon?

Answered by flybird11111

Aug 30, 2023

Shardformer implements tensor parallelism strategy specifically for transformer models. which effectively reduces communication costs. You can leverage the features provided by the Hybrid Parallel Plugin. https://github.com/hpcaitech/ColossalAI/blob/feature/shardformer/colossalai/booster/plugin/hybrid_parallel_plugin.py

View full answer

flybird11111 · 2023-08-28T13:53:37Z

flybird11111
Aug 28, 2023
Collaborator

Hi, thanks for this awesome work!

Currently, I am trying to benchmark the inference performance for LLaMAv2 with the tensor parallelism ShardFormer provided. After benchmarking, I get an unexpected result. Hope experts here can help me understand the results.

To benchmark the LLaMAv2, I modify the example script in ColossalAI/examples/language/gpt/gemini/train_gpt_demo.py for LLaMAv2 and ShardFormer.

The experiment settings are shown as follows:
GPU num: 4 (A100)
Batch size (per device): 4
Seq length: 1024
Displan: CAI_ZeRO2
In my imagination, with tensor parallelism, we have to pay more cost to communicate the activation between each GPU, which leads to slower than without tensor parallelism. But the tensor parallelism can provide significant GPU memory reduction for us with slight communication cost. However, the following are the benchmark results of my experiment:
zero2:
   - Fwd Time: 0.383 sec
   - Bwd Time: 1.19 sec
   - Optim Time: 0.37 sec
   - GPU memory (per device): 61.3 GB
zero2 + tensor parallel:
   - Fwd Time: 0.15 sec
   - Bwd Time: 0.484 sec
   - Optim Time: 0.09 sec
   - GPU memory (per device): 17 GB
zero2 + w. flash-attn2:
   - Fwd Time: 0.33 sec
   - Bwd Time: 1.04 sec
   - Optim Time: 0.26 sec
   - GPU memory (per device): 58.75 GB
zero2 + tensor parallel  + w. flash-attn2:
   - Fwd Time: 0.143 sec
   - Bwd Time: 0.44 sec
   - Optim Time: 0.092 sec
   - GPU memory (per device): 16 GB
In the experiment, I found the inference speed of tensor parallel is almost 3 times faster than without tensor parallel, and even faster than ZeRO3 configuration.

Is there any wrong with this experiment result? In each experiment, I only modify the configuration of ShardConfig only. And are there any recommended combinations if I want to fine-tune LLaMA?

Besides, I found that flash attn does not provide the performance improvement the original paper claimed. Is this a common phenomenon?

This is normal. ShardFormer splits the model, and due to the longer sequence length of LLaMA2, ShardFormer can effectively increase the inference speed.
We are currently benchmarking the performance of LLaMA2 using ColossalAI. You can stay updated on our progress.
In your use case, how much relative speedup did flash attn provide?

0 replies

eric8607242 · 2023-08-29T01:32:27Z

eric8607242
Aug 29, 2023
Author

Hi @flybird11111,
Thanks for your response! Help me a lot!

I have a further question.
In my benchmark experiment, the setting ZeRO2 + tensor parallelism is faster than the setting ZeRO3, and the GPU memory consumption is also less.

In my cognition, ZeRO3 and ZeRO2 + tensor parallelism are very similar, one is to pass parameters between GPUs, and the other is to pass activations between GPUs. Can you tell me why ZeRO2 + tensor parallelism can be almost three times faster than ZeRO3 in my case?
Is it because we do not have to communicate each activation for tensor parallelism, but we have to communicate each parameter for ZeRO3 when it is required to compute for each process?

And the last question, can we use the ZeRO3 and tensor parallelism at the same time? It always raises the exception when I use this setting in my configuration.

Can you recommend a good configuration combination (e.g., ZeRO3 or ZeRO2 + tensor parallelism) to me if I want to fine-tune a LLaMA-7B model?

In my case, flash attention only provides a 15% speed-up in forward pass and backward pass.

Look forward to your benchmark result!!!

4 replies

eric8607242 Aug 29, 2023
Author

If I increase the sequence length and the batch size, flash attn provide more significant performance improvement:

### batch_size (per device): 4, seq length: 2048
zero2 + tensor parallel:
   - Fwd Time: 0.34 sec
   - Bwd Time: 0.98 sec
   - Optim Time: 0.092 sec
   - GPU memory (per device): 23 GB

zero2 + tensor parallel + flash-attn 2:
   - Fwd Time: 0.27 sec
   - Bwd Time: 0.80 sec
   - Optim Time: 0.092 sec
   - GPU memory (per device): 20 GB

### batch_size (per device): 4, seq length: 4096
zero2 + tensor parallel:
   - Fwd Time: 0.847 sec
   - Bwd Time: 2.29 sec
   - Optim Time: 0.093 sec
   - GPU memory (per device): 35.6 GB

zero2 + tensor parallel + flash-attn 2:
   - Fwd Time: 0.529 sec
   - Bwd Time: 1.55 sec
   - Optim Time: 0.094 sec
   - GPU memory (per device): 28.3 GB


### batch_size (per device): 8, seq length: 4096
zero2 + tensor parallel:
   - Fwd Time: 1.71 sec
   - Bwd Time: 4.47 sec
   - Optim Time: 0.09 sec
   - GPU memory (per device): 57.1 GB

zero2 + tensor parallel + flash-attn 2:
   - Fwd Time: 1.02 sec
   - Bwd Time: 3.00 sec
   - Optim Time: 0.094 sec
   - GPU memory (per device): 43.1 GB

eric8607242 Aug 29, 2023
Author

A new performance benchmark result for ZeRO2 + tensor parallelism + flash attn 2 + fused normalization:

### batch_size (per device): 8, seq length: 4096
zero2 + tensor parallel + flash-attn 2 + fused normalization:
   - Fwd Time: 0.824 sec
   - Bwd Time: 2.36 sec
   - Optim Time: 0.094 sec
   - GPU memory (per device): 42.4 GB

flybird11111 Aug 30, 2023
Collaborator

Shardformer implements tensor parallelism strategy specifically for transformer models. which effectively reduces communication costs. You can leverage the features provided by the Hybrid Parallel Plugin. https://github.com/hpcaitech/ColossalAI/blob/feature/shardformer/colossalai/booster/plugin/hybrid_parallel_plugin.py

Answer selected by eric8607242

eric8607242 Aug 31, 2023
Author

Thanks for your answer! Help me a lot

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Question for the performance of tensor parallelism #4525

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 2 comments 4 replies

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

Question for the performance of tensor parallelism #4525

eric8607242 Aug 28, 2023

Replies: 2 comments · 4 replies

flybird11111 Aug 28, 2023 Collaborator

eric8607242 Aug 29, 2023 Author

eric8607242 Aug 29, 2023 Author

eric8607242 Aug 29, 2023 Author

flybird11111 Aug 30, 2023 Collaborator

eric8607242 Aug 31, 2023 Author

eric8607242
Aug 28, 2023

Replies: 2 comments 4 replies

flybird11111
Aug 28, 2023
Collaborator

eric8607242
Aug 29, 2023
Author

eric8607242 Aug 29, 2023
Author

eric8607242 Aug 29, 2023
Author

flybird11111 Aug 30, 2023
Collaborator

eric8607242 Aug 31, 2023
Author