Thanks for your work!
I've run into a problem: with sp=2 the training speed is 60 s/iter, but it is 30 s/iter without sequence parallelism.
Is this expected in your experiments? Looking forward to your reply!
sp indeed introduces some communication overhead, but it typically speeds up training because the per-GPU computation is lower. Please profile the communication speed. For example, does sp trigger inter-node communication (you do not want inter-node communication with sp < 8)? Does it use the NCCL backend (and the IB network)? A quick check is sketched below.
To compare speeds across runs, also make sure the world sizes are the same; otherwise, optimizer sharding will introduce extra overhead.
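Here is a minimal sketch of the kind of profiling meant above, not code from this repo: it assumes `torch.distributed` is already initialized, and `sp_group` is a placeholder for your sequence-parallel process group (pass `None` to use the default group). It reports the group's backend and times an all-gather to get a rough bandwidth figure; launching with `NCCL_DEBUG=INFO` additionally shows which transports (IB, NVLink, socket) NCCL selects.

```python
# Minimal sketch, assuming torch.distributed is initialized and `sp_group`
# is your sequence-parallel process group (None = default/world group).
import time
import torch
import torch.distributed as dist


def profile_sp_comm(sp_group=None, numel=64 * 1024 * 1024, iters=10):
    """Rough all-gather bandwidth check for the given process group."""
    backend = dist.get_backend(sp_group)   # should report "nccl" for GPU training
    rank = dist.get_rank(sp_group)
    world = dist.get_world_size(sp_group)

    x = torch.randn(numel, device="cuda")
    out = [torch.empty_like(x) for _ in range(world)]

    for _ in range(3):                     # warm-up
        dist.all_gather(out, x, group=sp_group)
    torch.cuda.synchronize()

    t0 = time.time()
    for _ in range(iters):
        dist.all_gather(out, x, group=sp_group)
    torch.cuda.synchronize()
    dt = (time.time() - t0) / iters

    if rank == 0:
        moved_gb = numel * 4 * (world - 1) / 1e9   # fp32 bytes received per rank
        print(f"backend={backend} group_size={world} "
              f"all_gather={dt * 1e3:.1f} ms  ~{moved_gb / dt:.1f} GB/s")
```

If this reports socket-level bandwidth or shows the sp group spanning nodes, that would explain the 60 s/iter.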