Slow training speed when using sequence parallel #13

Open

kaixinbear opened this issue Dec 17, 2024 · 1 comment

Comments

@kaixinbear

Thanks for your work!
I encountered a problem: when I use sp=2, the training speed is 60 s/iter, but it is 30 s/iter when not using sequence parallel.
Is this normal in your experiments? Looking forward to your reply!

@flymin
Owner

flymin commented Dec 17, 2024

Sequence parallel (sp) does introduce some communication overhead, but it typically speeds up training because the per-GPU computation is lower. Please profile the communication speed. For example, does sp trigger inter-node communication (you do not want inter-node communication with sp<8)? Is it using the NCCL backend (and over an IB network)?
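
To check this quickly, here is a minimal standalone sketch (not part of this repo) that reports the active torch.distributed backend and roughly times an all-reduce; the tensor size and iteration count are arbitrary. It measures the default process group as a rough proxy; timing the actual sequence-parallel group would need the group handle from your training code. Launch it with torchrun, one process per GPU, and compare a single-node run against a multi-node run to see whether inter-node links are the bottleneck.

```python
import os
import time
import torch
import torch.distributed as dist

def main():
    # Should report "nccl"; falling back to "gloo" usually means something is misconfigured.
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    print(f"rank {rank}: backend = {dist.get_backend()}")

    # Time an all-reduce of ~256 MiB as a rough communication-speed check.
    x = torch.randn(64 * 1024 * 1024, device="cuda")
    dist.barrier()
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(10):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    elapsed = (time.time() - start) / 10
    size_gb = x.numel() * x.element_size() / 1e9
    print(f"rank {rank}: all_reduce of {size_gb:.2f} GB took {elapsed * 1000:.1f} ms")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```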

You can use this option to see more details:

record_time = cfg.get("record_time", False)
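
A hedged example of turning it on, assuming the repo reads this key from the experiment config (the exact config file layout is an assumption on my side):

```python
# In your experiment config:
record_time = True  # enables the extra per-iteration timing details mentioned above
```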

To compare the speed for different runs, please also make sure the world sizes are the same. Otherwise, the optimizer sharding will introduce overhead.
