Thanks for your work!
I've run into a problem: with sp=2 the training speed is 60 s/iter, but it is 30 s/iter without sequence parallelism.
Is this expected in your experiments? Looking forward to your reply!
sp indeed introduces some communication overhead, but it typically speeds up training because the per-GPU computation is lower. Please profile the communication speed. For example, does sp trigger inter-node communication (you do not want inter-node communication with sp < 8)? Does it use the NCCL backend (and the IB network)? A quick check is sketched below.
To compare speeds across runs, also make sure the world sizes are the same; otherwise, optimizer sharding will introduce extra overhead.
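Here is a minimal sketch of the kind of profiling meant above, not code from this repo: it assumes `torch.distributed` is already initialized, and `sp_group` is a placeholder for your sequence-parallel process group (pass `None` to use the default group). It reports the group's backend and times an all-gather to get a rough bandwidth figure; launching with `NCCL_DEBUG=INFO` additionally shows which transports (IB, NVLink, socket) NCCL selects.

```python
# Minimal sketch, assuming torch.distributed is initialized and `sp_group`
# is your sequence-parallel process group (None = default/world group).
import time
import torch
import torch.distributed as dist


def profile_sp_comm(sp_group=None, numel=64 * 1024 * 1024, iters=10):
    """Rough all-gather bandwidth check for the given process group."""
    backend = dist.get_backend(sp_group)   # should report "nccl" for GPU training
    rank = dist.get_rank(sp_group)
    world = dist.get_world_size(sp_group)

    x = torch.randn(numel, device="cuda")
    out = [torch.empty_like(x) for _ in range(world)]

    for _ in range(3):                     # warm-up
        dist.all_gather(out, x, group=sp_group)
    torch.cuda.synchronize()

    t0 = time.time()
    for _ in range(iters):
        dist.all_gather(out, x, group=sp_group)
    torch.cuda.synchronize()
    dt = (time.time() - t0) / iters

    if rank == 0:
        moved_gb = numel * 4 * (world - 1) / 1e9   # fp32 bytes received per rank
        print(f"backend={backend} group_size={world} "
              f"all_gather={dt * 1e3:.1f} ms  ~{moved_gb / dt:.1f} GB/s")
```

If this reports socket-level bandwidth or shows the sp group spanning nodes, that would explain the 60 s/iter.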