Replies: 1 comment
Marking as stale. No activity in 60 days.
We hit an issue while testing Megatron-LM with a 6B model on 1K GPUs. In the per-iteration output, the gap between the min and max times reported for params-all-gather within a single iteration is very large. When this happens, the achieved TFLOPs are low.
After reading the code, we believe the issue is caused by the barrier in the timers (distrib_optimizer.py:step()).
To debug this, we added one line that prints the timestamp right after the call.
Comparing the timestamps printed by different ranks, we found that the maximum time difference between ranks is 200+ ms, which is far too large. Any clue about what could cause such a large difference would be highly appreciated. Many thanks in advance.
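For anyone who wants to repeat the cross-rank comparison, here is a minimal sketch of how the spread can be computed from the printed timestamps. The log format (`rank=<id> ts=<seconds>`) and the helper names `parse_timestamps` and `barrier_skew_ms` are assumptions made for illustration; they are not part of Megatron-LM, and the real debug line may print something different.

```python
import re
from typing import Dict, Tuple

# Hypothetical log format, one line per rank: "rank=<int> ts=<seconds as float>".
# Adjust the pattern to whatever the added debug print actually emits.
LINE_RE = re.compile(r"rank=(\d+)\s+ts=([0-9]+(?:\.[0-9]+)?)")


def parse_timestamps(log_text: str) -> Dict[int, float]:
    """Collect one timestamp (in seconds) per rank from the merged log text."""
    stamps: Dict[int, float] = {}
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            stamps[int(match.group(1))] = float(match.group(2))
    return stamps


def barrier_skew_ms(stamps: Dict[int, float]) -> Tuple[float, int, int]:
    """Return (max spread in ms, earliest rank, latest rank)."""
    if not stamps:
        raise ValueError("no timestamps found")
    first = min(stamps, key=stamps.get)  # rank that printed earliest
    last = max(stamps, key=stamps.get)   # rank that printed latest
    return (stamps[last] - stamps[first]) * 1000.0, first, last


if __name__ == "__main__":
    sample = "rank=0 ts=100.000\nrank=1 ts=100.050\nrank=2 ts=100.210\n"
    skew, first, last = barrier_skew_ms(parse_timestamps(sample))
    print(f"max skew {skew:.1f} ms (earliest rank {first}, latest rank {last})")
```

Knowing which ranks are earliest and latest may help narrow the problem down to specific nodes. Note that wall-clock timestamps from different hosts are only comparable if the hosts' clocks are synchronized, so part of a measured gap could be clock offset rather than real skew.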