[QUESTION] Why take too much time to sync up barrier information between ranks #1175

yanminjia · 2024-03-20T15:02:37Z

We identified an issue while testing Megatron-LM with a 6B model on 1K GPUs. In the per-iteration log output, the difference between the min and max times reported for params-all-gather within a single iteration is very large. When this happens, the achieved TFLOPs is low.

From reading the code, we believe this issue is caused by the barrier in the timers, as shown below (distrib_optimizer.py:step()).

        timers('params-all-gather', log_level=1).start(barrier=args.barrier_with_L1_time)
        self._reset_metadata_and_sync_gather_all_model_params(force_sync=False)
        timers('params-all-gather').stop()
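To make the barrier behaviour concrete, here is a minimal, self-contained sketch that simulates ranks as Python threads and uses `threading.Barrier` as a stand-in for the collective barrier. It is not Megatron-LM code: the function name `simulate_barrier_exit` and the delay values are hypothetical, and a real multi-node barrier involves network communication that this toy does not model. It only illustrates that no rank leaves the barrier before the slowest rank has arrived, so ranks that arrive early spend the difference waiting.

```python
import threading
import time


def simulate_barrier_exit(arrival_delays_s):
    """Simulate ranks as threads that reach a barrier at different times.

    Returns (arrivals, exits): per-rank timestamps taken just before and
    just after the barrier, all read from one shared monotonic clock.
    """
    n = len(arrival_delays_s)
    barrier = threading.Barrier(n)
    arrivals = [0.0] * n
    exits = [0.0] * n

    def rank_fn(rank):
        # Stand-in for uneven work done before the timer is started.
        time.sleep(arrival_delays_s[rank])
        arrivals[rank] = time.perf_counter()
        # Stand-in for the barrier taken when the timer starts.
        barrier.wait()
        # Analogous to printing a timestamp right after the barrier.
        exits[rank] = time.perf_counter()

    threads = [threading.Thread(target=rank_fn, args=(r,)) for r in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return arrivals, exits


if __name__ == "__main__":
    # Hypothetical delays: rank 2 arrives about 50 ms after rank 0.
    arrivals, exits = simulate_barrier_exit([0.0, 0.02, 0.05])
    for rank, (a, e) in enumerate(zip(arrivals, exits)):
        print(f"rank {rank}: waited {(e - a) * 1000:.1f} ms at the barrier")
```

In this single-process toy all threads read the same clock, so the exit timestamps are directly comparable; that is not automatically true for ranks running on different hosts.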

To debug this issue, we added one line that prints a timestamp right after the timer's start() call returns:

        timers('params-all-gather', log_level=1).start(barrier=args.barrier_with_L1_time)
        print("go out of the barrier of timer: ", time.time())
        self._reset_metadata_and_sync_gather_all_model_params(force_sync=False)
        timers('params-all-gather').stop()

Comparing the timestamps printed by different ranks, we found that the maximum time difference between ranks is 200+ ms, which seems too large. Any clue about what causes this large time difference would be highly appreciated. Many thanks in advance.
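For reference, the comparison described above can be sketched as a small helper that takes the timestamp each rank printed and reports the spread. The function name `barrier_exit_skew_ms` and the sample values are hypothetical and are not taken from the actual run. One caveat worth keeping in mind: `time.time()` reads each host's wall clock, so when ranks run on different nodes the computed spread also includes any clock offset between those nodes, not only the difference in when they left the barrier.

```python
def barrier_exit_skew_ms(timestamps_by_rank):
    """Return (skew_ms, earliest_rank, latest_rank) for per-rank timestamps.

    timestamps_by_rank maps rank -> time.time() value (seconds) printed
    right after the timer's barrier.
    """
    if not timestamps_by_rank:
        raise ValueError("need at least one timestamp")
    earliest_rank = min(timestamps_by_rank, key=timestamps_by_rank.get)
    latest_rank = max(timestamps_by_rank, key=timestamps_by_rank.get)
    skew_s = timestamps_by_rank[latest_rank] - timestamps_by_rank[earliest_rank]
    return skew_s * 1000.0, earliest_rank, latest_rank


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    sample = {0: 100.000, 1: 100.050, 2: 100.210}
    skew_ms, first, last = barrier_exit_skew_ms(sample)
    print(f"skew {skew_ms:.1f} ms (earliest rank {first}, latest rank {last})")
```

Knowing which ranks are consistently earliest or latest can help tell a slow link or node apart from clock offset between hosts.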

github-actions[bot] · 2024-05-20T18:20:35Z

Marking as stale. No activity in 60 days.
