Setup: 2D parallelism [tensor + data]
Model: assume a 1-layer GPT-style transformer block
1. Does the backward pass start with a big loss-gradient all-reduce during which the GPUs sit idle?
2. Inside the transformer block, Megatron-style tensor parallelism performs two all-reduces in the backward pass. Is there any chance of compute-communication overlap during these all-reduces? (See the sketch below for how I currently picture where they come from.)
3. Assuming an embedding layer, is there an all-to-all communication in the backward pass to update the embedding weights?
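For question 2, this is how I currently picture the source of those two backward all-reduces: a minimal sketch of the Megatron-style `f`/`g` conjugate operators written against `torch.distributed`. This is just my mental model, not Megatron-LM's actual code; the class names are mine, and it assumes an already-initialized tensor-parallel process group.

```python
import torch
import torch.distributed as dist


class CopyToTPRegion(torch.autograd.Function):
    """Megatron's `f`: identity in the forward pass, all-reduce in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        return x

    @staticmethod
    def backward(ctx, grad_output):
        # This fires once per sub-block (attention and MLP) as the gradient
        # flows back out through `f`, giving the two backward all-reduces per layer.
        dist.all_reduce(grad_output)
        return grad_output


class ReduceFromTPRegion(torch.autograd.Function):
    """Megatron's `g`: all-reduce in the forward pass, identity in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        dist.all_reduce(x)
        return x

    @staticmethod
    def backward(ctx, grad_output):
        # No communication on the way back through `g`.
        return grad_output
```

My question is essentially whether anything useful can run on the GPU while those backward `dist.all_reduce` calls are in flight, since the next sub-block's backward seems to depend on their result.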
I'm trying to model these steps for a project and explain them to a larger group with a timing diagram, hence the questions above. I want to make sure I present the steps accurately.
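Concretely, this is the rough event ordering I'm planning to draw for one data-parallel rank, written out as a plain-Python table so the open questions stay explicit. The communication column is my guess, not something I've verified against Megatron-LM:

```python
# One data-parallel rank's backward pass, single transformer layer.
# (step, communication I currently expect, open question / assumption)
backward_timeline = [
    ("loss -> logits gradient",     "local compute only?",            "Q1: or a big all-reduce with idle GPUs?"),
    ("transformer block backward",  "2x tensor-parallel all-reduce",  "Q2: any compute-communication overlap?"),
    ("embedding weight gradient",   "all-to-all? all-reduce?",        "Q3: which collective is it?"),
    ("data-parallel gradient sync", "bucketed DP all-reduce",         "assumed to overlap with backward compute"),
]

for step, comm, note in backward_timeline:
    print(f"{step:<30} {comm:<34} {note}")
```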
Thanks for your time!