Setup: 2D parallelism [tensor + data]
Model: assume a 1-layer GPT-style transformer block
1. Does the backward pass start with a big loss-gradient all-reduce during which the GPUs sit idle?
2. Inside the transformer block, Megatron-style tensor parallelism performs two all-reduces in the backward pass. Is there any chance of compute-communication overlap during these all-reduces? (See the sketch below for how I currently picture where they come from.)
3. Assuming an embedding layer, is there an all-to-all communication in the backward pass to update the embedding weights?
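For question 2, this is how I currently picture the source of those two backward all-reduces: a minimal sketch of the Megatron-style `f`/`g` conjugate operators written against `torch.distributed`. This is just my mental model, not Megatron-LM's actual code; the class names are mine, and it assumes an already-initialized tensor-parallel process group.

```python
import torch
import torch.distributed as dist


class CopyToTPRegion(torch.autograd.Function):
    """Megatron's `f`: identity in the forward pass, all-reduce in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        return x

    @staticmethod
    def backward(ctx, grad_output):
        # This fires once per sub-block (attention and MLP) as the gradient
        # flows back out through `f`, giving the two backward all-reduces per layer.
        dist.all_reduce(grad_output)
        return grad_output


class ReduceFromTPRegion(torch.autograd.Function):
    """Megatron's `g`: all-reduce in the forward pass, identity in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        dist.all_reduce(x)
        return x

    @staticmethod
    def backward(ctx, grad_output):
        # No communication on the way back through `g`.
        return grad_output
```

My question is essentially whether anything useful can run on the GPU while those backward `dist.all_reduce` calls are in flight, since the next sub-block's backward seems to depend on their result.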
I'm trying to model these steps for a project and explain them to a larger group with a timing diagram, hence the questions above. I want to make sure I present the steps accurately.
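Concretely, this is the rough event ordering I'm planning to draw for one data-parallel rank, written out as a plain-Python table so the open questions stay explicit. The communication column is my guess, not something I've verified against Megatron-LM:

```python
# One data-parallel rank's backward pass, single transformer layer.
# (step, communication I currently expect, open question / assumption)
backward_timeline = [
    ("loss -> logits gradient",     "local compute only?",            "Q1: or a big all-reduce with idle GPUs?"),
    ("transformer block backward",  "2x tensor-parallel all-reduce",  "Q2: any compute-communication overlap?"),
    ("embedding weight gradient",   "all-to-all? all-reduce?",        "Q3: which collective is it?"),
    ("data-parallel gradient sync", "bucketed DP all-reduce",         "assumed to overlap with backward compute"),
]

for step, comm, note in backward_timeline:
    print(f"{step:<30} {comm:<34} {note}")
```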
Thanks for your time!