[QUESTION] Gradient Propagation in backward pass #1312

Open
arul-lm opened this issue Dec 5, 2024 · 0 comments

Comments

arul-lm commented Dec 5, 2024

Setup: 2D parallelism (tensor + data)
Model: assume a one-layer GPT-style transformer block

  1. Does the backward pass start with one big all-reduce of the loss gradient, during which the GPUs sit idle? (rough sketch 1 below)
    [attached image]

  2. Inside the transformer block, with Megatron-style tensor parallelism, there are two all-reduces in the backward pass. Is there any chance of compute-communication overlap during these all-reduces? (sketch 2 below)

  3. For an embedding layer, does the backward pass involve an all-to-all communication to update the embedding weights? (sketch 3 below)
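
To make question 1 concrete, here is a rough sketch of my mental model, not any library's actual internals. Variant A is the "one big blocking all-reduce after backward" picture where the GPUs idle during communication; variant B launches per-parameter async all-reduces so communication overlaps with the remaining backward compute (roughly what I understand DDP's bucketing to achieve). `model`, `finish`, and the function names are placeholders, and it assumes `torch.distributed` is already initialized.

```python
# Sketch 1 -- a rough mental model, not any library's real internals.
# Assumes torch.distributed is already initialized; variant B needs a
# recent PyTorch for register_post_accumulate_grad_hook.
import torch
import torch.distributed as dist


def allreduce_after_backward(model: torch.nn.Module, world_size: int):
    """Variant A: one blocking pass over all gradients after backward has
    finished. While this loop runs the GPUs do no useful compute -- the
    idle window question 1 asks about."""
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad.div_(world_size)


def overlap_allreduce_with_backward(model: torch.nn.Module, world_size: int):
    """Variant B: launch an async all-reduce per gradient as soon as it is
    produced, so communication for later layers overlaps with backward
    compute for earlier layers (roughly what DDP's bucketing does)."""
    pending = []

    def hook(param):
        # Fires once param.grad has been accumulated for this backward pass.
        work = dist.all_reduce(param.grad, op=dist.ReduceOp.SUM, async_op=True)
        pending.append((work, param))

    for p in model.parameters():
        if p.requires_grad:
            p.register_post_accumulate_grad_hook(hook)

    def finish():
        # Call after loss.backward(): wait for the outstanding all-reduces,
        # then average the gradients.
        for work, p in pending:
            work.wait()
            p.grad.div_(world_size)
        pending.clear()

    return finish
```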
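
For question 2, my approximation of the Megatron-style "f" operation (identity in forward, all-reduce in backward) is sketched below; the two backward all-reduces per block would come from the copies of this at the attention and MLP inputs. This is not Megatron's actual code, and `tp_group` is a placeholder for the tensor-parallel process group.

```python
# Sketch 2 -- my approximation of the Megatron-style "f" operation,
# not Megatron's actual implementation.
import torch
import torch.distributed as dist


class CopyToTensorParallelRegion(torch.autograd.Function):
    """Forward: identity. Backward: all-reduce of the activation gradient
    across tensor-parallel ranks. One of these sits at the input of the
    attention block and one at the input of the MLP, which is where the
    two backward all-reduces per transformer block come from."""

    @staticmethod
    def forward(ctx, x, tp_group):
        ctx.tp_group = tp_group
        return x  # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        # Blocking all-reduce of the incoming gradient over TP ranks.
        # Overlap is only possible with independent work that can run
        # concurrently, e.g. weight-gradient GEMMs that do not depend
        # on this result.
        grad = grad_output.clone()
        dist.all_reduce(grad, op=dist.ReduceOp.SUM, group=ctx.tp_group)
        return grad, None
```

In forward code this would be applied as `hidden = CopyToTensorParallelRegion.apply(hidden, tp_group)` before the column-parallel linear, if my picture is right.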
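
For question 3, here is how I currently picture the embedding backward under plain data parallelism: the weight gradient is produced locally by a scatter-add and then synchronized with the same all-reduce as every other parameter, with no all-to-all. This is only a sketch assuming a dense (`sparse=False`) `nn.Embedding` and an initialized `torch.distributed`; please correct me if the real flow differs.

```python
# Sketch 3 -- my current picture of the embedding backward under plain
# data parallelism: no all-to-all, just the usual gradient all-reduce.
import torch
import torch.distributed as dist


def embedding_backward_step(emb: torch.nn.Embedding, token_ids, grad_out, world_size):
    out = emb(token_ids)      # forward: gather rows of emb.weight
    out.backward(grad_out)    # backward: scatter-add grad_out rows into
                              # emb.weight.grad on this rank only

    # Data-parallel sync: plain all-reduce of the weight gradient,
    # identical to every other parameter in the model.
    dist.all_reduce(emb.weight.grad, op=dist.ReduceOp.SUM)
    emb.weight.grad.div_(world_size)
```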

I'm trying to model these steps with a timing diagram so I can explain them to a larger group, hence the questions above. I want to make sure that I present the steps involved accurately.

Thanks for your time!
