Added MCore FSDP support for TE #1890
base: main
Conversation
Signed-off-by: Selvaraj Anandaraj <[email protected]>
for more information, see https://pre-commit.ci
Thank you! This is very beneficial for improving FSDP's performance. I think it would be best to make the memory allocation for main_grad an explicit function. What do you think?
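A minimal sketch of what such an explicit allocation helper could look like, assuming a dense FP32 accumulation buffer; the function name and placement are hypothetical and not part of this PR:

```python
import torch

def allocate_main_grad(param: torch.nn.Parameter) -> torch.Tensor:
    """Hypothetical explicit allocator for the wgrad accumulation buffer.

    Creates a dense FP32 main_grad tensor for the parameter if one does not
    already exist, and returns it.
    """
    if getattr(param, "main_grad", None) is None:
        param.main_grad = torch.zeros(
            param.shape, dtype=torch.float32, device=param.device
        )
    return param.main_grad
```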
Signed-off-by: Selvaraj Anandaraj <[email protected]>
Signed-off-by: Selvaraj Anandaraj <[email protected]>
@shjwudp I think I updated the code based on the recent refactor. Please check once.
Is `weight.main_grad` available during the backward pass, or do you need to go through `weight.get_main_grad()`? The op fuser API tries to access `main_grad` in the backward pass (TransformerEngine/transformer_engine/pytorch/ops/basic/basic_linear.py, lines 980 to 989 at 3a298e6):
    if ctx.weight_requires_grad and accumulate_into_main_grad:
        if not hasattr(self.weight, "main_grad"):
            raise RuntimeError(
                "BasicLinear op is configured with "
                "accumulate_into_main_grad=True, "
                "but weight parameter does not have main_grad attribute"
            )
        grad_weight = self.weight.main_grad.detach()
    else:
        accumulate_into_main_grad = False
@timmoon10 That's a good point. When we use TE with MCore FSDP, the main_grad buffers are lazily allocated during the backward pass. So if we are using the op fuser with FSDP, we need to allocate the buffer first and then use it.
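A sketch of how the quoted backward-pass access could tolerate lazily allocated buffers, assuming the FSDP-managed parameter exposes the `get_main_grad()` accessor mentioned above (which allocates on first use); this is illustrative and not necessarily the change made in this PR:

```python
import torch

def resolve_main_grad(weight: torch.nn.Parameter) -> torch.Tensor:
    """Return the wgrad accumulation buffer, triggering lazy allocation if possible.

    Hypothetical helper: prefer get_main_grad() (assumed to allocate the buffer
    on first access under MCore FSDP), otherwise fall back to a pre-attached
    main_grad tensor, mirroring the error handling in the quoted snippet.
    """
    if hasattr(weight, "get_main_grad"):
        return weight.get_main_grad()
    if getattr(weight, "main_grad", None) is None:
        raise RuntimeError(
            "accumulate_into_main_grad=True, but the weight has neither a "
            "get_main_grad() accessor nor a pre-allocated main_grad buffer"
        )
    return weight.main_grad
```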
Description
Added support for gradient accumulation fusion when using MCore FSDP with TE.
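For context, a minimal usage sketch of gradient accumulation fusion with a TE layer; it assumes `transformer_engine.pytorch` is installed, a CUDA device is available, and that `fuse_wgrad_accumulation` plus a `main_grad` buffer are the only pieces needed. Outside MCore FSDP the buffer is attached manually here, whereas FSDP allocates it lazily during backward:

```python
import torch
import transformer_engine.pytorch as te

# TE layer configured to accumulate wgrad directly into weight.main_grad
# instead of weight.grad.
layer = te.Linear(
    1024, 1024, params_dtype=torch.bfloat16, fuse_wgrad_accumulation=True
)

# Stand-in for what MCore FSDP would normally provide: a persistent FP32
# accumulation buffer attached to the parameter.
layer.weight.main_grad = torch.zeros_like(layer.weight, dtype=torch.float32)

inp = torch.randn(8, 1024, device="cuda", dtype=torch.bfloat16)
layer(inp).sum().backward()  # wgrad lands in layer.weight.main_grad
```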