Added MCore FSDP support for TE #1890


Open · wants to merge 5 commits into main

Conversation

sanandaraj5597
Contributor

Description

Added support for gradient accumulation fusion when using MCore FSDP with TE.
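
For background, gradient accumulation fusion means the weight-gradient GEMM in the backward pass accumulates directly into a persistent FP32 main_grad buffer instead of materializing a fresh param.grad each microbatch. A minimal sketch of the idea, with hypothetical shapes and without the actual TE/MCore kernels:

```python
import torch

# Persistent FP32 accumulation buffer attached to the parameter,
# in the style of Megatron-Core's distributed optimizer.
weight = torch.nn.Parameter(torch.randn(16, 8, dtype=torch.bfloat16))
weight.main_grad = torch.zeros(weight.shape, dtype=torch.float32)

def backward_weight_grad(grad_output: torch.Tensor, inp: torch.Tensor) -> None:
    # Fused accumulation: the wgrad GEMM result is added into
    # main_grad in place, so no separate .grad tensor is allocated.
    weight.main_grad.add_(grad_output.t().float() @ inp.float())

# Two microbatches accumulate into the same buffer.
for _ in range(2):
    x = torch.randn(4, 8, dtype=torch.bfloat16)
    gy = torch.randn(4, 16, dtype=torch.bfloat16)
    backward_weight_grad(gy, x)
```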

Selvaraj Anandaraj and others added 2 commits June 17, 2025 22:45
@shjwudp
Contributor

shjwudp commented Jun 18, 2025

Thank you! This is very beneficial for improving FSDP's performance. I think it would be best to make the memory allocation for main_grad an explicit function. What do you think?
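
For illustration, such an explicit allocation function could look like the sketch below; the helper name and layout are hypothetical, not an existing MCore FSDP API:

```python
import torch

def allocate_main_grad(weight: torch.nn.Parameter) -> torch.Tensor:
    # Hypothetical explicit allocator: create the FP32 main_grad
    # buffer on first use instead of allocating it implicitly
    # inside a backward hook.
    if getattr(weight, "main_grad", None) is None:
        weight.main_grad = torch.zeros(
            weight.shape, dtype=torch.float32, device=weight.device
        )
    return weight.main_grad
```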

Selvaraj Anandaraj added 2 commits June 18, 2025 10:42
Signed-off-by: Selvaraj Anandaraj <[email protected]>
Signed-off-by: Selvaraj Anandaraj <[email protected]>
@sanandaraj5597
Contributor Author

@shjwudp I've updated the code based on the recent refactor. Please take a look.

@timmoon10
Collaborator

timmoon10 left a comment


Is weight.main_grad available during the backward pass or do you need to go through weight.get_main_grad()? The op fuser API tries to access main_grad in the backward pass:

if ctx.weight_requires_grad and accumulate_into_main_grad:
    if not hasattr(self.weight, "main_grad"):
        raise RuntimeError(
            "BasicLinear op is configured with "
            "accumulate_into_main_grad=True, "
            "but weight parameter does not have main_grad attribute"
        )
    grad_weight = self.weight.main_grad.detach()
else:
    accumulate_into_main_grad = False

@sanandaraj5597
Contributor Author

@timmoon10 That's a good point. When we use TE with MCore FSDP, the main_grad buffers are lazily allocated during the backward pass. So, if we are using the fuser with FSDP, we need to allocate the buffer first and then use it.
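
A hedged sketch of what the adjusted backward check could look like, going through the get_main_grad() accessor mentioned above to trigger the lazy allocation (exact names and call sites may differ from the actual patch):

```python
if ctx.weight_requires_grad and accumulate_into_main_grad:
    if hasattr(self.weight, "get_main_grad"):
        # MCore FSDP allocates main_grad lazily during backward,
        # so go through the accessor to trigger the allocation.
        grad_weight = self.weight.get_main_grad().detach()
    elif hasattr(self.weight, "main_grad"):
        grad_weight = self.weight.main_grad.detach()
    else:
        raise RuntimeError(
            "BasicLinear op is configured with "
            "accumulate_into_main_grad=True, "
            "but weight parameter has neither a main_grad "
            "attribute nor a get_main_grad method"
        )
else:
    accumulate_into_main_grad = False
```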
