Added MCore FSDP support for TE #1890
base: main
Conversation
Signed-off-by: Selvaraj Anandaraj <[email protected]>
for more information, see https://pre-commit.ci
Thank you! This is very beneficial for improving FSDP's performance. I think it would be best to make the memory allocation for main_grad an explicit function. What do you think?
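A minimal sketch of what such an explicit allocation helper could look like, assuming a dense FP32 accumulation buffer; the function name and placement are hypothetical and not part of this PR:

```python
import torch

def allocate_main_grad(param: torch.nn.Parameter) -> torch.Tensor:
    """Hypothetical explicit allocator for the wgrad accumulation buffer.

    Creates a dense FP32 main_grad tensor for the parameter if one does not
    already exist, and returns it.
    """
    if getattr(param, "main_grad", None) is None:
        param.main_grad = torch.zeros(
            param.shape, dtype=torch.float32, device=param.device
        )
    return param.main_grad
```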
Signed-off-by: Selvaraj Anandaraj <[email protected]>
Signed-off-by: Selvaraj Anandaraj <[email protected]>
@shjwudp I think I updated the code based on the recent refactor. Please check once.
Is `weight.main_grad` available during the backward pass, or do you need to go through `weight.get_main_grad()`? The op fuser API tries to access `main_grad` in the backward pass (TransformerEngine/transformer_engine/pytorch/ops/basic/basic_linear.py, lines 980 to 989 at 3a298e6):
    if ctx.weight_requires_grad and accumulate_into_main_grad:
        if not hasattr(self.weight, "main_grad"):
            raise RuntimeError(
                "BasicLinear op is configured with "
                "accumulate_into_main_grad=True, "
                "but weight parameter does not have main_grad attribute"
            )
        grad_weight = self.weight.main_grad.detach()
    else:
        accumulate_into_main_grad = False
@timmoon10 That's a good point. When we use TE with MCore FSDP, the main_grad buffers are lazily allocated during the backward pass. So if we are using the op fuser with FSDP, we need to allocate the buffer first and then use it.
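A sketch of how the quoted backward-pass access could tolerate lazily allocated buffers, assuming the FSDP-managed parameter exposes the `get_main_grad()` accessor mentioned above (which allocates on first use); this is illustrative and not necessarily the change made in this PR:

```python
import torch

def resolve_main_grad(weight: torch.nn.Parameter) -> torch.Tensor:
    """Return the wgrad accumulation buffer, triggering lazy allocation if possible.

    Hypothetical helper: prefer get_main_grad() (assumed to allocate the buffer
    on first access under MCore FSDP), otherwise fall back to a pre-attached
    main_grad tensor, mirroring the error handling in the quoted snippet.
    """
    if hasattr(weight, "get_main_grad"):
        return weight.get_main_grad()
    if getattr(weight, "main_grad", None) is None:
        raise RuntimeError(
            "accumulate_into_main_grad=True, but the weight has neither a "
            "get_main_grad() accessor nor a pre-allocated main_grad buffer"
        )
    return weight.main_grad
```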
Description
Added support for gradient accumulation fusion when using MCore FSDP with TE.
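For context, a minimal usage sketch of gradient accumulation fusion with a TE layer; it assumes `transformer_engine.pytorch` is installed, a CUDA device is available, and that `fuse_wgrad_accumulation` plus a `main_grad` buffer are the only pieces needed. Outside MCore FSDP the buffer is attached manually here, whereas FSDP allocates it lazily during backward:

```python
import torch
import transformer_engine.pytorch as te

# TE layer configured to accumulate wgrad directly into weight.main_grad
# instead of weight.grad.
layer = te.Linear(
    1024, 1024, params_dtype=torch.bfloat16, fuse_wgrad_accumulation=True
)

# Stand-in for what MCore FSDP would normally provide: a persistent FP32
# accumulation buffer attached to the parameter.
layer.weight.main_grad = torch.zeros_like(layer.weight, dtype=torch.float32)

inp = torch.randn(8, 1024, device="cuda", dtype=torch.bfloat16)
layer(inp).sum().backward()  # wgrad lands in layer.weight.main_grad
```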