[BUG] Selective activation recompute only slightly decreases GPU memory usage during training #1253

Answered by jpy794
bugm asked this question in Q&A

--recompute-activations makes little difference here because Megatron-LM uses the TransformerEngine backend by default, and fused attention kernels such as flash-attention always do selective activation checkpointing themselves: they recompute the attention intermediates in the backward pass instead of storing them, so there is nothing left for the flag to free. If you set --transformer-impl local, --recompute-activations should behave as described in the paper. Maybe we should document this for clarity?
https://github.com/Dao-AILab/flash-attention/blob/641db759ab7168e472909bc9ff1eda4a329de34f/flash_attn/flash_attn_interface.py#L915
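
To put rough numbers on this, here is a back-of-the-envelope sketch, not Megatron-LM code. It assumes the per-layer activation estimate from the selective recompute paper, s*b*h*(34 + 5*a*s/h) bytes, with 16-bit activations and no tensor or sequence parallelism. The function name and the example sizes are made up for illustration. The 5*a*s/h term is the attention part that selective recompute drops; a fused attention kernel already avoids storing most of it, which would explain why the flag changes so little with the default backend.

```python
def activation_bytes_per_layer(s, b, h, a, selective_recompute=False):
    """Approximate activation memory (bytes) for one transformer layer.

    Assumptions (not taken from Megatron-LM itself): 16-bit activations,
    no tensor/sequence parallelism, unfused attention.
    s = sequence length, b = micro-batch size,
    h = hidden size, a = number of attention heads.
    """
    # Linear layers, layer norms, dropout masks, etc.: 34 * s * b * h
    linear_term = 34 * s * b * h
    # Attention scores, softmax output and dropout mask: 5 * a * s^2 * b
    attention_term = 5 * a * s * s * b
    if selective_recompute:
        # Selective recompute discards the attention term and
        # recomputes it during the backward pass.
        return linear_term
    return linear_term + attention_term


# Hypothetical model sizes, chosen only for illustration.
full = activation_bytes_per_layer(s=2048, b=1, h=4096, a=32)
selective = activation_bytes_per_layer(s=2048, b=1, h=4096, a=32,
                                       selective_recompute=True)
print(f"full: {full / 2**20:.0f} MiB, selective: {selective / 2**20:.0f} MiB")
```

With the local (unfused) implementation the gap between the two numbers is what --recompute-activations is expected to recover. With a fused kernel the baseline is already close to the second number, so the measured saving should be small.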

Answer selected by bugm

This discussion was converted from issue #1225 on October 23, 2024 21:26.