-
Describe the bug
With my training set, here is what I observed about GPU memory usage when comparing runs without and with --recompute-activations: the max memory allocated during training only decreased from 25.52 GB to 24.94 GB.

Expected behavior
I expected --recompute-activations to reduce peak memory by much more than that.

Environment (please complete the following information):

Additional context
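For reference, a minimal sketch (assuming the numbers come from PyTorch's CUDA allocator statistics; the original measurement method isn't shown) of how the peak figure can be read per run:

```python
import torch

# Reset the peak tracker, run the training step(s), then read the peak.
torch.cuda.reset_peak_memory_stats()
# ... run one or more training iterations here ...
peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
print(f"max memory allocated: {peak_gb:.2f} GB")
```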
-
Thanks for sharing the issue. We currently do not have the bandwidth to investigate further, and we will rely on the community to explore this issue in more depth.
-
Selective activation recomputation only recomputes the attention part.
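In other words, roughly the following — a minimal PyTorch sketch (not Megatron-LM's actual code) where only the core attention is wrapped in a checkpoint, so its large [batch, heads, seq, seq] intermediates are recomputed in the backward pass instead of being kept:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class SelectiveRecomputeAttention(nn.Module):
    """Attention block that recomputes only the core attention in backward."""

    def __init__(self, dim, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)   # activations of the linears are kept
        self.proj = nn.Linear(dim, dim)

    def _core_attention(self, q, k, v):
        # This is the part that produces the big [b, h, s, s] tensors.
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        return scores.softmax(dim=-1) @ v

    def forward(self, x):
        b, s, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(b, s, self.num_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        # Only the core attention is checkpointed: its intermediates are
        # dropped after the forward pass and recomputed during backward.
        ctx = checkpoint(self._core_attention, q, k, v, use_reentrant=False)
        return self.proj(ctx.transpose(1, 2).reshape(b, s, d))
```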
-
--recompute-activations doesn't make any difference here because by default Megatron-LM uses the TransformerEngine backend, and fused attention kernels like flash-attention always do selective activation checkpointing (the attention matrix is recomputed in the backward pass rather than stored). You can set --transformer-impl local, and --recompute-activations would then behave as described in the paper. Maybe we should document this for clarity?
https://github.com/Dao-AILab/flash-attention/blob/641db759ab7168e472909bc9ff1eda4a329de34f/flash_attn/flash_attn_interface.py#L915
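For illustration, a minimal sketch (assuming the flash-attn package and a CUDA device; not an excerpt from Megatron-LM) of why the fused kernel already behaves like selective checkpointing — its backward pass recomputes attention from q, k, v and the saved softmax statistics instead of storing the [seq, seq] attention matrix:

```python
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 2, 2048, 16, 64
q, k, v = (torch.randn(batch, seqlen, nheads, headdim, device="cuda",
                       dtype=torch.float16, requires_grad=True)
           for _ in range(3))

# Forward keeps only q, k, v, the output, and the softmax log-sum-exp.
out = flash_attn_func(q, k, v, causal=True)
# Backward recomputes the attention weights on the fly, so there is nothing
# extra for --recompute-activations to discard on this path.
out.sum().backward()
```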