[BUG] selective activation recompute only decrease little of GPU memory usage during training #1253

bugm · 2024-10-18T03:52:17Z

Describe the bug
According to the paper https://arxiv.org/abs/2205.05198, the activation memory per layer of a transformer-based model can be calculated as

$$sbh\left(10 + \frac{24}{t} + \frac{5as}{ht}\right)$$

and with selective activation recompute it decreases to

$$sbh\left(10 + \frac{24}{t}\right)$$

With t = 1 these reduce to $sbh\,(34 + 5as/h)$ and $34\,sbh$. With my training settings (b = 12, s = 1024, h = 1024, L = 20, a = 16), the term 5as/h equals 80.

With tp = 1 and pp = 1, I expected that with --recompute-activations the GPU memory used for storing activations would be only about 34 / (34 + 80) ≈ 30% of that without activation recompute.
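As a rough check of the ratio quoted above, here is a minimal sketch that evaluates the per-layer formula sbh(10 + 24/t + 5as/(ht)) from the paper. The function name `activation_bytes_per_layer` is hypothetical and chosen only for illustration; the sketch assumes 16-bit activations and no sequence parallelism, as the formula does.

```python
def activation_bytes_per_layer(s, b, h, a, t=1, selective=False):
    """Per-layer activation memory in bytes: s*b*h*(10 + 24/t + 5*a*s/(h*t)).

    With selective=True the attention term 5*a*s/(h*t) is dropped, because
    those tensors are recomputed in the backward pass instead of stored.
    """
    linear = 10 + 24 / t
    quadratic = 0 if selective else 5 * a * s / (h * t)
    return s * b * h * (linear + quadratic)


# Settings from the post: b = 12, s = 1024, h = 1024, a = 16, t = 1.
baseline = activation_bytes_per_layer(s=1024, b=12, h=1024, a=16)
selective = activation_bytes_per_layer(s=1024, b=12, h=1024, a=16, selective=True)
ratio = selective / baseline

print(f"baseline:  {baseline / 1e9:.2f} GB per layer")
print(f"selective: {selective / 1e9:.2f} GB per layer")
print(f"ratio:     {ratio:.1%}")
```

The ratio depends only on a, s, h and t, so it is the same for every layer and for the model as a whole.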

Comparing GPU memory usage with and without --recompute-activations, I noticed that the max memory allocated during training only decreased from 25.52 GB (without --recompute-activations) to 24.94 GB (with --recompute-activations).

Expected behavior
The max_memory allocated during training should decrease more.

Environment (please complete the following information):

Megatron-LM commit ID
PyTorch 2.4.1
CUDA version 12.5
NCCL version 2.20.5

Additional context
According to the formula above, with b = 12, s = 1024, h = 1024, L = 20, a = 16 and t = 1, the original activation memory should be around 32 GB. Adding the memory for model states, which is about 7.3 GB for a 0.43B-parameter model, the total should be around 40 GB, even without taking temporary buffers and unusable fragmented memory into account. That is much larger than the max memory allocated without activation recompute, so I wonder whether Megatron-LM has done some optimization here.
And why does the max memory allocated change so little with or without --recompute-activations (which uses selective activation recompute by default, according to the docs)?
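The whole-model estimate can be sketched the same way by multiplying the per-layer formula by the number of layers. This is only an approximation under stated assumptions: it counts the L transformer layers alone (no embeddings, final layer norm, logits or loss), so it will not exactly match the figure quoted in the post. The 7.3 GB model-state value is taken from the post as is, and the helper name `total_activation_gb` is hypothetical.

```python
def total_activation_gb(s, b, h, a, layers, t=1, selective=False):
    """Activation memory of all transformer layers, in GB (1 GB = 1e9 bytes)."""
    per_layer_coeff = 10 + 24 / t
    if not selective:
        per_layer_coeff += 5 * a * s / (h * t)
    return layers * s * b * h * per_layer_coeff / 1e9


MODEL_STATES_GB = 7.3  # value quoted in the post for a 0.43B-parameter model

baseline_total_gb = total_activation_gb(s=1024, b=12, h=1024, a=16, layers=20)
selective_total_gb = total_activation_gb(
    s=1024, b=12, h=1024, a=16, layers=20, selective=True
)

print(f"activations, no recompute:        {baseline_total_gb:.2f} GB")
print(f"activations, selective recompute: {selective_total_gb:.2f} GB")
print(f"no recompute + model states:      {baseline_total_gb + MODEL_STATES_GB:.2f} GB")
```

Under these assumptions the estimate without recompute is still well above the 25.52 GB that was measured, which is the discrepancy the question is about.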

Answered by jpy794

Nov 15, 2024

elliottnv · 2024-10-23T21:26:39Z

Collaborator

Thanks for sharing the issue. We currently do not have bandwidth to investigate more, and we will rely on the community to explore more on the issue.

wplf · 2024-11-06T08:13:24Z

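Selective activation recompute only recomputes the attention part of each layer.
If you want to save more memory, you can try full recompute (--recompute-granularity full, together with --recompute-method and --recompute-num-layers), which recomputes the whole transformer layer.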
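To put the three options side by side, here is a small sketch of the per-layer coefficient of sbh for each mode, using the values given in the paper linked in the original post for t = 1: 34 + 5as/h with no recompute, 34 with selective recompute, and 2 with full recompute (only the layer input is kept). It assumes every layer is checkpointed in full mode; the function name `sbh_coefficient` is hypothetical.

```python
def sbh_coefficient(mode, a, s, h):
    """Multiplier of s*b*h bytes stored per layer, for t = 1 (no tensor parallelism)."""
    if mode == "none":
        return 34 + 5 * a * s / h
    if mode == "selective":
        return 34
    if mode == "full":
        return 2  # only the 16-bit input of each layer is stored
    raise ValueError(f"unknown mode: {mode!r}")


# Settings from the original post: a = 16, s = 1024, h = 1024.
coefficients = {
    mode: sbh_coefficient(mode, a=16, s=1024, h=1024)
    for mode in ("none", "selective", "full")
}

for mode, coeff in coefficients.items():
    share = coeff / coefficients["none"]
    print(f"{mode:<10} {coeff:>6.1f} * sbh  ({share:.1%} of no recompute)")
```

Full recompute trades this memory for roughly one extra forward pass per layer during the backward pass.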

jpy794 · 2024-11-15T11:14:14Z

--recompute-activations doesn't make any difference here because Megatron-LM uses the TransformerEngine backend by default, and fused attention kernels like flash-attention always do selective activation checkpointing: they recompute the attention scores in the backward pass instead of storing them. If you set --transformer-impl local, --recompute-activations should behave as described in the paper. Maybe we should document this for clarity?
https://github.com/Dao-AILab/flash-attention/blob/641db759ab7168e472909bc9ff1eda4a329de34f/flash_attn/flash_attn_interface.py#L915
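The argument above can be illustrated with a short sketch. In the paper's accounting, the 5as²b part of the per-layer formula comes from three s × s attention tensors (softmax output, dropout mask, dropout output), and these are exactly what selective recompute avoids storing. The sketch assumes a fused attention kernel stores none of them and ignores the small per-row softmax statistics such a kernel keeps instead; the function names are hypothetical.

```python
# Bytes of s x s attention tensors per layer, in units of a*s*s*b
# (16-bit softmax and dropout outputs, 1-byte dropout mask).
QUADRATIC_TERMS = {"softmax_output": 2, "dropout_mask": 1, "dropout_output": 2}


def stored_bytes_per_layer(s, b, h, a, fused_attention, selective_recompute):
    """Approximate per-layer activation bytes for t = 1."""
    linear = 34 * s * b * h
    if fused_attention or selective_recompute:
        # The s x s tensors are never kept: either the fused kernel or the
        # checkpointing wrapper recomputes them in the backward pass.
        return linear
    return linear + sum(QUADRATIC_TERMS.values()) * a * s * s * b


def saving_from_selective(s, b, h, a, fused_attention):
    """Fraction of activation memory saved by turning selective recompute on."""
    before = stored_bytes_per_layer(s, b, h, a, fused_attention, False)
    after = stored_bytes_per_layer(s, b, h, a, fused_attention, True)
    return 1 - after / before


for fused in (False, True):
    saving = saving_from_selective(s=1024, b=12, h=1024, a=16, fused_attention=fused)
    print(f"fused_attention={fused}: selective recompute saves {saving:.1%}")
```

Under this simplified model, enabling selective recompute on top of a fused attention kernel leaves the stored activations unchanged, which is consistent with the very small difference reported in the original post.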

8 replies

eliird Dec 22, 2024

From the comment above and from reading the flash attention paper again, my understanding is that flash attention should free an amount of memory equivalent to that of selective recompute. The figure in the original paper shows that full recompute should reduce memory usage further, by quite a large factor.

I am running a llama3-8b model with dp=1, tp=8, pp=1 and batch size 1 on 8xH100. All I see is a change in memory from 21 GB to 20.7 GB per GPU, which is quite different from what the paper reports, and I am trying to understand why.
