
Memory Leakage with USP and Transformer Blocks #112

Open
baifanxxx opened this issue Dec 12, 2024 · 7 comments

@baifanxxx

baifanxxx commented Dec 12, 2024

Hi,

First of all, great work on the project! However, I’ve encountered an issue with memory release when using USP. Specifically, I’m using USP for end-to-end sequence parallelism outside the multi-layer Transformer blocks. After processing all Transformer blocks, the final output is gathered via all_gather. Here is a simplified version of the code:

def forward(self, hidden_states: torch.Tensor, grid_thw: torch.Tensor) -> torch.Tensor:
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # Each rank keeps only its local shard of the sequence (split along dim 0).
    local_hidden_states = hidden_states.chunk(world_size, dim=0)[rank].detach().clone()
    local_hidden_states = self.patch_embed(local_hidden_states)
    rotary_pos_emb = self.rot_pos_emb(grid_thw)
    local_rotary_pos_emb = rotary_pos_emb.chunk(world_size, dim=0)[rank].detach().clone()

    # Run all Transformer blocks on the local shard; USP handles attention internally.
    for blk in self.blocks:
        local_hidden_states = blk(local_hidden_states, rotary_pos_emb=local_rotary_pos_emb)

    # Gather the full sequence back from all ranks.
    S, D = local_hidden_states.shape
    hidden_states_gather = torch.zeros(world_size * S, D, dtype=local_hidden_states.dtype, device=local_hidden_states.device)
    dist.all_gather_into_tensor(hidden_states_gather, local_hidden_states)

    return hidden_states_gather

However, I’ve noticed that the GPU memory usage keeps accumulating over time and isn’t properly released. I can provide a pickle file with memory statistics, which can be viewed on [PyTorch Memory Visualization](https://pytorch.ac.cn/memory_viz). The pickle file is
gpu_mem.zip
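A snapshot like this can be captured with PyTorch's built-in memory-history hooks; a minimal sketch (standard torch.cuda.memory APIs, PyTorch >= 2.1 assumed, not part of the forward code above):

import torch

torch.cuda.memory._record_memory_history(max_entries=100000)
# ... run the forward pass under investigation ...
torch.cuda.memory._dump_snapshot("gpu_mem.pickle")       # open this file on the memory_viz page
torch.cuda.memory._record_memory_history(enabled=None)   # stop recording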

Upon analysis, I observed that memory created by torch.empty on line 94 in ring/utils.py cannot be released properly. Additionally, several operations like tensor.to(dtype), all_to_all, and others also seem to have issues with memory not being freed. I suspect that this may be related to the use of USP, rather than being a problem with any single operation.

If you have any insights or suggestions that could help resolve this issue, I would greatly appreciate it!

Thanks!

@baifanxxx
Author

Issue Description:

I’ve been testing the model on an A100 40GB GPU with the following configuration:

  • Input sequence length: 114464
  • Hidden size: 1280
  • sp_ulysses_degree = 1
  • sp_ring_degree = 2

With this configuration I hit the memory-release issue and the GPU memory statistics described above. When I switch to sp_ulysses_degree = 2 and sp_ring_degree = 1, the memory issue disappears and everything works fine. The pickle file from that scenario is attached below:
114464_e2e_u2r1_mem.zip
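For context, a minimal sketch of how these two degrees map onto the process-group setup, using the set_seq_parallel_pg helper shown in this repository's README (treat the exact signature as an assumption):

import torch.distributed as dist
from yunchang import set_seq_parallel_pg

dist.init_process_group(backend="nccl")
rank, world_size = dist.get_rank(), dist.get_world_size()

sp_ulysses_degree = 1   # the configuration that leaks
sp_ring_degree = 2      # (ulysses=2, ring=1 works fine)
# ulysses_degree * ring_degree gives the sequence-parallel size (2 here)
set_seq_parallel_pg(sp_ulysses_degree, sp_ring_degree, rank, world_size)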

However, regardless of whether I use Ulysses or ring parallelism, for very long input sequences like the one above (length = 114464) inference is significantly slower than without sequence parallelism. I would like to discuss this with the authors, as it seems related to how USP (Unified Sequence Parallelism) handles extremely long sequence inputs.

Have you tested inference performance with long sequences like this? It would be great to understand how this issue can be addressed or optimized for very large input sequences.

@feifeibear
Owner

feifeibear commented Dec 13, 2024

Thank you for your insightful analysis. Indeed, we have previously encountered similar memory leak issues, and this time I will attempt to improve how allocations such as torch.empty are handled.

As a temporary workaround, setting use_sync=True eliminates the memory leak. However, it has been reported to hurt performance on some GPUs.
https://github.com/feifeibear/long-context-attention/blob/main/yunchang/hybrid/attn_layer.py#L22
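A sketch of applying the workaround when constructing the attention layer (use_sync is the flag at the linked line; any other arguments would keep their defaults):

from yunchang import LongContextAttention

# use_sync=True adds an extra synchronization on the all-to-all path,
# which avoids the leak at the cost of some latency.
usp_attn = LongContextAttention(use_sync=True)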

feifeibear self-assigned this Dec 13, 2024
feifeibear added the bug label Dec 13, 2024
@baifanxxx
Author

Thank you for your thoughtful response. Setting use_sync=True does indeed temporarily address the memory leak issue; however, it introduces additional latency due to increased synchronization overhead. We hope to explore more efficient solutions to tackle the memory issue effectively. Once again, thank you for your attention and valuable feedback.

@feifeibear
Owner

> Thank you for your thoughtful response. Setting use_sync=True does indeed temporarily address the memory leak issue; however, it introduces additional latency due to increased synchronization overhead. We hope to explore more efficient solutions to tackle the memory issue effectively. Once again, thank you for your attention and valuable feedback.

I will spend some time on the memory issue. If you have any progress, feel free to continue the discussion in this issue, and you're also welcome to submit a PR.

@feifeibear
Owner

@baifanxxx Did you apply USP for training, or for inference only?

@baifanxxx
Author

Only in inference.

@feifeibear
Owner

Could you please try the following? I could hardly build a standalone test script that reproduces the memory leak; it may only show up when combined with other communication ops, for example the all_gather in your code:

TORCH_NCCL_AVOID_RECORD_STREAMS=1
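For example, the variable can be set before the process group is created (or simply exported in the launch shell); a minimal sketch:

import os

# Must be set before the NCCL process group is constructed.
os.environ["TORCH_NCCL_AVOID_RECORD_STREAMS"] = "1"

import torch.distributed as dist
dist.init_process_group(backend="nccl")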
