Memory Leakage with USP and Transformer Blocks #112
Issue Description: I've been testing the model on an A100 40GB GPU with the following configuration:
This setup leads to the GPU memory statistics described above, but I encountered a memory release issue, including when I switch the values to … and regardless of whether I use …. Have you tested inference performance with long sequences like this? It would be great to understand how this issue can be addressed or optimized for very large input sequences.
Thank you for your insightful analysis. Indeed, we have previously encountered similar memory leak issues, and this time I will attempt to improve functions like …. As a temporary workaround, setting … may help.
Thank you for your thoughtful response. Setting …
I will spend some time on the memory issue. If you have any progress, feel free to continue the discussion in this issue, and you're also welcome to submit a PR.
@baifanxxx Did you apply USP in training, or only in inference?
Only in inference.
Could you please try this solution: `TORCH_NCCL_AVOID_RECORD_STREAMS=1`? I have had a hard time building a test script that reproduces the memory leak on its own. Maybe it only appears when combined with other communication ops, for example the `all_gather` in your code.
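For reference, the variable needs to be in the environment before the NCCL process group is created; a minimal sketch (the script layout and launcher are just placeholders):

```python
# Minimal sketch: make sure TORCH_NCCL_AVOID_RECORD_STREAMS is set before the
# first NCCL process group is constructed. Exporting it in the shell before
# the launcher (e.g. `TORCH_NCCL_AVOID_RECORD_STREAMS=1 torchrun ...`) is
# equivalent and usually the safest option.
import os

os.environ["TORCH_NCCL_AVOID_RECORD_STREAMS"] = "1"  # set before init_process_group

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# ... build the model, wrap attention with USP, run inference as before ...
```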
Hi,
First of all, great work on the project! However, I've encountered an issue with memory release when using USP. Specifically, I'm using USP for end-to-end sequence parallelism outside the multi-layer Transformer blocks: after all Transformer blocks have been processed, the final output is gathered via `all_gather` (a simplified sketch of the pattern is shown below). However, I've noticed that the GPU memory usage keeps accumulating over time and isn't properly released. I can provide a pickle file with memory statistics, which can be viewed on [PyTorch Memory Visualization](https://pytorch.ac.cn/memory_viz). The pickle file is
gpu_mem.zip
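The simplified structure looks roughly like this (an illustrative sketch with placeholder names and shapes, not the exact code; the USP attention and its ring/all_to_all communication live inside the blocks):

```python
# Illustrative sketch of the setup: each rank holds a shard of the sequence,
# runs all Transformer blocks on its shard (USP attention communicates inside
# the blocks), and the full sequence is only reassembled at the very end.
import torch
import torch.distributed as dist

def run_blocks(hidden_local, blocks):
    # hidden_local: [batch, seq_len // world_size, dim] shard on this rank
    for block in blocks:
        hidden_local = block(hidden_local)
    return hidden_local

def gather_sequence(hidden_local):
    # Reassemble the full sequence on every rank after the last block.
    world_size = dist.get_world_size()
    shards = [torch.empty_like(hidden_local) for _ in range(world_size)]
    dist.all_gather(shards, hidden_local.contiguous())
    return torch.cat(shards, dim=1)  # concatenate along the sequence dimension

# simplified inference step:
# hidden_local = run_blocks(hidden_local, transformer_blocks)
# output = gather_sequence(hidden_local)
```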
Upon analysis, I observed that memory created by `torch.empty` on line 94 in `ring/utils.py` cannot be released properly. Additionally, several operations like `tensor.to(dtype)`, `all_to_all`, and others also seem to have issues with memory not being freed. I suspect that this may be related to the use of USP rather than being a problem with any single operation.

If you have any insights or suggestions that could help resolve this issue, I would greatly appreciate it!
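In case it helps with reproduction, a snapshot like the attached one can be recorded with PyTorch's allocator history tooling (sketch only; the underscore-prefixed functions are private APIs and may differ between PyTorch versions):

```python
# Sketch: record an allocator snapshot that can be loaded into the PyTorch
# memory visualizer. The underscore-prefixed functions are private APIs and
# may change between PyTorch versions.
import torch

torch.cuda.memory._record_memory_history(max_entries=100_000)

# ... run a few USP inference iterations here ...

torch.cuda.memory._dump_snapshot("gpu_mem.pickle")
torch.cuda.memory._record_memory_history(enabled=None)  # stop recording
```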
Thanks!