Sudden CUDA OOM issue #129
Comments
Pkg versions:
Driver version:
First, try using wandb: https://wandb.ai/wandb/common-ml-errors/reports/How-To-Use-GPU-with-PyTorch---VmlldzozMzAxMDk
I want to test the same code on my local workstation to check GPU usage. Could you share a code snippet or a short script to reproduce a similar situation? @wheresmadog
As for reproduction, you can run:
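Roughly, a minimal sketch of that kind of run, assuming a stock diffusers UNet2DModel forward loop under torch.no_grad(); the model config, batch size, resolution, and iteration count here are illustrative guesses, not the exact values from the issue:

```python
# Hypothetical reproduction sketch -- config values are illustrative guesses,
# not the exact ones from the issue.
import torch
from diffusers import UNet2DModel

device = "cuda"
model = UNet2DModel(sample_size=64, in_channels=3, out_channels=3).to(device)
model.eval()

x = torch.randn(8, 3, 64, 64, device=device)      # dummy batch
t = torch.randint(0, 1000, (8,), device=device)   # dummy timesteps

with torch.no_grad():
    for step in range(100):
        _ = model(x, t).sample
        print(f"step {step}: "
              f"{torch.cuda.memory_allocated() / 2**20:.0f} MiB allocated")
```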
However, I'm skeptical that the same error will occur on a different machine, because the code snippet uses nothing but the official release of the HF code.
Despite
Hmm, you can just check how memory usage changes over time like this. If you change this to this, does the error still occur?
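A minimal sketch of that kind of over-time memory check, logging to wandb as in the report linked above; the project name and the dummy workload are hypothetical stand-ins for the real loop:

```python
# Hypothetical sketch: log GPU memory per step to wandb; the project name and
# the workload are placeholders, not taken from the thread.
import torch
import wandb

wandb.init(project="cuda-oom-debug")  # hypothetical project name

x = torch.randn(1024, 1024, device="cuda")
for step in range(100):
    x = torch.relu(x @ x)  # stand-in for one forward pass of the real model
    wandb.log({
        "allocated_MiB": torch.cuda.memory_allocated() / 2**20,
        "reserved_MiB": torch.cuda.memory_reserved() / 2**20,
    })
```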
If anyone tries the code snippet and is unable to reproduce the problem I have, then it is likely a hardware issue rather than a code error. (I hope my GPU hasn't melted down or something.)
I'd like to help, but I'm sorry to say this :( I wanted to run the code snippet, but I can't run it properly because I repeatedly get import errors. P.S. I also want to ask whether you have experienced this import error.
Well, that's not an error I've faced before. Did you install from the requirements file uploaded in the comment?
Thanks for your confirmation. This supports the idea that the likely suspect is the hardware.
When I ran the snippet above on Google Colab, GPU memory usage was only around 3 GB... so I can't reproduce it, unfortunately.
Maybe related. How about updating your torch version? huggingface/diffusers#4159
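A quick, hedged way to check which versions are installed before and after the upgrade, using only the standard version attributes:

```python
# Check installed torch/diffusers versions and CUDA availability.
import torch
import diffusers

print("torch:", torch.__version__)
print("diffusers:", diffusers.__version__)
print("CUDA build:", torch.version.cuda, "| CUDA available:", torch.cuda.is_available())
```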
It turns out to be a compatibility issue. Updating to torch 2.0 resolved it. Thank you. P.S. But why did it work before?
Context: the model is the UNet2DModel class from HuggingFace, see here.
Things I've tried:
Can you solve it?
Detailed stats for geeks:
Summary of CUDA memory
Memory stat seems pretty stable, doesn't it? Then all of a sudden, at the fourth iteration of the down-sampling blocks, the model tries to allocate 56 GiB on the GPU. The OOM count in the summary corresponds to this very issue, as the summary was captured at the debugging flag. And there had been no code modification. The torch.no_grad() flag has been up; that is, no further autograd graph has been created that could cause this ridiculous memory allocation.
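A minimal sketch, not the author's actual script, of how a CUDA memory summary like the one above can be captured around a torch.no_grad() forward pass; `model`, `x`, and `t` are the hypothetical placeholders from the reproduction sketch earlier in the thread:

```python
# Hypothetical sketch: capture a CUDA memory summary around a no_grad forward
# pass; `model`, `x`, and `t` come from the reproduction sketch above.
import torch

@torch.no_grad()
def run_and_report(model, x, t):
    torch.cuda.reset_peak_memory_stats()
    out = model(x, t).sample
    print(torch.cuda.memory_summary(abbreviated=True))
    print(f"peak allocated: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
    return out
```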