
Sudden cuda OOM issue #129

Closed
wheresmadog opened this issue Aug 5, 2023 · 19 comments

Comments

@wheresmadog
Contributor

wheresmadog commented Aug 5, 2023

Context:

  1. I'm using the UNet2DModel class from HuggingFace, see here.
  2. It had been working just fine until today.
  3. Suddenly the process tries to allocate 56 GiB on the GPU --> obviously not feasible, so it causes an OOM.

Things I've tried:

  1. Only ONE instance has been pushed through the model --> batch size / number of instances is not the cause.
  2. Rebooted the computer --> nope, didn't work.
  3. Used the debugger to capture the OOM moment --> see below.

Can you solve it?

Detailed stats for geeks:
Summary of CUDA memory

|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 1            |        cudaMalloc retries: 1         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |  566398 KB |  595070 KB |    1827 MB |    1274 MB |
|       from large pool |  556332 KB |  585004 KB |    1812 MB |    1269 MB |
|       from small pool |   10066 KB |   10202 KB |      14 MB |       5 MB |
|---------------------------------------------------------------------------|
| Active memory         |  566398 KB |  595070 KB |    1827 MB |    1274 MB |
|       from large pool |  556332 KB |  585004 KB |    1812 MB |    1269 MB |
|       from small pool |   10066 KB |   10202 KB |      14 MB |       5 MB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |  624640 KB |  624640 KB |  624640 KB |       0 B  |
|       from large pool |  612352 KB |  612352 KB |  612352 KB |       0 B  |
|       from small pool |   12288 KB |   12288 KB |   12288 KB |       0 B  |
|---------------------------------------------------------------------------|
| Non-releasable memory |   58242 KB |  115586 KB |  538130 KB |  479888 KB |
|       from large pool |   56020 KB |  113364 KB |  524502 KB |  468482 KB |
|       from small pool |    2222 KB |    3595 KB |   13628 KB |   11406 KB |
|---------------------------------------------------------------------------|
| Allocations           |     171    |     183    |    8735    |    8564    |
|       from large pool |      52    |      53    |      82    |      30    |
|       from small pool |     119    |     131    |    8653    |    8534    |
|---------------------------------------------------------------------------|
| Active allocs         |     171    |     183    |    8735    |    8564    |
|       from large pool |      52    |      53    |      82    |      30    |
|       from small pool |     119    |     131    |    8653    |    8534    |
|---------------------------------------------------------------------------|
| GPU reserved segments |      26    |      26    |      26    |       0    |
|       from large pool |      20    |      20    |      20    |       0    |
|       from small pool |       6    |       6    |       6    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |      14    |      16    |    3640    |    3626    |
|       from large pool |       8    |       9    |      19    |      11    |
|       from small pool |       6    |       8    |    3621    |    3615    |
|===========================================================================|

The memory stats look pretty stable, don't they?
Then, all of a sudden, at the fourth iteration of the downsampling blocks, the model tries to allocate 56 GiB on the GPU.
The OOM count in the summary above refers to this very issue, since the summary is printed at the debugging breakpoint.
There has been no code modification.
The forward pass runs under torch.no_grad(), so no additional autograd graph is being built that could explain this ridiculous allocation.
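
For reference, a minimal sketch of how a summary like the one above can be captured right at the failure point (device 0 assumed; model, x, t are the objects from the traceback below):

import torch

def forward_with_summary(model, x, t):
    try:
        with torch.no_grad():
            return model(x.cuda(), t.cuda())
    except RuntimeError:
        # dump the allocator state at the moment the OOM is raised
        print(torch.cuda.memory_summary(device=0))
        raise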

Exception has occurred: RuntimeError
CUDA out of memory. Tried to allocate 56.00 GiB (GPU 0; 23.70 GiB total capacity; 553.12 MiB already allocated; 21.42 GiB free; 610.00 MiB reserved in total by PyTorch)
  File "/home/wheresmadog/Projects/wheresmadog/src/models.py", line 157, in forward
    sample, res_samples = downsample_block(hidden_states=sample, temb=emb)
  File "/home/wheresmadog/Projects/wheresmadog/src/models.py", line 182, in <module>
    res = model(x.cuda(),t.cuda())
RuntimeError: CUDA out of memory. Tried to allocate 56.00 GiB (GPU 0; 23.70 GiB total capacity; 553.12 MiB already allocated; 21.42 GiB free; 610.00 MiB reserved in total by PyTorch)
@wheresmadog
Contributor Author

wheresmadog commented Aug 5, 2023

Pkg versions

absl-py==1.0.0
addict==2.4.0
aiohttp==3.8.4
aiosignal==1.3.1
antlr4-python3-runtime==4.8
anyio==3.5.0
appdirs==1.4.4
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
asttokens @ file:///home/conda/feedstock_root/build_artifacts/asttokens_1618968359944/work
async-timeout==4.0.2
attrs==21.4.0
av==9.2.0
Babel==2.10.1
backcall @ file:///home/conda/feedstock_root/build_artifacts/backcall_1592338393461/work
backports.functools-lru-cache @ file:///home/conda/feedstock_root/build_artifacts/backports.functools_lru_cache_1618230623929/work
beautifulsoup4==4.11.1
black==21.4b2
bleach==5.0.0
cachetools==5.0.0
certifi==2021.10.8
cffi==1.15.0
charset-normalizer==2.0.12
click==8.1.3
cloudpickle==2.0.0
cycler==0.11.0
datasets==2.13.1
debugpy @ file:///tmp/build/80754af9/debugpy_1637091796427/work
decorator @ file:///home/conda/feedstock_root/build_artifacts/decorator_1641555617451/work
defusedxml==0.7.1
deprecation==2.1.0
detectron2==0.6+cu111
diffusers==0.18.1
dill==0.3.6
entrypoints @ file:///home/conda/feedstock_root/build_artifacts/entrypoints_1643888246732/work
executing @ file:///home/conda/feedstock_root/build_artifacts/executing_1646044401614/work
fastjsonschema==2.15.3
filelock==3.12.2
fonttools==4.31.2
frozenlist==1.4.0
fsspec==2023.6.0
future==0.18.2
fvcore==0.1.5.post20220512
google-auth==2.6.3
google-auth-oauthlib==0.4.6
grpcio==1.51.3
huggingface-hub==0.16.4
hydra-core==1.1.2
idna==3.3
imageio==2.19.1
importlib-metadata==4.11.3
importlib-resources==5.2.3
iopath==0.1.9
ipykernel @ file:///home/conda/feedstock_root/build_artifacts/ipykernel_1648898275899/work/dist/ipykernel-6.11.0-py3-none-any.whl
ipython @ file:///home/conda/feedstock_root/build_artifacts/ipython_1648413572172/work
ipython-genutils==0.2.0
ipywidgets==7.7.0
jedi @ file:///home/conda/feedstock_root/build_artifacts/jedi_1649067097809/work
Jinja2==3.1.2
joblib==1.1.0
json5==0.9.8
jsonschema==4.5.1
jupyter-client @ file:///home/conda/feedstock_root/build_artifacts/jupyter_client_1633454794268/work
jupyter-core @ file:///home/conda/feedstock_root/build_artifacts/jupyter_core_1645024265313/work
jupyter-packaging==0.12.0
jupyter-server==1.17.0
jupyterlab==3.4.0
jupyterlab-pygments==0.2.2
jupyterlab-server==2.13.0
jupyterlab-widgets==1.1.0
kiwisolver==1.4.2
lxml==4.9.1
Markdown==3.3.6
MarkupSafe==2.1.1
matplotlib==3.5.1
matplotlib-inline @ file:///home/conda/feedstock_root/build_artifacts/matplotlib-inline_1631080358261/work
mistune==0.8.4
multidict==6.0.4
multiprocess==0.70.14
mypy-extensions==0.4.3
natsort==8.1.0
nbclassic==0.3.7
nbclient==0.6.3
nbconvert==6.5.0
nbformat==5.4.0
nest-asyncio @ file:///home/conda/feedstock_root/build_artifacts/nest-asyncio_1648959695634/work
networkx==2.8.8
notebook==6.4.11
notebook-shim==0.1.0
numpy==1.22.3
oauthlib==3.2.0
odfpy==1.4.1
omegaconf==2.1.2
open3d==0.15.2
opencv-python==4.5.5.64
packaging==21.3
pandas==1.4.2
pandocfilters==1.5.0
parso @ file:///home/conda/feedstock_root/build_artifacts/parso_1638334955874/work
pathspec==0.9.0
pdfminer==20191125
pexpect @ file:///home/conda/feedstock_root/build_artifacts/pexpect_1602535608087/work
pickleshare @ file:///home/conda/feedstock_root/build_artifacts/pickleshare_1602536217715/work
Pillow==9.1.0
portalocker==2.4.0
positional-encodings==6.0.1
prometheus-client==0.14.1
prompt-toolkit @ file:///home/conda/feedstock_root/build_artifacts/prompt-toolkit_1644497866770/work
protobuf==3.20.0
psutil @ file:///tmp/build/80754af9/psutil_1612298023621/work
ptyprocess @ file:///home/conda/feedstock_root/build_artifacts/ptyprocess_1609419310487/work/dist/ptyprocess-0.7.0-py2.py3-none-any.whl
pure-eval @ file:///home/conda/feedstock_root/build_artifacts/pure_eval_1642875951954/work
pyarrow==12.0.1
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycocotools==2.0.4
pycparser==2.21
pycryptodome==3.15.0
pydot==1.4.2
Pygments @ file:///home/conda/feedstock_root/build_artifacts/pygments_1641580240686/work
pyparsing==3.0.7
pyquaternion==0.9.9
pyrsistent==0.18.1
python-dateutil @ file:///home/conda/feedstock_root/build_artifacts/python-dateutil_1626286286081/work
pytz==2022.1
PyYAML==6.0
pyzmq==19.0.2
regex==2022.4.24
requests==2.27.1
requests-oauthlib==1.3.1
rsa==4.8
safetensors==0.3.1
scikit-learn==1.0.2
scipy==1.8.0
Send2Trash==1.8.0
six @ file:///home/conda/feedstock_root/build_artifacts/six_1620240208055/work
sniffio==1.2.0
soupsieve==2.3.2.post1
stack-data @ file:///home/conda/feedstock_root/build_artifacts/stack_data_1644872665635/work
tabulate==0.8.9
tdqm==0.0.1
tensorboard==2.12.0
tensorboard-data-server==0.7.0
tensorboard-plugin-wit==1.8.1
tensorboardX==2.5
termcolor==1.1.0
terminado==0.13.3
threadpoolctl==3.1.0
tinycss2==1.1.1
tokenizers==0.13.3
toml==0.10.2
tomlkit==0.10.2
torch==1.8.2+cu111
torch-fidelity==0.3.0
torch-tb-profiler==0.4.0
torchaudio==0.8.2
torchmetrics==0.11.3
torchvision==0.9.2+cu111
tornado @ file:///home/conda/feedstock_root/build_artifacts/tornado_1610094706440/work
tqdm==4.64.0
traitlets @ file:///home/conda/feedstock_root/build_artifacts/traitlets_1635260543454/work
transformers==4.30.2
typing_extensions==4.1.1
urllib3==1.26.9
wcwidth @ file:///home/conda/feedstock_root/build_artifacts/wcwidth_1600965781394/work
webencodings==0.5.1
websocket-client==1.3.2
Werkzeug==2.1.1
widgetsnbextension==3.6.0
xxhash==3.2.0
yacs==0.1.8
yarl==1.9.2
zipp==3.8.0

Driver version:
470.199.02

@JiHun-Lim

First of all, how about using wandb to track GPU memory, as shown here:
https://wandb.ai/wandb/common-ml-errors/reports/How-To-Use-GPU-with-PyTorch---VmlldzozMzAxMDk

@pinga999
Contributor

pinga999 commented Aug 5, 2023

I'd like to test the same code on my local workstation to check GPU usage.

Could you share a code snippet or a short script that reproduces the situation? @wheresmadog

@wheresmadog
Contributor Author

wheresmadog commented Aug 5, 2023

As for reproduction, you can run:

import torch
from diffusers import UNet2DModel

model = UNet2DModel.from_pretrained('google/ddpm-cat-256').cuda()

x = torch.randn((1,3,256,256))
t = torch.tensor([0], dtype=torch.int32)

res = model(x.cuda(), t.cuda()) # this is the line where it blocks

However, I'm skeptical that the same error will occur on other machines, because the snippet uses nothing but the officially released HF code.
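
For cross-machine comparison, these are the environment bits that usually matter (a quick check using only stock torch APIs):

import torch

print('torch      :', torch.__version__)
print('CUDA build :', torch.version.cuda)
print('cuDNN      :', torch.backends.cudnn.version())
print('GPU        :', torch.cuda.get_device_name(0))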

@wheresmadog
Contributor Author

First of all, how about using wandb to track GPU memory, as shown here:
https://wandb.ai/wandb/common-ml-errors/reports/How-To-Use-GPU-with-PyTorch---VmlldzozMzAxMDk

Although wandb is not in use here, I believe torch.cuda's own memory reporting is on par with it.
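
In case it helps, here is an illustrative sketch of the same kind of per-step tracking using torch.cuda only (printed instead of logged to wandb):

import torch

def log_gpu_memory(tag=''):
    # currently allocated vs. reserved memory on the default CUDA device, in MiB
    allocated = torch.cuda.memory_allocated() / 2**20
    reserved = torch.cuda.memory_reserved() / 2**20
    print(f'{tag}: allocated={allocated:.1f} MiB, reserved={reserved:.1f} MiB')

# e.g. log_gpu_memory('before forward') and log_gpu_memory('after forward')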

@JiHun-Lim

JiHun-Lim commented Aug 5, 2023

Hmm, you can just check how memory is used over time this way; I recommended it because seeing how the memory changes might help track down the cause. :)

Source: huggingface/transformers/issues/13019

@cth127
Contributor

cth127 commented Aug 5, 2023

res = model(x.cuda(), t.cuda())

Instead of this, try

x = x.to('cuda')
t = t.to('cuda')
res = model(x, t)

Does the error still occur with this?

@pinga999 pinga999 changed the title Sudden cudu OOM issue Sudden cuda OOM issue Aug 5, 2023
@wheresmadog
Contributor Author

Hmm, you can just check how memory is used over time this way; I recommended it because seeing how the memory changes might help track down the cause. :)

Source: huggingface/transformers/issues/13019

First of all, thanks for your thoughtful answer.

The scenario in the referenced issue is a model that is run over and over for hyperparameter tuning, which implies it can at least iterate through the model successfully a couple of times. Meanwhile, my problem is that not even one instance makes it through. To be clear: the OOM occurs in the middle of a particular layer, not after a few full passes through the model.


res = model(x.cuda(), t.cuda())

Instead of this, try

x = x.to('cuda')
t = t.to('cuda')
res = model(x, t)

Does the error still occur with this?

Yep, still not working. But I did find the layer at which the error occurs, where the GPU memory allocator suddenly strikes out.
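
For anyone curious, one way to narrow this down without stepping through a debugger is a forward pre-hook that prints the allocator state before each submodule runs (an illustrative sketch, not exactly what I did):

import torch

def attach_memory_hooks(model):
    # print allocated memory right before each submodule's forward pass;
    # the last line printed before the OOM names the offending block
    handles = []
    for name, module in model.named_modules():
        def pre_hook(module, inputs, _name=name):
            print(f'{_name}: {torch.cuda.memory_allocated() / 2**20:.1f} MiB allocated')
        handles.append(module.register_forward_pre_hook(pre_hook))
    return handles  # call .remove() on each handle to detach the hooks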

@wheresmadog
Contributor Author

wheresmadog commented Aug 5, 2023

If anyone tries the code snippet and is unable to reproduce my problem, then it looks more like a hardware issue than a code error. (I hope my GPU hasn't melted down or something.)

@pinga999
Contributor

pinga999 commented Aug 5, 2023

I'd like to help, but I'm sorry to say this :(

I wanted to run the code snippet, but I can't run it properly because I repeatedly get errors like
cannot import name 'CLIPTextModelWithProjection' from 'transformers'

P.S. I'd also like to ask whether you have run into this import error.

@JiHun-Lim

JiHun-Lim commented Aug 6, 2023

For now, it seems to run without any problems on my end! I'll post the GPU memory info as well.

@wheresmadog
Contributor Author

I'd like to help, but I'm sorry to say this :(

I wanted to run the code snippet, but I can't run it properly because I repeatedly get errors like cannot import name 'CLIPTextModelWithProjection' from 'transformers'

P.S. I'd also like to ask whether you have run into this import error.

Well, that's not a case I've run into before. Did you install from the requirements list uploaded in the comment above?
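
A quick sanity check against the pinned versions above (diffusers==0.18.1, transformers==4.30.2) might help; that import error usually means the installed transformers is older than what diffusers expects:

import diffusers, transformers

print('diffusers   :', diffusers.__version__)
print('transformers:', transformers.__version__)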

@wheresmadog
Contributor Author

Thanks for your confirmation. This supports the hypothesis that the likely culprit is hardware.

@wheresmadog
Contributor Author

As for your curiosity, I've figured out which part of the model the memory usage spikes in. It turns out an attention block is the one trying to allocate that massive amount of memory.

image

You can see this block is not a huge one, and almost any GPU should be able to run it; again, batch size is not an issue since it is set to 1.
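
A rough back-of-the-envelope estimate (assuming, for illustration, a 32x32 feature map with 512 channels in fp32 and batch size 1 at this block) shows how far off a 56 GiB request is:

tokens = 32 * 32              # flattened spatial positions at an assumed 32x32 resolution
channels = 512                # assumed channel width at this block
bytes_per_float = 4           # fp32

scores = tokens * tokens * bytes_per_float      # self-attention score matrix
qkv = 3 * tokens * channels * bytes_per_float   # query/key/value projections

print(f'scores ~ {scores / 2**20:.0f} MiB, q/k/v ~ {qkv / 2**20:.0f} MiB')
# scores ~ 4 MiB, q/k/v ~ 6 MiB -- nowhere near 56 GiB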

@cth127
Contributor

cth127 commented Aug 6, 2023

When I ran the snippet above on Google Colab, it only used about 3 GB of GPU memory... so I can't reproduce it, whoa.

@wheresmadog
Contributor Author

image

LOL, now it's asking for over 800 GiB. Am I dealing with some sort of LLM?

@cth127
Contributor

cth127 commented Aug 6, 2023

Maybe related. How about updating your torch version? huggingface/diffusers#4159

@wheresmadog
Contributor Author

Maybe related. How about updating your torch version? huggingface/diffusers#4159

It turns out to be a compatibility issue. Updating to torch 2.0 resolved it.

Thank you.

P.S. But why did it work before?
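
For completeness, a minimal way to confirm the fix after the upgrade (assuming torch>=2.0 with a matching CUDA build) is to rerun the repro and check the peak allocation directly:

import torch
from diffusers import UNet2DModel

model = UNet2DModel.from_pretrained('google/ddpm-cat-256').cuda()
with torch.no_grad():
    res = model(torch.randn(1, 3, 256, 256).cuda(), torch.tensor([0]).cuda())

print(torch.__version__, f'peak {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB')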

@pinga999
Contributor

pinga999 commented Aug 6, 2023

It turns out to be a compatibility issue. Updating to torch 2.0 resolved it.

Thank you.

P.S. But why did it work before?

[image]
