
Sudden cuda OOM issue #129

Closed
wheresmadog opened this issue Aug 5, 2023 · 19 comments

Comments

@wheresmadog
Contributor

wheresmadog commented Aug 5, 2023

Context:

  1. I'm using the UNet2DModel class from HuggingFace, see here.
  2. It had been working just fine until today.
  3. Suddenly the process tries to allocate 56 GiB on the GPU --> obviously not feasible, so it causes an OOM.

Things I've tried:

  1. Only ONE instance has been pushed through the model --> batch size / number of instances is not the cause.
  2. Rebooted the computer --> nope, didn't work.
  3. Used the debugger to capture the OOM moment --> see below.

Can you solve it?

Detailed stats for geeks:
Summary of CUDA memory

|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 1            |        cudaMalloc retries: 1         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |  566398 KB |  595070 KB |    1827 MB |    1274 MB |
|       from large pool |  556332 KB |  585004 KB |    1812 MB |    1269 MB |
|       from small pool |   10066 KB |   10202 KB |      14 MB |       5 MB |
|---------------------------------------------------------------------------|
| Active memory         |  566398 KB |  595070 KB |    1827 MB |    1274 MB |
|       from large pool |  556332 KB |  585004 KB |    1812 MB |    1269 MB |
|       from small pool |   10066 KB |   10202 KB |      14 MB |       5 MB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |  624640 KB |  624640 KB |  624640 KB |       0 B  |
|       from large pool |  612352 KB |  612352 KB |  612352 KB |       0 B  |
|       from small pool |   12288 KB |   12288 KB |   12288 KB |       0 B  |
|---------------------------------------------------------------------------|
| Non-releasable memory |   58242 KB |  115586 KB |  538130 KB |  479888 KB |
|       from large pool |   56020 KB |  113364 KB |  524502 KB |  468482 KB |
|       from small pool |    2222 KB |    3595 KB |   13628 KB |   11406 KB |
|---------------------------------------------------------------------------|
| Allocations           |     171    |     183    |    8735    |    8564    |
|       from large pool |      52    |      53    |      82    |      30    |
|       from small pool |     119    |     131    |    8653    |    8534    |
|---------------------------------------------------------------------------|
| Active allocs         |     171    |     183    |    8735    |    8564    |
|       from large pool |      52    |      53    |      82    |      30    |
|       from small pool |     119    |     131    |    8653    |    8534    |
|---------------------------------------------------------------------------|
| GPU reserved segments |      26    |      26    |      26    |       0    |
|       from large pool |      20    |      20    |      20    |       0    |
|       from small pool |       6    |       6    |       6    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |      14    |      16    |    3640    |    3626    |
|       from large pool |       8    |       9    |      19    |      11    |
|       from small pool |       6    |       8    |    3621    |    3615    |
|===========================================================================|

The memory stats look pretty stable, don't they?
Then, all of a sudden, at the fourth iteration of the downsampling blocks, the model tries to allocate 56 GiB on the GPU.
The OOM count in the summary above refers to this very issue, since the summary is printed at the debugging breakpoint.
There has been no code modification.
The forward pass runs under torch.no_grad(), so no additional autograd graph is being built that could explain this ridiculous allocation.
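
For reference, a minimal sketch of how a summary like the one above can be captured right at the failure point (device 0 assumed; model, x, t are the objects from the traceback below):

import torch

def forward_with_summary(model, x, t):
    try:
        with torch.no_grad():
            return model(x.cuda(), t.cuda())
    except RuntimeError:
        # dump the allocator state at the moment the OOM is raised
        print(torch.cuda.memory_summary(device=0))
        raise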

Exception has occurred: RuntimeError
CUDA out of memory. Tried to allocate 56.00 GiB (GPU 0; 23.70 GiB total capacity; 553.12 MiB already allocated; 21.42 GiB free; 610.00 MiB reserved in total by PyTorch)
  File "/home/wheresmadog/Projects/wheresmadog/src/models.py", line 157, in forward
    sample, res_samples = downsample_block(hidden_states=sample, temb=emb)
  File "/home/wheresmadog/Projects/wheresmadog/src/models.py", line 182, in <module>
    res = model(x.cuda(),t.cuda())
RuntimeError: CUDA out of memory. Tried to allocate 56.00 GiB (GPU 0; 23.70 GiB total capacity; 553.12 MiB already allocated; 21.42 GiB free; 610.00 MiB reserved in total by PyTorch)
@wheresmadog
Contributor Author

wheresmadog commented Aug 5, 2023

Pkg versions

absl-py==1.0.0
addict==2.4.0
aiohttp==3.8.4
aiosignal==1.3.1
antlr4-python3-runtime==4.8
anyio==3.5.0
appdirs==1.4.4
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
asttokens @ file:///home/conda/feedstock_root/build_artifacts/asttokens_1618968359944/work
async-timeout==4.0.2
attrs==21.4.0
av==9.2.0
Babel==2.10.1
backcall @ file:///home/conda/feedstock_root/build_artifacts/backcall_1592338393461/work
backports.functools-lru-cache @ file:///home/conda/feedstock_root/build_artifacts/backports.functools_lru_cache_1618230623929/work
beautifulsoup4==4.11.1
black==21.4b2
bleach==5.0.0
cachetools==5.0.0
certifi==2021.10.8
cffi==1.15.0
charset-normalizer==2.0.12
click==8.1.3
cloudpickle==2.0.0
cycler==0.11.0
datasets==2.13.1
debugpy @ file:///tmp/build/80754af9/debugpy_1637091796427/work
decorator @ file:///home/conda/feedstock_root/build_artifacts/decorator_1641555617451/work
defusedxml==0.7.1
deprecation==2.1.0
detectron2==0.6+cu111
diffusers==0.18.1
dill==0.3.6
entrypoints @ file:///home/conda/feedstock_root/build_artifacts/entrypoints_1643888246732/work
executing @ file:///home/conda/feedstock_root/build_artifacts/executing_1646044401614/work
fastjsonschema==2.15.3
filelock==3.12.2
fonttools==4.31.2
frozenlist==1.4.0
fsspec==2023.6.0
future==0.18.2
fvcore==0.1.5.post20220512
google-auth==2.6.3
google-auth-oauthlib==0.4.6
grpcio==1.51.3
huggingface-hub==0.16.4
hydra-core==1.1.2
idna==3.3
imageio==2.19.1
importlib-metadata==4.11.3
importlib-resources==5.2.3
iopath==0.1.9
ipykernel @ file:///home/conda/feedstock_root/build_artifacts/ipykernel_1648898275899/work/dist/ipykernel-6.11.0-py3-none-any.whl
ipython @ file:///home/conda/feedstock_root/build_artifacts/ipython_1648413572172/work
ipython-genutils==0.2.0
ipywidgets==7.7.0
jedi @ file:///home/conda/feedstock_root/build_artifacts/jedi_1649067097809/work
Jinja2==3.1.2
joblib==1.1.0
json5==0.9.8
jsonschema==4.5.1
jupyter-client @ file:///home/conda/feedstock_root/build_artifacts/jupyter_client_1633454794268/work
jupyter-core @ file:///home/conda/feedstock_root/build_artifacts/jupyter_core_1645024265313/work
jupyter-packaging==0.12.0
jupyter-server==1.17.0
jupyterlab==3.4.0
jupyterlab-pygments==0.2.2
jupyterlab-server==2.13.0
jupyterlab-widgets==1.1.0
kiwisolver==1.4.2
lxml==4.9.1
Markdown==3.3.6
MarkupSafe==2.1.1
matplotlib==3.5.1
matplotlib-inline @ file:///home/conda/feedstock_root/build_artifacts/matplotlib-inline_1631080358261/work
mistune==0.8.4
multidict==6.0.4
multiprocess==0.70.14
mypy-extensions==0.4.3
natsort==8.1.0
nbclassic==0.3.7
nbclient==0.6.3
nbconvert==6.5.0
nbformat==5.4.0
nest-asyncio @ file:///home/conda/feedstock_root/build_artifacts/nest-asyncio_1648959695634/work
networkx==2.8.8
notebook==6.4.11
notebook-shim==0.1.0
numpy==1.22.3
oauthlib==3.2.0
odfpy==1.4.1
omegaconf==2.1.2
open3d==0.15.2
opencv-python==4.5.5.64
packaging==21.3
pandas==1.4.2
pandocfilters==1.5.0
parso @ file:///home/conda/feedstock_root/build_artifacts/parso_1638334955874/work
pathspec==0.9.0
pdfminer==20191125
pexpect @ file:///home/conda/feedstock_root/build_artifacts/pexpect_1602535608087/work
pickleshare @ file:///home/conda/feedstock_root/build_artifacts/pickleshare_1602536217715/work
Pillow==9.1.0
portalocker==2.4.0
positional-encodings==6.0.1
prometheus-client==0.14.1
prompt-toolkit @ file:///home/conda/feedstock_root/build_artifacts/prompt-toolkit_1644497866770/work
protobuf==3.20.0
psutil @ file:///tmp/build/80754af9/psutil_1612298023621/work
ptyprocess @ file:///home/conda/feedstock_root/build_artifacts/ptyprocess_1609419310487/work/dist/ptyprocess-0.7.0-py2.py3-none-any.whl
pure-eval @ file:///home/conda/feedstock_root/build_artifacts/pure_eval_1642875951954/work
pyarrow==12.0.1
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycocotools==2.0.4
pycparser==2.21
pycryptodome==3.15.0
pydot==1.4.2
Pygments @ file:///home/conda/feedstock_root/build_artifacts/pygments_1641580240686/work
pyparsing==3.0.7
pyquaternion==0.9.9
pyrsistent==0.18.1
python-dateutil @ file:///home/conda/feedstock_root/build_artifacts/python-dateutil_1626286286081/work
pytz==2022.1
PyYAML==6.0
pyzmq==19.0.2
regex==2022.4.24
requests==2.27.1
requests-oauthlib==1.3.1
rsa==4.8
safetensors==0.3.1
scikit-learn==1.0.2
scipy==1.8.0
Send2Trash==1.8.0
six @ file:///home/conda/feedstock_root/build_artifacts/six_1620240208055/work
sniffio==1.2.0
soupsieve==2.3.2.post1
stack-data @ file:///home/conda/feedstock_root/build_artifacts/stack_data_1644872665635/work
tabulate==0.8.9
tdqm==0.0.1
tensorboard==2.12.0
tensorboard-data-server==0.7.0
tensorboard-plugin-wit==1.8.1
tensorboardX==2.5
termcolor==1.1.0
terminado==0.13.3
threadpoolctl==3.1.0
tinycss2==1.1.1
tokenizers==0.13.3
toml==0.10.2
tomlkit==0.10.2
torch==1.8.2+cu111
torch-fidelity==0.3.0
torch-tb-profiler==0.4.0
torchaudio==0.8.2
torchmetrics==0.11.3
torchvision==0.9.2+cu111
tornado @ file:///home/conda/feedstock_root/build_artifacts/tornado_1610094706440/work
tqdm==4.64.0
traitlets @ file:///home/conda/feedstock_root/build_artifacts/traitlets_1635260543454/work
transformers==4.30.2
typing_extensions==4.1.1
urllib3==1.26.9
wcwidth @ file:///home/conda/feedstock_root/build_artifacts/wcwidth_1600965781394/work
webencodings==0.5.1
websocket-client==1.3.2
Werkzeug==2.1.1
widgetsnbextension==3.6.0
xxhash==3.2.0
yacs==0.1.8
yarl==1.9.2
zipp==3.8.0

Driver version:
470.199.02

@JiHun-Lim

First of all, how about using wandb to track GPU memory, as shown here:
https://wandb.ai/wandb/common-ml-errors/reports/How-To-Use-GPU-with-PyTorch---VmlldzozMzAxMDk

@pinga999
Contributor

pinga999 commented Aug 5, 2023

I'd like to test the same code on my local workstation to check GPU usage.

Could you share a code snippet or a short script that reproduces the situation? @wheresmadog

@wheresmadog
Contributor Author

wheresmadog commented Aug 5, 2023

As for reproduction, you can run:

import torch
from diffusers import UNet2DModel

model = UNet2DModel.from_pretrained('google/ddpm-cat-256').cuda()

x = torch.randn((1,3,256,256))
t = torch.tensor([0], dtype=torch.int32)

res = model(x.cuda(), t.cuda()) # this is the line where it blocks

However, I'm skeptical that the same error will occur on other machines, because the snippet uses nothing but the officially released HF code.
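
For cross-machine comparison, these are the environment bits that usually matter (a quick check using only stock torch APIs):

import torch

print('torch      :', torch.__version__)
print('CUDA build :', torch.version.cuda)
print('cuDNN      :', torch.backends.cudnn.version())
print('GPU        :', torch.cuda.get_device_name(0))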

@wheresmadog
Contributor Author

First of all, how about using wandb to track GPU memory, as shown here:
https://wandb.ai/wandb/common-ml-errors/reports/How-To-Use-GPU-with-PyTorch---VmlldzozMzAxMDk

Although wandb is not in use here, I believe torch.cuda's own memory reporting is on par with it.
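
In case it helps, here is an illustrative sketch of the same kind of per-step tracking using torch.cuda only (printed instead of logged to wandb):

import torch

def log_gpu_memory(tag=''):
    # currently allocated vs. reserved memory on the default CUDA device, in MiB
    allocated = torch.cuda.memory_allocated() / 2**20
    reserved = torch.cuda.memory_reserved() / 2**20
    print(f'{tag}: allocated={allocated:.1f} MiB, reserved={reserved:.1f} MiB')

# e.g. log_gpu_memory('before forward') and log_gpu_memory('after forward')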

@JiHun-Lim

JiHun-Lim commented Aug 5, 2023

Hmm, you can just check how memory is used over time this way; I recommended it because seeing how the memory changes might help track down the cause. :)

Source: huggingface/transformers/issues/13019

@cth127
Contributor

cth127 commented Aug 5, 2023

res = model(x.cuda(), t.cuda())

Instead of this, try

x = x.to('cuda')
t = t.to('cuda')
res = model(x, t)

Does the error still occur with this?

@pinga999 pinga999 changed the title Sudden cudu OOM issue Sudden cuda OOM issue Aug 5, 2023
@wheresmadog
Contributor Author

Hmm, you can just check how memory is used over time this way; I recommended it because seeing how the memory changes might help track down the cause. :)

Source: huggingface/transformers/issues/13019

First of all, thanks for your thoughtful answer.

The scenario in the referenced issue is a model that is run over and over for hyperparameter tuning, which implies it can at least iterate through the model successfully a couple of times. Meanwhile, my problem is that not even one instance makes it through. To be clear: the OOM occurs in the middle of a particular layer, not after a few full passes through the model.


res = model(x.cuda(), t.cuda())

Instead of this, try

x = x.to('cuda')
t = t.to('cuda')
res = model(x, t)

Does the error still occur with this?

Yep, still not working. But I did find the layer at which the error occurs, where the GPU memory allocator suddenly strikes out.
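
For anyone curious, one way to narrow this down without stepping through a debugger is a forward pre-hook that prints the allocator state before each submodule runs (an illustrative sketch, not exactly what I did):

import torch

def attach_memory_hooks(model):
    # print allocated memory right before each submodule's forward pass;
    # the last line printed before the OOM names the offending block
    handles = []
    for name, module in model.named_modules():
        def pre_hook(module, inputs, _name=name):
            print(f'{_name}: {torch.cuda.memory_allocated() / 2**20:.1f} MiB allocated')
        handles.append(module.register_forward_pre_hook(pre_hook))
    return handles  # call .remove() on each handle to detach the hooks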

@wheresmadog
Contributor Author

wheresmadog commented Aug 5, 2023

If anyone tries the code snippet and is unable to reproduce my problem, then it looks more like a hardware issue than a code error. (I hope my GPU hasn't melted down or something.)

@pinga999
Contributor

pinga999 commented Aug 5, 2023

I'd like to help, but I'm sorry to say this :(

I wanted to run the code snippet, but I can't run it properly because I repeatedly get errors like
cannot import name 'CLIPTextModelWithProjection' from 'transformers'

P.S. I'd also like to ask whether you have run into this import error.

@JiHun-Lim

JiHun-Lim commented Aug 6, 2023

For now, it seems to run without any problems on my end! I'll post the GPU memory info as well.

@wheresmadog
Contributor Author

I'd like to help, but I'm sorry to say this :(

I wanted to run the code snippet, but I can't run it properly because I repeatedly get errors like cannot import name 'CLIPTextModelWithProjection' from 'transformers'

P.S. I'd also like to ask whether you have run into this import error.

Well, that's not a case I've run into before. Did you install from the requirements list uploaded in the comment above?
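
A quick sanity check against the pinned versions above (diffusers==0.18.1, transformers==4.30.2) might help; that import error usually means the installed transformers is older than what diffusers expects:

import diffusers, transformers

print('diffusers   :', diffusers.__version__)
print('transformers:', transformers.__version__)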

@wheresmadog
Contributor Author

Thanks for your confirmation. This supports the hypothesis that the likely culprit is hardware.

@wheresmadog
Contributor Author

As for your curiosity, I've figured out which part of the model the memory usage spikes in. It turns out an attention block is the one trying to allocate that massive amount of memory.

image

You can see this block is not a huge one, and almost any GPU should be able to run it; again, batch size is not an issue since it is set to 1.
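
A rough back-of-the-envelope estimate (assuming, for illustration, a 32x32 feature map with 512 channels in fp32 and batch size 1 at this block) shows how far off a 56 GiB request is:

tokens = 32 * 32              # flattened spatial positions at an assumed 32x32 resolution
channels = 512                # assumed channel width at this block
bytes_per_float = 4           # fp32

scores = tokens * tokens * bytes_per_float      # self-attention score matrix
qkv = 3 * tokens * channels * bytes_per_float   # query/key/value projections

print(f'scores ~ {scores / 2**20:.0f} MiB, q/k/v ~ {qkv / 2**20:.0f} MiB')
# scores ~ 4 MiB, q/k/v ~ 6 MiB -- nowhere near 56 GiB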

@cth127
Contributor

cth127 commented Aug 6, 2023

When I ran the snippet above on Google Colab, it only used about 3 GB of GPU memory... so I can't reproduce it, whoa.

@wheresmadog
Contributor Author

image

LOL, now it's asking for over 800 GiB. Am I dealing with some sort of LLM?

@cth127
Contributor

cth127 commented Aug 6, 2023

Maybe related. How about updating your torch version? huggingface/diffusers#4159

@wheresmadog
Contributor Author

Maybe related. How about updating your torch version? huggingface/diffusers#4159

It turns out to be a compatibility issue. Updating to torch 2.0 resolved it.

Thank you.

P.S. But why did it work before?
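
For completeness, a minimal way to confirm the fix after the upgrade (assuming torch>=2.0 with a matching CUDA build) is to rerun the repro and check the peak allocation directly:

import torch
from diffusers import UNet2DModel

model = UNet2DModel.from_pretrained('google/ddpm-cat-256').cuda()
with torch.no_grad():
    res = model(torch.randn(1, 3, 256, 256).cuda(), torch.tensor([0]).cuda())

print(torch.__version__, f'peak {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB')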

@pinga999
Contributor

pinga999 commented Aug 6, 2023

It turns out to be a compatibility issue. Updating to torch 2.0 resolved it.

Thank you.

P.S. But why did it work before?

[image]
