
RuntimeError: CUDA out of memory. Tried to allocate 40.00 MiB (GPU 0; 6.00 GiB total capacity; 177.88 MiB already allocated; 0 bytes free; 4.71 GiB reserved in total by PyTorch) #124

Open
haomayang1126 opened this issue Jul 24, 2021 · 7 comments

@haomayang1126

As soon as the training command starts running, GPU memory fills up completely. Sometimes the error occurs in the first epoch, sometimes in the third.
Setting num_workers to 4, 2, or 0 gives the same problem.
Environment:
Python 3.8
PyTorch 1.8.1
CUDA 10.1

==============================================================================
2021-07-24 17:55:22 | INFO | yolox.core.trainer:188 - ---> start train epoch1
2021-07-24 17:55:26 | INFO | apex.amp.handle:138 - Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0
2021-07-24 17:55:28 | INFO | apex.amp.handle:138 - Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 16384.0
2021-07-24 17:55:30 | INFO | apex.amp.handle:138 - Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8192.0
2021-07-24 17:55:37 | INFO | yolox.core.trainer:237 - epoch: 1/30, iter: 10/40, mem: 4660Mb, iter_time: 1.570s, data_time: 0.867s, total_loss: 11.0, iou_loss: 3.0, l1_loss: 0.0, conf_loss: 5.7, cls_loss: 2.3, lr: 1.953e-06, size: 640, ETA: 0:31:08
2021-07-24 17:55:43 | INFO | apex.amp.handle:138 - Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4096.0
2021-07-24 17:55:51 | INFO | yolox.core.trainer:237 - epoch: 1/30, iter: 20/40, mem: 4660Mb, iter_time: 1.333s, data_time: 0.762s, total_loss: 10.1, iou_loss: 2.8, l1_loss: 0.0, conf_loss: 4.5, cls_loss: 2.8, lr: 7.813e-06, size: 576, ETA: 0:28:32
2021-07-24 17:55:53 | INFO | yolox.core.trainer:183 - Training of experiment is done and the best AP is 0.00
2021-07-24 17:55:53 | ERROR | yolox.core.launch:73 - An error has been caught in function 'launch', process 'MainProcess' (5488), thread 'MainThread' (6852):
Traceback (most recent call last):

File "tools\train.py", line 111, in
launch(
└ <function launch at 0x00000126EC829E50>

File "g:\pythonproject\yolox-main\yolox\core\launch.py", line 73, in launch
main_func(*args)
│ └ (╒══════════════════╤════════════════════════════════════════════════════════════════════════════════════════════════════════...
└ <function main at 0x00000126EEA76DC0>

File "tools\train.py", line 101, in main
trainer.train()
│ └ <function Trainer.train at 0x00000126EDDBCD30>
└ <yolox.core.trainer.Trainer object at 0x00000126EEAFA970>

File "g:\pythonproject\yolox-main\yolox\core\trainer.py", line 70, in train
self.train_in_epoch()
│ └ <function Trainer.train_in_epoch at 0x00000126EEA44F70>
└ <yolox.core.trainer.Trainer object at 0x00000126EEAFA970>

File "g:\pythonproject\yolox-main\yolox\core\trainer.py", line 79, in train_in_epoch
self.train_in_iter()
│ └ <function Trainer.train_in_iter at 0x00000126EEA55280>
└ <yolox.core.trainer.Trainer object at 0x00000126EEAFA970>

File "g:\pythonproject\yolox-main\yolox\core\trainer.py", line 85, in train_in_iter
self.train_one_iter()
│ └ <function Trainer.train_one_iter at 0x00000126EEA55310>
└ <yolox.core.trainer.Trainer object at 0x00000126EEAFA970>

File "g:\pythonproject\yolox-main\yolox\core\trainer.py", line 91, in train_one_iter
inps, targets = self.prefetcher.next()
│ │ └ <function DataPrefetcher.next at 0x00000126EDDBC310>
│ └ <yolox.data.data_prefetcher.DataPrefetcher object at 0x00000126F8F7D0D0>
└ <yolox.core.trainer.Trainer object at 0x00000126EEAFA970>

File "g:\pythonproject\yolox-main\yolox\data\data_prefetcher.py", line 48, in next
self.preload()
│ └ <function DataPrefetcher.preload at 0x00000126EDDBC280>
└ <yolox.data.data_prefetcher.DataPrefetcher object at 0x00000126F8F7D0D0>

File "g:\pythonproject\yolox-main\yolox\data\data_prefetcher.py", line 37, in preload
self.input_cuda()
│ └ <bound method DataPrefetcher._input_cuda_for_image of <yolox.data.data_prefetcher.DataPrefetcher object at 0x00000126F8F7D0D0>>
└ <yolox.data.data_prefetcher.DataPrefetcher object at 0x00000126F8F7D0D0>

File "g:\pythonproject\yolox-main\yolox\data\data_prefetcher.py", line 52, in _input_cuda_for_image
self.next_input = self.next_input.cuda(non_blocking=True)
│ │ │ │ └ <method 'cuda' of 'torch._C._TensorBase' objects>
│ │ │ └ tensor([[[[ 0.1426, 0.1426, 0.1254, ..., -0.5253, -0.5424, -0.5424],
│ │ │ [ 0.1426, 0.1426, 0.1254, ..., -0.5424, ...
│ │ └ <yolox.data.data_prefetcher.DataPrefetcher object at 0x00000126F8F7D0D0>
│ └ tensor([[[[ 0.1426, 0.1426, 0.1254, ..., -0.5253, -0.5424, -0.5424],
│ [ 0.1426, 0.1426, 0.1254, ..., -0.5424, ...
└ <yolox.data.data_prefetcher.DataPrefetcher object at 0x00000126F8F7D0D0>

RuntimeError: CUDA out of memory. Tried to allocate 40.00 MiB (GPU 0; 6.00 GiB total capacity; 177.88 MiB already allocated; 0 bytes free; 4.71 GiB reserved in total by PyTorch)

(Swin) G:\Pythonproject\YOLOX-main>

@Joker316701882
Member

Joker316701882 commented Jul 25, 2021

It's a known error, see #91. We are working on it now.

@1VeniVediVeci1
Copy link

Try removing the -o option from the training command.
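
(For context: the -o/--occupy option of tools/train.py makes YOLOX occupy most of the free GPU memory up front to avoid fragmentation, which on a 6 GiB card leaves little headroom for the training tensors themselves. Below is a minimal sketch of what such an occupy helper does; it is an approximation for illustration, not the exact YOLOX implementation.)

import torch

def occupy_gpu_memory(device: int = 0, fraction: float = 0.9) -> None:
    # Hypothetical helper approximating the -o/--occupy behaviour: allocate
    # most of the currently free GPU memory in one block, then drop the
    # Python reference. The blocks stay in PyTorch's caching allocator,
    # which is the "reserved in total by PyTorch" figure in the OOM message.
    # Note: torch.cuda.mem_get_info requires a fairly recent PyTorch release.
    free_bytes, _total_bytes = torch.cuda.mem_get_info(device)
    n_floats = int(free_bytes * fraction) // 4  # float32 = 4 bytes per element
    block = torch.empty(n_floats, dtype=torch.float32, device=f"cuda:{device}")
    del block  # memory is not returned to the driver, only to the allocator cache

Without -o, PyTorch only reserves what the batches actually require, which is why dropping the flag can resolve the error on small cards.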

@haomayang1126
Author

Try removing the -o option from the training command.

Thanks~ the problem is solved.

@haomayang1126
Author

It's a known error #91 . We are working on it now.

Deleting -o from the command works.

@lyp-oss

lyp-oss commented Jul 27, 2021

Try removing the -o option from the training command.

I still get this problem after removing -o:
RuntimeError: CUDA out of memory. Tried to allocate 100.00 MiB (GPU 0; 10.76 GiB total capacity; 9.62 GiB already allocated; 27.50 MiB free; 9.72 GiB reserved in total by PyTorch)

@GOATmessi8
Member

Then you have to reduce your batch size or choose a smaller model like yolox-tiny or yolox-s.
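
(Before shrinking the batch size, it can help to check what PyTorch itself reports as allocated versus merely reserved. A quick diagnostic using standard PyTorch calls, assuming the GPU in question is device 0:)

import torch

device = torch.device("cuda:0")
total = torch.cuda.get_device_properties(device).total_memory  # bytes on the card
allocated = torch.cuda.memory_allocated(device)   # bytes held by live tensors
reserved = torch.cuda.memory_reserved(device)     # bytes cached by the allocator
print(f"total {total / 2**30:.2f} GiB | "
      f"allocated {allocated / 2**20:.1f} MiB | "
      f"reserved {reserved / 2**20:.1f} MiB")

# If "reserved" dwarfs "allocated", releasing the cached blocks can free some
# headroom before resorting to a smaller batch size or a smaller model:
torch.cuda.empty_cache()

If memory is genuinely exhausted, the usual next step is lowering the total batch size passed with -b to tools/train.py, or switching the exp to yolox-s or yolox-tiny as suggested above.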

@LamnouarMohamed

Hi,

I didn't run into this problem during training, but when I tried to test the model after converting it with TRT, I hit the same error. How can I solve it?

Thank you
