RuntimeError: CUDA out of memory. Tried to allocate 40.00 MiB (GPU 0; 6.00 GiB total capacity; 177.88 MiB already allocated; 0 bytes free; 4.71 GiB reserved in total by PyTorch)
#124 · Open
haomayang1126 opened this issue Jul 24, 2021 · 7 comments
==============================================================================
2021-07-24 17:55:22 | INFO | yolox.core.trainer:188 - ---> start train epoch1
2021-07-24 17:55:26 | INFO | apex.amp.handle:138 - Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0
2021-07-24 17:55:28 | INFO | apex.amp.handle:138 - Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 16384.0
2021-07-24 17:55:30 | INFO | apex.amp.handle:138 - Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8192.0
2021-07-24 17:55:37 | INFO | yolox.core.trainer:237 - epoch: 1/30, iter: 10/40, mem: 4660Mb, iter_time: 1.570s, data_time: 0.867s, total_loss: 11.0, iou_loss: 3.0, l1_loss: 0.0, conf_loss: 5.7, cls_loss: 2.3, lr: 1.953e-06, size: 640, ETA: 0:31:08
2021-07-24 17:55:43 | INFO | apex.amp.handle:138 - Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4096.0
2021-07-24 17:55:51 | INFO | yolox.core.trainer:237 - epoch: 1/30, iter: 20/40, mem: 4660Mb, iter_time: 1.333s, data_time: 0.762s, total_loss: 10.1, iou_loss: 2.8, l1_loss: 0.0, conf_loss: 4.5, cls_loss: 2.8, lr: 7.813e-06, size: 576, ETA: 0:28:32
2021-07-24 17:55:53 | INFO | yolox.core.trainer:183 - Training of experiment is done and the best AP is 0.00
2021-07-24 17:55:53 | ERROR | yolox.core.launch:73 - An error has been caught in function 'launch', process 'MainProcess' (5488), thread 'MainThread' (6852):
Traceback (most recent call last):
File "tools\train.py", line 111, in
launch(
└ <function launch at 0x00000126EC829E50>
File "g:\pythonproject\yolox-main\yolox\core\launch.py", line 73, in launch
main_func(*args)
│ └ (╒══════════════════╤════════════════════════════════════════════════════════════════════════════════════════════════════════...
└ <function main at 0x00000126EEA76DC0>
File "tools\train.py", line 101, in main
trainer.train()
│ └ <function Trainer.train at 0x00000126EDDBCD30>
└ <yolox.core.trainer.Trainer object at 0x00000126EEAFA970>
File "g:\pythonproject\yolox-main\yolox\core\trainer.py", line 70, in train
self.train_in_epoch()
│ └ <function Trainer.train_in_epoch at 0x00000126EEA44F70>
└ <yolox.core.trainer.Trainer object at 0x00000126EEAFA970>
File "g:\pythonproject\yolox-main\yolox\core\trainer.py", line 79, in train_in_epoch
self.train_in_iter()
│ └ <function Trainer.train_in_iter at 0x00000126EEA55280>
└ <yolox.core.trainer.Trainer object at 0x00000126EEAFA970>
File "g:\pythonproject\yolox-main\yolox\core\trainer.py", line 85, in train_in_iter
self.train_one_iter()
│ └ <function Trainer.train_one_iter at 0x00000126EEA55310>
└ <yolox.core.trainer.Trainer object at 0x00000126EEAFA970>
File "g:\pythonproject\yolox-main\yolox\core\trainer.py", line 91, in train_one_iter
inps, targets = self.prefetcher.next()
│ │ └ <function DataPrefetcher.next at 0x00000126EDDBC310>
│ └ <yolox.data.data_prefetcher.DataPrefetcher object at 0x00000126F8F7D0D0>
└ <yolox.core.trainer.Trainer object at 0x00000126EEAFA970>
File "g:\pythonproject\yolox-main\yolox\data\data_prefetcher.py", line 48, in next
self.preload()
│ └ <function DataPrefetcher.preload at 0x00000126EDDBC280>
└ <yolox.data.data_prefetcher.DataPrefetcher object at 0x00000126F8F7D0D0>
File "g:\pythonproject\yolox-main\yolox\data\data_prefetcher.py", line 37, in preload
self.input_cuda()
│ └ <bound method DataPrefetcher._input_cuda_for_image of <yolox.data.data_prefetcher.DataPrefetcher object at 0x00000126F8F7D0D0>>
└ <yolox.data.data_prefetcher.DataPrefetcher object at 0x00000126F8F7D0D0>
File "g:\pythonproject\yolox-main\yolox\data\data_prefetcher.py", line 52, in _input_cuda_for_image
self.next_input = self.next_input.cuda(non_blocking=True)
│ │ │ │ └ <method 'cuda' of 'torch._C._TensorBase' objects>
│ │ │ └ tensor([[[[ 0.1426, 0.1426, 0.1254, ..., -0.5253, -0.5424, -0.5424],
│ │ │ [ 0.1426, 0.1426, 0.1254, ..., -0.5424, ...
│ │ └ <yolox.data.data_prefetcher.DataPrefetcher object at 0x00000126F8F7D0D0>
│ └ tensor([[[[ 0.1426, 0.1426, 0.1254, ..., -0.5253, -0.5424, -0.5424],
│ [ 0.1426, 0.1426, 0.1254, ..., -0.5424, ...
└ <yolox.data.data_prefetcher.DataPrefetcher object at 0x00000126F8F7D0D0>
RuntimeError: CUDA out of memory. Tried to allocate 40.00 MiB (GPU 0; 6.00 GiB total capacity; 177.88 MiB already allocated; 0 bytes free; 4.71 GiB reserved in total by PyTorch)
(Swin) G:\Pythonproject\YOLOX-main>
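The backtrace ends inside the data prefetcher: the OOM is raised when the next batch is copied to the GPU with `.cuda(non_blocking=True)`, i.e. the model, optimizer state and cached activations have already filled the 6 GiB card, so even a 40 MiB copy fails. For anyone unfamiliar with the pattern, here is a minimal sketch of such a prefetcher; the class and variable names are illustrative, this is not the actual `yolox.data.data_prefetcher.DataPrefetcher`:

```python
import torch

class SimplePrefetcher:
    """Minimal sketch of the async host-to-GPU prefetch pattern used by
    training loops; illustrative only, not the YOLOX implementation."""

    def __init__(self, loader):
        # For the copy to actually be asynchronous the DataLoader should
        # use pin_memory=True.
        self.loader = iter(loader)
        self.stream = torch.cuda.Stream()  # side stream for the async copy
        self.preload()

    def preload(self):
        try:
            self.next_input, self.next_target = next(self.loader)
        except StopIteration:
            self.next_input = self.next_target = None
            return
        with torch.cuda.stream(self.stream):
            # This is the call that raises "CUDA out of memory" in the log:
            # the next batch is moved to the GPU while the current step runs.
            self.next_input = self.next_input.cuda(non_blocking=True)
            self.next_target = self.next_target.cuda(non_blocking=True)

    def next(self):
        # Make sure the copy issued on the side stream has finished.
        torch.cuda.current_stream().wait_stream(self.stream)
        inputs, targets = self.next_input, self.next_target
        self.preload()
        return inputs, targets
```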
After removing -o, this problem still occurs:
RuntimeError: CUDA out of memory. Tried to allocate 100.00 MiB (GPU 0; 10.76 GiB total capacity; 9.62 GiB already allocated; 27.50 MiB free; 9.72 GiB reserved in total by PyTorch)
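As far as I understand it, the -o flag only asks the trainer to occupy GPU memory up front; removing it disables that pre-allocation but does not shrink the working set, and in this second error the training process itself already holds 9.62 GiB of a 10.76 GiB card, so typically the batch size (the -b argument, if I remember the CLI correctly) or the input size has to come down. As a rough illustration of what an "occupy memory" flag does, here is a sketch; the helper name and the 0.9 fraction are my own choices, this is not the exact YOLOX utility:

```python
import torch

def occupy_gpu_memory(device: int = 0, fraction: float = 0.9) -> None:
    """Roughly what an "occupy memory" (-o style) flag does: allocate a large
    block up front, then release it so the memory stays in PyTorch's caching
    allocator instead of being grabbed by other processes. Sketch only."""
    total = torch.cuda.get_device_properties(device).total_memory
    already_reserved = torch.cuda.memory_reserved(device)
    to_reserve = int(total * fraction) - already_reserved
    if to_reserve > 0:
        # May itself raise OOM if other processes already hold the memory.
        block = torch.empty(to_reserve, dtype=torch.uint8, device=f"cuda:{device}")
        del block  # returned to the caching allocator, which keeps it reserved
```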
I didn't run into this problem during training, but when I tried to test the model after converting it with TensorRT (trt), I hit the same error. How can I solve it?
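For the TensorRT case, one common cause is that the original PyTorch model (or PyTorch's reserved cache) is still resident in the same process when the TRT engine is loaded, so the engine's workspace and activations push the card over the limit. A hedged sketch of releasing the PyTorch side first; the helper name is illustrative and whether this is sufficient depends on what else holds memory:

```python
import gc
import torch

def release_torch_gpu_memory(model=None):
    """Drop the PyTorch model's GPU footprint and return cached blocks to the
    driver before loading/running a TensorRT engine in the same process.
    Note: the caller must also drop its own references to the model."""
    if model is not None:
        model.cpu()      # move weights off the GPU
        del model
    gc.collect()
    torch.cuda.empty_cache()   # release PyTorch's cached (reserved) memory
    torch.cuda.synchronize()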
As soon as the command starts running, the GPU memory fills up; sometimes the error occurs in the first epoch, sometimes in the third.
Setting num_workers to 4, 2, or 0 all gives the same problem.
Environment:
python 3.8
pytorch 1.8.1
cuda 10.1
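One diagnostic note on the numbers in the first error: only 177.88 MiB is allocated to live tensors and 4.71 GiB is reserved by PyTorch's caching allocator, yet the 6 GiB card reports 0 bytes free, so a sizeable chunk is held outside this process (CUDA context, the Windows desktop, or other programs); checking nvidia-smi together with the allocator counters makes that visible. A small sketch using counters that exist in PyTorch 1.8; the helper name is illustrative:

```python
import torch

def report_gpu_memory(device: int = 0) -> None:
    """Print how much memory torch tensors actually use vs. how much the
    caching allocator has reserved; a large gap, combined with external
    consumers shown by nvidia-smi, usually explains "0 bytes free" errors."""
    torch.cuda.synchronize(device)
    allocated = torch.cuda.memory_allocated(device) / 2**20   # MiB in tensors
    reserved = torch.cuda.memory_reserved(device) / 2**20     # MiB held by PyTorch
    total = torch.cuda.get_device_properties(device).total_memory / 2**20
    print(f"allocated: {allocated:.0f} MiB, reserved: {reserved:.0f} MiB, "
          f"total: {total:.0f} MiB")
```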