
RuntimeError: CUDA out of memory. Tried to allocate 40.00 MiB (GPU 0; 6.00 GiB total capacity; 177.88 MiB already allocated; 0 bytes free; 4.71 GiB reserved in total by PyTorch) #124

Open
haomayang1126 opened this issue Jul 24, 2021 · 7 comments

@haomayang1126

As soon as the training command starts running, GPU memory fills up completely. Sometimes the error occurs in the first epoch, sometimes in the third.
Setting num_workers to 4, 2, or 0 gives the same problem.
Environment:
Python 3.8
PyTorch 1.8.1
CUDA 10.1

==============================================================================
2021-07-24 17:55:22 | INFO | yolox.core.trainer:188 - ---> start train epoch1
2021-07-24 17:55:26 | INFO | apex.amp.handle:138 - Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0
2021-07-24 17:55:28 | INFO | apex.amp.handle:138 - Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 16384.0
2021-07-24 17:55:30 | INFO | apex.amp.handle:138 - Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8192.0
2021-07-24 17:55:37 | INFO | yolox.core.trainer:237 - epoch: 1/30, iter: 10/40, mem: 4660Mb, iter_time: 1.570s, data_time: 0.867s, total_loss: 11.0, iou_loss: 3.0, l1_loss: 0.0, conf_loss: 5.7, cls_loss: 2.3, lr: 1.953e-06, size: 640, ETA: 0:31:08
2021-07-24 17:55:43 | INFO | apex.amp.handle:138 - Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4096.0
2021-07-24 17:55:51 | INFO | yolox.core.trainer:237 - epoch: 1/30, iter: 20/40, mem: 4660Mb, iter_time: 1.333s, data_time: 0.762s, total_loss: 10.1, iou_loss: 2.8, l1_loss: 0.0, conf_loss: 4.5, cls_loss: 2.8, lr: 7.813e-06, size: 576, ETA: 0:28:32
2021-07-24 17:55:53 | INFO | yolox.core.trainer:183 - Training of experiment is done and the best AP is 0.00
2021-07-24 17:55:53 | ERROR | yolox.core.launch:73 - An error has been caught in function 'launch', process 'MainProcess' (5488), thread 'MainThread' (6852):
Traceback (most recent call last):

File "tools\train.py", line 111, in
launch(
└ <function launch at 0x00000126EC829E50>

File "g:\pythonproject\yolox-main\yolox\core\launch.py", line 73, in launch
main_func(*args)
│ └ (╒══════════════════╤════════════════════════════════════════════════════════════════════════════════════════════════════════...
└ <function main at 0x00000126EEA76DC0>

File "tools\train.py", line 101, in main
trainer.train()
│ └ <function Trainer.train at 0x00000126EDDBCD30>
└ <yolox.core.trainer.Trainer object at 0x00000126EEAFA970>

File "g:\pythonproject\yolox-main\yolox\core\trainer.py", line 70, in train
self.train_in_epoch()
│ └ <function Trainer.train_in_epoch at 0x00000126EEA44F70>
└ <yolox.core.trainer.Trainer object at 0x00000126EEAFA970>

File "g:\pythonproject\yolox-main\yolox\core\trainer.py", line 79, in train_in_epoch
self.train_in_iter()
│ └ <function Trainer.train_in_iter at 0x00000126EEA55280>
└ <yolox.core.trainer.Trainer object at 0x00000126EEAFA970>

File "g:\pythonproject\yolox-main\yolox\core\trainer.py", line 85, in train_in_iter
self.train_one_iter()
│ └ <function Trainer.train_one_iter at 0x00000126EEA55310>
└ <yolox.core.trainer.Trainer object at 0x00000126EEAFA970>

File "g:\pythonproject\yolox-main\yolox\core\trainer.py", line 91, in train_one_iter
inps, targets = self.prefetcher.next()
│ │ └ <function DataPrefetcher.next at 0x00000126EDDBC310>
│ └ <yolox.data.data_prefetcher.DataPrefetcher object at 0x00000126F8F7D0D0>
└ <yolox.core.trainer.Trainer object at 0x00000126EEAFA970>

File "g:\pythonproject\yolox-main\yolox\data\data_prefetcher.py", line 48, in next
self.preload()
│ └ <function DataPrefetcher.preload at 0x00000126EDDBC280>
└ <yolox.data.data_prefetcher.DataPrefetcher object at 0x00000126F8F7D0D0>

File "g:\pythonproject\yolox-main\yolox\data\data_prefetcher.py", line 37, in preload
self.input_cuda()
│ └ <bound method DataPrefetcher._input_cuda_for_image of <yolox.data.data_prefetcher.DataPrefetcher object at 0x00000126F8F7D0D0>>
└ <yolox.data.data_prefetcher.DataPrefetcher object at 0x00000126F8F7D0D0>

File "g:\pythonproject\yolox-main\yolox\data\data_prefetcher.py", line 52, in _input_cuda_for_image
self.next_input = self.next_input.cuda(non_blocking=True)
│ │ │ │ └ <method 'cuda' of 'torch._C._TensorBase' objects>
│ │ │ └ tensor([[[[ 0.1426, 0.1426, 0.1254, ..., -0.5253, -0.5424, -0.5424],
│ │ │ [ 0.1426, 0.1426, 0.1254, ..., -0.5424, ...
│ │ └ <yolox.data.data_prefetcher.DataPrefetcher object at 0x00000126F8F7D0D0>
│ └ tensor([[[[ 0.1426, 0.1426, 0.1254, ..., -0.5253, -0.5424, -0.5424],
│ [ 0.1426, 0.1426, 0.1254, ..., -0.5424, ...
└ <yolox.data.data_prefetcher.DataPrefetcher object at 0x00000126F8F7D0D0>

RuntimeError: CUDA out of memory. Tried to allocate 40.00 MiB (GPU 0; 6.00 GiB total capacity; 177.88 MiB already allocated; 0 bytes free; 4.71 GiB reserved in total by PyTorch)

(Swin) G:\Pythonproject\YOLOX-main>

@Joker316701882
Member

Joker316701882 commented Jul 25, 2021

It's a known error, see #91. We are working on it now.

@1VeniVediVeci1
Copy link

Try removing the -o option from the training command.
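
(For context: the -o/--occupy option of tools/train.py makes YOLOX occupy most of the free GPU memory up front to avoid fragmentation, which on a 6 GiB card leaves little headroom for the training tensors themselves. Below is a minimal sketch of what such an occupy helper does; it is an approximation for illustration, not the exact YOLOX implementation.)

import torch

def occupy_gpu_memory(device: int = 0, fraction: float = 0.9) -> None:
    # Hypothetical helper approximating the -o/--occupy behaviour: allocate
    # most of the currently free GPU memory in one block, then drop the
    # Python reference. The blocks stay in PyTorch's caching allocator,
    # which is the "reserved in total by PyTorch" figure in the OOM message.
    # Note: torch.cuda.mem_get_info requires a fairly recent PyTorch release.
    free_bytes, _total_bytes = torch.cuda.mem_get_info(device)
    n_floats = int(free_bytes * fraction) // 4  # float32 = 4 bytes per element
    block = torch.empty(n_floats, dtype=torch.float32, device=f"cuda:{device}")
    del block  # memory is not returned to the driver, only to the allocator cache

Without -o, PyTorch only reserves what the batches actually require, which is why dropping the flag can resolve the error on small cards.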

@haomayang1126
Author

Try removing the -o option from the training command.

Thanks~ the problem is solved.

@haomayang1126
Author

It's a known error #91 . We are working on it now.

Deleting -o from the command works.

@lyp-oss

lyp-oss commented Jul 27, 2021

Try removing the -o option from the training command.

I still get this problem after removing -o:
RuntimeError: CUDA out of memory. Tried to allocate 100.00 MiB (GPU 0; 10.76 GiB total capacity; 9.62 GiB already allocated; 27.50 MiB free; 9.72 GiB reserved in total by PyTorch)

@GOATmessi8
Member

Then you have to reduce your batch size or choose a smaller model like yolox-tiny or yolox-s.
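
(Before shrinking the batch size, it can help to check what PyTorch itself reports as allocated versus merely reserved. A quick diagnostic using standard PyTorch calls, assuming the GPU in question is device 0:)

import torch

device = torch.device("cuda:0")
total = torch.cuda.get_device_properties(device).total_memory  # bytes on the card
allocated = torch.cuda.memory_allocated(device)   # bytes held by live tensors
reserved = torch.cuda.memory_reserved(device)     # bytes cached by the allocator
print(f"total {total / 2**30:.2f} GiB | "
      f"allocated {allocated / 2**20:.1f} MiB | "
      f"reserved {reserved / 2**20:.1f} MiB")

# If "reserved" dwarfs "allocated", releasing the cached blocks can free some
# headroom before resorting to a smaller batch size or a smaller model:
torch.cuda.empty_cache()

If memory is genuinely exhausted, the usual next step is lowering the total batch size passed with -b to tools/train.py, or switching the exp to yolox-s or yolox-tiny as suggested above.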

@LamnouarMohamed

Hi,

I didn't run into this problem during training, but when I tried to test the model after converting it with TRT, I hit the same error. How can I solve it?

Thank you
