Hello! Thanks for the great work! We are trying to add a mini-reproduction of this model as an example for a PyTorch performance study and optimization effort (pytorch/benchmark#582). The purpose is performance profiling only, so we are not interested in the correctness numbers (for now); we run just 1 batch and 1 epoch on a mini-coco2017 dataset containing only 100 images. When training the model on a single Nvidia V100 GPU (32 GB memory) with the default training batch size of 32, we encountered a CUDA OOM failure.
May I ask if this is the expected behaviour? When training with a single GPU, what is the minimal GPU memory requirement?
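For context, here is roughly how we measure the peak memory of a single training step. This is only a sketch, not our actual benchmark harness; `model`, `images`, and `targets` are placeholders for the EffDet training wrapper and one mini-coco2017 batch.

```python
import torch

def peak_memory_one_step(model, images, targets, device="cuda"):
    # Run a single forward/backward/optimizer step and report peak GPU memory in GiB.
    model = model.to(device).train()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    torch.cuda.reset_peak_memory_stats(device)

    output = model(images.to(device), targets)
    # Training wrappers often return a dict with a 'loss' entry; plain models may
    # return the loss tensor directly.
    loss = output["loss"] if isinstance(output, dict) else output
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    return torch.cuda.max_memory_allocated(device) / 1024 ** 3
```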
@xuzhao9 There isn't really enough information here for me to say whether this is expected, but these models are very memory hungry. I train on 2 RTX Titan or 3090 cards (24 GB each) and am often in the 12-24 per-card batch size range for the lower-end models. Always train with AMP. Memory use also goes up quickly as you increase the model size, so stick to the D0 model for tests.
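A minimal sketch of what an AMP training step looks like with `torch.cuda.amp` (placeholder names, not the exact loop used in this repo); autocast plus loss scaling cuts activation memory significantly on a V100:

```python
import torch

scaler = torch.cuda.amp.GradScaler()

def amp_train_step(model, images, targets, optimizer, device="cuda"):
    # Forward pass under autocast so most ops run in float16, then scale the loss
    # before backward to avoid gradient underflow in half precision.
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        output = model(images.to(device), targets)
        loss = output["loss"] if isinstance(output, dict) else output
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.detach()
```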