🌟💡 YOLOv5 Study: mAP vs Batch-Size #2452
-
@glenn-jocher Maybe when we train for a large number of epochs we don't see a significant improvement. I ran experiments with batch sizes of 32 and 48 and got better results when I trained with the larger batch size. I trained for 50 epochs, and it happened on multiple datasets.
-
@abhiagwl4262 we always recommend training at the largest batch size possible, not so much for better performance, as the above results don't indicate higher performance with higher batch size, but certainly for faster training and better resource utilization. Multi-GPU may add another angle to the above story though, as larger batch sizes there may help contribute to better results, at least in early training, since the batchnorm stats are split among your CUDA devices.
-
@glenn-jocher Is a high batch size good even for a very small dataset, e.g. 200 images per class?
-
@abhiagwl4262 maybe, as long as you maintain a similar number of iterations. For very small datasets this may require significantly increasing training epochs, i.e. to several thousand, or until you observe overfitting.
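To make the epoch arithmetic concrete, here is a small back-of-the-envelope sketch; the dataset size, batch sizes, and iteration target below are purely illustrative and not taken from this thread. The idea is simply that iterations per epoch shrink as batch size grows, so epochs must grow to compensate.

```python
def epochs_for_constant_iterations(dataset_size, batch_size, target_iterations):
    """Epoch count needed to reach roughly target_iterations training batches."""
    iterations_per_epoch = max(dataset_size // batch_size, 1)
    return max(round(target_iterations / iterations_per_epoch), 1)

# Hypothetical 200-image dataset: a larger batch size needs far more epochs
# to see the same number of weight updates.
print(epochs_for_constant_iterations(200, 8, 30_000))   # ~1200 epochs
print(epochs_for_constant_iterations(200, 64, 30_000))  # ~10000 epochs
```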
-
Hey, good thing to study. But I need to note that results with SyncBN are not reproducible for me. I trained the YOLOv5m model on 8 Tesla A100 GPUs with batch size 256, because DDP only supports the gloo backend and GPU 0 was loaded 50% more than the others (CUDA 11). It would be good to compare SyncBN training with plain BN training.
-
@cszer thanks for the comments! Yes, a --sync study would be interesting as well. What are your observations with and without --sync? Excess CUDA device 0 memory usage was previously related to too-large batch sizes on device 0 when testing, but this bug was fixed on February 6th as part of PR #2148. If your results are from before that then you may want to update your code and see if the problem has been fixed.
-
1-2 mAP@0.5:0.95 lower on COCO
-
@cszer oh wow, that's a significant difference. Do you mean that you see a drop of -1 to -2 mAP on COCO when not using --sync-bn?
-
@glenn-jocher One very strange observation: I am able to run batch size 48 on a single GPU but am not able to run batch size 64 even on 2 GPUs. Is there some bug in the multi-GPU implementation?
-
I was thinking that smaller batch sizes produce better generalization, because some resources discuss this, such as this paper or this thread.
-
Thanks for the analysis, very interesting (but good!) results. I'm wondering whether you think batch size will have an effect on model results for small-object detection (objects are only a couple of px when images are full size: 2048x2048px +++)? The input training dataset is pre-tiled images, already resized to fit within 416x416. I'm in the process of running the comparison now but wanted to hear your thoughts as well. Thanks.
-
I tested the mAP@0.5:0.95 of YOLOv5s v6.0 under different batch sizes. When the batch size is 512 or 1024, the mAP decreases. We used 8 32 GB V100 GPUs with DDP training. The test results are as follows:
-
In this image, why are the validation losses lower than the training losses? Is there an explanation?
-
@glenn-jocher I have probably missed something, but could you explain why the loss is scaled by the batch size value rather than divided by it? (Lines 173 to 175 in 4870064.) I feel that this scaling makes it more dependent on the batch size; wouldn't it be better to divide so as to have some kind of loss normalization?
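For readers wondering the same thing, here is a minimal, self-contained sketch (not the actual YOLOv5 loss code; the toy loss and parameter are invented for illustration) of why multiplying a mean-reduced loss by the batch size pairs naturally with the gradient-accumulation scheme described in the study below: with accumulation toward a nominal 64-image batch, the bs-scaled loss gives each optimizer step a gradient summed over roughly 64 images regardless of batch size, whereas a plain mean loss would make the per-step gradient depend on how many batches were accumulated.

```python
import torch

nbs = 64                               # nominal batch size, as in the study below
w = torch.ones(1, requires_grad=True)  # toy parameter standing in for model weights

def accumulated_grad(batch_size, scale_by_bs):
    """Gradient collected over one simulated optimizer step at a given batch size."""
    accumulate = max(round(nbs / batch_size), 1)  # batches accumulated before stepping
    if w.grad is not None:
        w.grad.zero_()
    torch.manual_seed(0)
    for _ in range(accumulate):
        x = torch.rand(batch_size)
        loss = (w * x).mean()           # mean-reduced stand-in for a per-image loss
        if scale_by_bs:
            loss = loss * batch_size    # scaled back to a sum over the images in the batch
        loss.backward()
    return w.grad.item()

for bs in (8, 16, 64):
    print(bs, round(accumulated_grad(bs, True), 2), round(accumulated_grad(bs, False), 2))
# Middle column (loss * bs): per-step gradient stays roughly constant across batch sizes.
# Right column (plain mean): per-step gradient changes with the number of accumulated batches.
```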
-
Even though it is not 100% related to the batch-size discussion, I observed some unexpected training results from different runs on the same data with different numbers of GPUs. I trained in distributed data-parallel mode for the multi-GPU run on 4 GPUs. I was expecting somewhat similar training behavior for the two yolov5s6 training runs, but they differed quite significantly. All models were trained for 200 epochs, so the learning-rate decay was identical. Is this due to the fact that the loss is accumulated until batch size 64 (per device) and then averaged over all devices, so that the model somehow learns slower and needs more epochs in multi-GPU mode?
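A quick bit of bookkeeping, purely under the assumption stated in the comment above (accumulate to a nominal 64 images per device, then average across devices); the function and numbers below are hypothetical, and the exact rule should be checked against train.py for your version:

```python
def steps_per_epoch(dataset_size, batch_per_gpu, num_gpus, nominal=64):
    """Optimizer steps per epoch and images consumed per step, under the stated assumption."""
    accumulate = max(round(nominal / batch_per_gpu), 1)
    images_per_step = batch_per_gpu * accumulate * num_gpus
    return dataset_size // images_per_step, images_per_step

# Same per-GPU batch on 1 GPU vs 4 GPUs: 4x more images per optimizer step,
# hence ~4x fewer weight updates per epoch, which can look like "slower"
# learning when comparing runs epoch-for-epoch.
print(steps_per_epoch(118_000, 16, 1))  # (1843, 64)
print(steps_per_epoch(118_000, 16, 4))  # (460, 256)
```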
-
For my data, I could manage a much higher batch size with DP than with DDP: 32 for DP, while I cannot go higher than 8 per GPU with DDP.
-
I have 1 dataset with 7 layers and 1079 images, and I am training YOLOv5s. How large a batch size should I use? Thanks.
-
My GPU is a 3060 Ti and I get a "torch.cuda.OutOfMemoryError: CUDA out of memory" error with batch sizes of 16 and above. I'm currently using 8; are there any negative side effects of using such a small batch size? My command is:
-
Study 🤔
I did a quick study to examine the effect of varying batch size on YOLOv5 trainings. The study trained YOLOv5s on COCO for 300 epochs with `--batch-size` at 8 different values: `[16, 20, 32, 40, 64, 80, 96, 128]`.
We've tried to make the train code batch-size agnostic, so that users get similar results at any batch size. This means users on an 11 GB 2080 Ti should be able to produce the same results as users on a 24 GB 3090 or a 40 GB A100, with smaller GPUs simply using smaller batch sizes.
We do this by scaling loss with batch size, and also by scaling weight decay with batch size. At batch sizes smaller than 64 we accumulate loss before optimizing, and at batch sizes of 64 or larger we optimize after every batch.
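Concretely, this bookkeeping can be sketched as follows (a paraphrase, not a verbatim train.py excerpt; the `nbs` and `hyp` names follow YOLOv5 conventions):

```python
# Paraphrased sketch of the batch-size-agnostic rules described above.
batch_size = 16                        # whatever fits on your GPU
hyp = {'weight_decay': 0.0005}         # nominal hyperparameter value

nbs = 64                                               # nominal batch size
accumulate = max(round(nbs / batch_size), 1)           # batches per optimizer step
hyp['weight_decay'] *= batch_size * accumulate / nbs   # scale decay with the effective batch

# The loss itself is multiplied by the batch size (a mean-per-image loss times bs
# is a per-image sum), and the optimizer only steps every `accumulate` batches,
# so each update sees a gradient over roughly nbs images at any --batch-size.
print(accumulate, hyp['weight_decay'])  # 4 0.0005 at batch size 16
```

Note that in this sketch, batch sizes that don't divide 64 evenly (e.g. 20 or 40 in the study) give `batch_size * accumulate / nbs` slightly below or above 1, so the effective weight decay shifts marginally.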
Results 😃
Initial results vary significantly with batch size, but final results are nearly identical (good!). Full details available in our W&B project here: https://wandb.ai/glenn-jocher/batch_size
Closeup of mAP@0.5:0.95:
One oddity that stood out is val objectness loss, which did vary with batch-size. I'm not sure why, as val-box and val-cls did not vary much, and neither did the 3 train losses. I don't know what this means or if there's any room for concern (or improvement).