🌟💡 YOLOv5 Study: mAP vs Batch-Size #2452
-
@glenn-jocher Maybe when we train for a large number of epochs we don't see a significant improvement. I ran experiments with batch sizes of 32 and 48 and got better results when I trained with the larger batch size. I trained for 50 epochs, and it happened on multiple datasets.
-
@abhiagwl4262 we always recommend training at the largest batch size possible, not so much for better performance, as the above results don't indicate higher performance with higher batch size, but certainly for faster training and better resource utilization. Multi-GPU may add another angle to the above story though, as larger batch sizes there may help contribute to better results, at least in early training, since the batchnorm stats are split among your CUDA devices.
-
@glenn-jocher Is a high batch size good even for a very small dataset, e.g. 200 images per class?
-
@abhiagwl4262 maybe, as long as you maintain a similar number of iterations. For very small datasets this may require significantly increasing training epochs, i.e. to several thousand, or until you observe overfitting.
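To make the epoch arithmetic concrete, here is a small back-of-the-envelope sketch; the dataset size, batch sizes, and iteration target below are purely illustrative and not taken from this thread. The idea is simply that iterations per epoch shrink as batch size grows, so epochs must grow to compensate.

```python
def epochs_for_constant_iterations(dataset_size, batch_size, target_iterations):
    """Epoch count needed to reach roughly target_iterations training batches."""
    iterations_per_epoch = max(dataset_size // batch_size, 1)
    return max(round(target_iterations / iterations_per_epoch), 1)

# Hypothetical 200-image dataset: a larger batch size needs far more epochs
# to see the same number of weight updates.
print(epochs_for_constant_iterations(200, 8, 30_000))   # ~1200 epochs
print(epochs_for_constant_iterations(200, 64, 30_000))  # ~10000 epochs
```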
-
Hey, good thing to study. But I need to note that results with SyncBN are not reproducible for me. I trained the YOLOv5m model on 8 Tesla A100 GPUs with batch size 256, because DDP only supports the gloo backend and GPU 0 was loaded 50% more than the others (CUDA 11). It would be good to compare SyncBN training with plain BN training.
-
@cszer thanks for the comments! Yes, a --sync study would be interesting as well. What are your observations with and without --sync? Excess CUDA device 0 memory usage was previously related to too-large batch sizes on device 0 when testing, but this bug was fixed on February 6th as part of PR #2148. If your results are from before that then you may want to update your code and see if the problem has been fixed.
-
1-2 mAP@0.5:0.95 lower on COCO
-
@cszer oh wow, that's a significant difference. Do you mean that you see a drop of -1 to -2 mAP on COCO when not using --sync-bn?
-
@glenn-jocher One very strange observation: I am able to run batch size 48 on a single GPU but am not able to run batch size 64 even on 2 GPUs. Is there some bug in the multi-GPU implementation?
-
I was thinking that smaller batch sizes produce better generalization, because some resources discuss this, such as this paper or this thread.
-
Thanks for the analysis, very interesting (but good!) results. I'm wondering whether you think batch size will have an effect on model results for small-object detection (objects are only a couple of px when images are full size: 2048x2048px +++)? The input training dataset is pre-tiled images, already resized to fit within 416x416. I'm in the process of running the comparison now but wanted to hear your thoughts as well. Thanks.
-
I tested the mAP@0.5:0.95 of YOLOv5s v6.0 under different batch sizes. When the batch size is 512 or 1024, the mAP decreases. We used 8 32 GB V100 GPUs with DDP training. The test results are as follows:
-
In this image, why are the validation losses lower than the training losses? Is there an explanation?
-
@glenn-jocher I have probably missed something, but could you explain why the loss is scaled by the batch size value rather than divided by it? (Lines 173 to 175 in 4870064.) I feel that this scaling makes it more dependent on the batch size; wouldn't it be better to divide so as to have some kind of loss normalization?
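For readers wondering the same thing, here is a minimal, self-contained sketch (not the actual YOLOv5 loss code; the toy loss and parameter are invented for illustration) of why multiplying a mean-reduced loss by the batch size pairs naturally with the gradient-accumulation scheme described in the study below: with accumulation toward a nominal 64-image batch, the bs-scaled loss gives each optimizer step a gradient summed over roughly 64 images regardless of batch size, whereas a plain mean loss would make the per-step gradient depend on how many batches were accumulated.

```python
import torch

nbs = 64                               # nominal batch size, as in the study below
w = torch.ones(1, requires_grad=True)  # toy parameter standing in for model weights

def accumulated_grad(batch_size, scale_by_bs):
    """Gradient collected over one simulated optimizer step at a given batch size."""
    accumulate = max(round(nbs / batch_size), 1)  # batches accumulated before stepping
    if w.grad is not None:
        w.grad.zero_()
    torch.manual_seed(0)
    for _ in range(accumulate):
        x = torch.rand(batch_size)
        loss = (w * x).mean()           # mean-reduced stand-in for a per-image loss
        if scale_by_bs:
            loss = loss * batch_size    # scaled back to a sum over the images in the batch
        loss.backward()
    return w.grad.item()

for bs in (8, 16, 64):
    print(bs, round(accumulated_grad(bs, True), 2), round(accumulated_grad(bs, False), 2))
# Middle column (loss * bs): per-step gradient stays roughly constant across batch sizes.
# Right column (plain mean): per-step gradient changes with the number of accumulated batches.
```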
-
Even though it is not 100% related to the batch-size discussion, I observed some unexpected training results from different runs on the same data with different numbers of GPUs. I trained in distributed data-parallel mode for the multi-GPU run on 4 GPUs. I was expecting somewhat similar training behavior for the two yolov5s6 training runs, but they differed quite significantly. All models were trained for 200 epochs, so the learning-rate decay was identical. Is this due to the fact that the loss is accumulated until batch size 64 (per device) and then averaged over all devices, so that the model somehow learns slower and needs more epochs in multi-GPU mode?
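A quick bit of bookkeeping, purely under the assumption stated in the comment above (accumulate to a nominal 64 images per device, then average across devices); the function and numbers below are hypothetical, and the exact rule should be checked against train.py for your version:

```python
def steps_per_epoch(dataset_size, batch_per_gpu, num_gpus, nominal=64):
    """Optimizer steps per epoch and images consumed per step, under the stated assumption."""
    accumulate = max(round(nominal / batch_per_gpu), 1)
    images_per_step = batch_per_gpu * accumulate * num_gpus
    return dataset_size // images_per_step, images_per_step

# Same per-GPU batch on 1 GPU vs 4 GPUs: 4x more images per optimizer step,
# hence ~4x fewer weight updates per epoch, which can look like "slower"
# learning when comparing runs epoch-for-epoch.
print(steps_per_epoch(118_000, 16, 1))  # (1843, 64)
print(steps_per_epoch(118_000, 16, 4))  # (460, 256)
```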
-
For my data, I could manage a much higher batch size with DP than with DDP: 32 for DP, while I cannot go higher than 8 per GPU with DDP.
-
I have 1 dataset with 7 layers and 1079 images, and I am training YOLOv5s. How large a batch size should I use? Thanks.
-
My GPU is a 3060 Ti and I get a "torch.cuda.OutOfMemoryError: CUDA out of memory" error with batch sizes of 16 and above. I'm currently using 8; are there any negative side effects of using such a small batch size? My command is:
-
Study 🤔
I did a quick study to examine the effect of varying batch size on YOLOv5 trainings. The study trained YOLOv5s on COCO for 300 epochs with `--batch-size` at 8 different values: `[16, 20, 32, 40, 64, 80, 96, 128]`.
We've tried to make the train code batch-size agnostic, so that users get similar results at any batch size. This means users on an 11 GB 2080 Ti should be able to produce the same results as users on a 24 GB 3090 or a 40 GB A100, with smaller GPUs simply using smaller batch sizes.
We do this by scaling loss with batch size, and also by scaling weight decay with batch size. At batch sizes smaller than 64 we accumulate loss before optimizing, and at batch sizes of 64 or larger we optimize after every batch.
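Concretely, this bookkeeping can be sketched as follows (a paraphrase, not a verbatim train.py excerpt; the `nbs` and `hyp` names follow YOLOv5 conventions):

```python
# Paraphrased sketch of the batch-size-agnostic rules described above.
batch_size = 16                        # whatever fits on your GPU
hyp = {'weight_decay': 0.0005}         # nominal hyperparameter value

nbs = 64                                               # nominal batch size
accumulate = max(round(nbs / batch_size), 1)           # batches per optimizer step
hyp['weight_decay'] *= batch_size * accumulate / nbs   # scale decay with the effective batch

# The loss itself is multiplied by the batch size (a mean-per-image loss times bs
# is a per-image sum), and the optimizer only steps every `accumulate` batches,
# so each update sees a gradient over roughly nbs images at any --batch-size.
print(accumulate, hyp['weight_decay'])  # 4 0.0005 at batch size 16
```

Note that in this sketch, batch sizes that don't divide 64 evenly (e.g. 20 or 40 in the study) give `batch_size * accumulate / nbs` slightly below or above 1, so the effective weight decay shifts marginally.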
Results 😃
Initial results vary significantly with batch size, but final results are nearly identical (good!). Full details available in our W&B project here: https://wandb.ai/glenn-jocher/batch_size
Closeup of mAP@0.5:0.95:
One oddity that stood out is val objectness loss, which did vary with batch-size. I'm not sure why, as val-box and val-cls did not vary much, and neither did the 3 train losses. I don't know what this means or if there's any room for concern (or improvement).