Network does not converge from scratch #169
Comments
Hi @ririya Thank you for your interest in DETR.
Define "from scratch" here. Do you at least use an ImageNet pre-trained backbone? Note that without this, it is not trivial to make the network converge, even on COCO (see #157).
Training loss is not very informative. What is the mAP, and how does it compare to your EfficientDet baseline?
Hi @alcinos, thanks for replying! All my backbones are pretrained on ImageNet. Whenever the training gets stuck, the mAP also gets stuck below 0.1. As of now I was able to make it converge using Resnet101, Resnet101-DC5 and also Resnest101. I was able to train it using the aforementioned mods. I'm always importing the trained transformer from the Resnet101 model and replacing what I need. With all of these backbones my results are comparable to EfficientDet D1 (around 0.5 mAP on my dataset). However, it still doesn't work with the EfficientNet backbone.
I haven't experimented with EfficientNet so I can't really offer you any guidance there. It might depend on the exact way it is pre-trained. You could try increasing the backbone learning rate a bit, e.g. to 5e-5, and see if it helps. For the rest, I'm a bit surprised that it is so hard for you to converge with an ImageNet pre-trained ResNet backbone and a transformer trained from scratch. Maybe your data distribution is very different from COCO and you need to think about other data augmentations that may make sense. Otherwise, I think it is perfectly fine to rely on fine-tuning from a COCO-pretrained model as you are doing. I don't really think you need to fiddle with the number of queries though, 100 should be fine. Best of luck
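For readers who want to try the backbone learning rate suggestion, here is a minimal sketch of giving the backbone its own learning rate through optimizer parameter groups. It assumes backbone parameters can be recognised by the substring `backbone` in their names; the helper name `build_param_groups` is hypothetical and not part of the DETR code.

```python
def build_param_groups(named_params, lr_backbone=5e-5):
    """Split (name, parameter) pairs into two optimizer parameter groups.

    Assumption: backbone parameters contain "backbone" in their name.
    The first group carries no "lr" key, so it falls back to the
    optimizer's default learning rate.
    """
    backbone, rest = [], []
    for name, param in named_params:
        # Skip frozen parameters; objects without the attribute are kept.
        if not getattr(param, "requires_grad", True):
            continue
        if "backbone" in name:
            backbone.append(param)
        else:
            rest.append(param)
    return [
        {"params": rest},
        {"params": backbone, "lr": lr_backbone},
    ]

# Possible usage with PyTorch (not executed here):
# optimizer = torch.optim.AdamW(
#     build_param_groups(model.named_parameters(), lr_backbone=5e-5),
#     lr=1e-4, weight_decay=1e-4)
```

The returned list can be passed directly as the first argument of a PyTorch optimizer, which applies the per-group `lr` where present and the default elsewhere. The `1e-4` values in the usage comment are placeholders, not recommendations from this thread.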
Thanks @alcinos, I'll try to follow your suggestions. I'm already getting good results but just wondering why it's so hard to get this working sometimes.
I was finally able to run it with EfficientNet. I think there was a problem with the imported ImageNet weights. However, resnet101-dc5 still gives me the best results. It is now beating EfficientDet, although inference takes 30 ms longer. I've modified EfficientNet to include dilations, as they seem to be critical. Anxious for the results.
@ririya keep us updated! We are doing some preliminary experiments with EfficientNets and they do seem to work fairly well with DETR.
Hi,
@likui01 The only thing I modified was the learning rate, which I set to 1e-5 for both the backbone and DETR, and I'm using 30 queries because my images don't have a lot of objects. One thing that helped with the convergence was importing the trained transformer from one of the given models and replacing what I needed. I also tried a few different pretrained EfficientNet models. One of them did not work; maybe there was some problem with the ImageNet weights. Hope this helps you.
@ririya thank you for your reply. I tried changing the learning rate like you suggested, but my network is still not learning, as you can see in the figure. I am using the mobilenet_v2 backbone from PyTorch.
@likui01 I haven't tried mobilenet_v2. Does your training converge using the provided Resnet50 pretrained models?
@ririya, yes, it does converge with resnet50 using the pretrained weights.
I've been trying to train the network from scratch using a custom dataset of around 100K images and 8 classes. Usually the network trains until it reaches a training loss of ~23 and then gets stuck there no matter how many epochs it runs.
The only thing that works is transfer learning from the trained coco models provided and replacing the transformer with a new number of classes and queries (I've been using 20 queries).
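As an illustration of the transfer-learning scheme described above, here is a minimal sketch that drops the weights tied to the number of classes and queries from a pretrained state dict, so that the remaining weights can be loaded into a model built with different settings. It assumes the task-specific layers are named `class_embed` and `query_embed` and that the checkpoint stores its weights under a `"model"` key; the helper name and the file path are hypothetical.

```python
def strip_task_specific_weights(state_dict,
                                prefixes=("class_embed", "query_embed")):
    """Return a copy of state_dict without class- and query-dependent entries.

    Assumption: the classification head and the query embedding are stored
    under keys starting with the given prefixes. Their shapes depend on the
    number of classes and queries, so they cannot be reused as-is.
    """
    return {
        key: value
        for key, value in state_dict.items()
        if not key.startswith(prefixes)
    }

# Possible usage with PyTorch (not executed here; the path is hypothetical):
# checkpoint = torch.load("detr_coco_checkpoint.pth", map_location="cpu")
# filtered = strip_task_specific_weights(checkpoint["model"])
# model.load_state_dict(filtered, strict=False)  # dropped layers stay randomly initialised
```

Loading with `strict=False` leaves the dropped layers at their fresh initialisation, which matches the idea of keeping the trained backbone and transformer while retraining only the parts that depend on the new number of classes and queries.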
I actually got decent results doing the above scheme, but the model is still outperformed by other models such as EfficientDet. So my next step is trying to replace the backbone with an EfficientNet architecture.
The problem is not EfficientNet itself, since I am having convergence issues training from scratch even with the original backbones. But I do believe some backbones make it harder to converge.
Here are a couple things I tried:
I also made a few modifications to the code:
I was able to make Resnet50 converge under certain situations, with large batch size, reduced size images and certain learning rates. However, switching to a larger Resnet or changing any of the parameters breaks the training again.