Network does not converge from scratch #169

ririya · 2020-07-30T18:03:57Z

I've been trying to train the network from scratch using a custom dataset of around 100K images and 8 classes. Usually the network trains until it reaches a training loss of ~23 and then gets stuck there, no matter how many epochs I run.

The only thing that works is transfer learning from the provided COCO-trained models, replacing the class and query layers to match a new number of classes and queries (I've been using 20 queries).
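A minimal sketch of the checkpoint surgery described above. It assumes the class head and query embedding parameters use the key prefixes `class_embed.` and `query_embed.`; check these against your own checkpoint. The helper name `strip_heads` and the dummy keys are hypothetical, and the function works on any plain mapping, so no real weights are needed to try it.

```python
def strip_heads(state_dict, drop_prefixes=("class_embed.", "query_embed.")):
    """Return a copy of state_dict without entries whose key starts with a dropped prefix."""
    return {
        name: value
        for name, value in state_dict.items()
        if not name.startswith(tuple(drop_prefixes))
    }


# Dummy stand-in for the model weights in a checkpoint; real values would be tensors.
dummy_state = {
    "backbone.0.body.conv1.weight": "w0",
    "transformer.encoder.layers.0.linear1.weight": "w1",
    "class_embed.weight": "w2",  # sized for the COCO class count
    "class_embed.bias": "w3",
    "query_embed.weight": "w4",  # sized for 100 queries
    "bbox_embed.layers.0.weight": "w5",
}

kept = strip_heads(dummy_state)
print(sorted(kept))
```

With PyTorch, the filtered dictionary would typically be passed to `model.load_state_dict(kept, strict=False)`, so the newly sized class and query layers keep their fresh initialization.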

I actually got decent results doing the above scheme, but the model is still outperformed by other models such as EfficientDet. So my next step is trying to replace the backbone with an EfficientNet architecture.

The problem is not EfficientNet itself, since I am having convergence issues training from scratch even with the original backbones. But I do believe some backbones make it harder to converge.

Here are a couple things I tried:

Importing the transformer part from the pretrained COCO models and replacing the backbone, both keeping the query and class layers at 100 and 91 and replacing only those layers with 20-query / 8-class layers.
Changing the optimizer (Adam, AdamW, RMSprop)
Changing the learning rate from 10e-3 to 10e-6
Changing batch size (This one worked for smaller backbones such as Resnet50)
Using normal batch norm layers instead of frozen batch norm
Changing the image size. The originals are 1280x720; I tried half- and quarter-size images. I noticed that larger images also make it harder to converge.
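One thing that changes with image size is the number of positions the transformer encoder has to attend over. The sketch below is only a rough illustration, not a diagnosis of the convergence problem. It assumes the image is fed at the stated size with no further resizing or padding, a total backbone stride of 32 (16 for the dilated DC5 variants), and ceiling rounding of the feature-map size. The function name `encoder_tokens` is hypothetical.

```python
import math


def encoder_tokens(height, width, stride=32):
    """Number of feature-map positions for a backbone with the given total stride."""
    return math.ceil(height / stride) * math.ceil(width / stride)


for scale in (1, 2, 4):
    h, w = 720 // scale, 1280 // scale
    n = encoder_tokens(h, w)
    # Self-attention compares every token with every other token.
    print(f"{w}x{h}: {n} tokens, {n * n} attention pairs per head")

# Same full-size image with an assumed stride of 16 (dilated last stage).
print("stride 16:", encoder_tokens(720, 1280, stride=16), "tokens")
```

Under these assumptions, halving the image side lengths cuts the token count by roughly a factor of four, while halving the stride multiplies it by roughly four.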

I also made a few modifications to the code:

Removed all augmentation
Made all layers learnable

I was able to make Resnet50 converge under certain situations, with large batch size, reduced size images and certain learning rates. However, switching to a larger Resnet or changing any of the parameters breaks the training again.

alcinos · 2020-08-05T16:45:33Z

Hi @ririya

Thank you for your interest in DETR.
I have a couple of questions:

> I've been trying to train the network from scratch

Define "from scratch" here. Do you at least use an ImageNet pre-trained backbone? Note that without this, it is not trivial to make the network converge, even on COCO (see #157).

> it reaches a training loss of ~23

Training loss is not very informative. What is the mAP, and how does it compare to your EfficientDet baseline?

ririya · 2020-08-05T16:56:07Z

Hi @alcinos Thx for replying!

All my backbones are pretrained on imagenet.

Whenever the training gets stuck, the mAP also gets stuck below 0.1.

As of now I was able to make it converge using Resnet101, Resnet101-DC5 and also Resnest101.

I was able to train it using the aforementioned mods. I’m always importing the trained transformer from the Resnet101 model and replacing what I need.

With all of these backbones, my results are comparable to EfficientDet D1 (around 0.5 mAP on my dataset).

However, it still doesn't work with the EfficientNet backbone.

alcinos · 2020-08-05T17:14:05Z

I haven't experimented with EfficientNet, so I can't really offer you any guidance there. It might depend on the exact way it is pre-trained. You could try increasing the backbone learning rate a bit, e.g. to 5e-5, and see if it helps.
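A minimal sketch of giving the backbone its own learning rate, assuming backbone parameters can be recognized by the substring `backbone` in their names. The helper `split_param_groups` and the dummy parameter names are hypothetical; the dummy pairs stand in for what `model.named_parameters()` would yield in PyTorch.

```python
def split_param_groups(named_params, lr_backbone=5e-5):
    """Build optimizer-style parameter groups, with a separate lr for backbone parameters."""
    backbone, rest = [], []
    for name, param in named_params:
        (backbone if "backbone" in name else rest).append(param)
    return [
        {"params": rest},  # falls back to the optimizer's default lr
        {"params": backbone, "lr": lr_backbone},
    ]


# Dummy (name, parameter) pairs; real parameters would be tensors.
dummy = [
    ("backbone.0.body.conv1.weight", "p0"),
    ("transformer.encoder.layers.0.linear1.weight", "p1"),
    ("class_embed.weight", "p2"),
]
groups = split_param_groups(dummy)
print([len(g["params"]) for g in groups])
```

With real parameters, a list of groups like this can be passed directly to a PyTorch optimizer such as `torch.optim.AdamW(groups, lr=...)`, where the `lr` argument applies to any group that does not set its own.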

For the rest, I'm a bit surprised that it is so hard for you to converge with an ImageNet pre-trained ResNet backbone and a transformer trained from scratch. Maybe your data distribution is very different from COCO's and you need to think about other data augmentations that may make sense.

Otherwise, I think it is perfectly fine to rely on fine-tuning from a coco-pretrained model as you are doing. I don't really think you need to fiddle with the number of queries though, 100 should be fine.

Best of luck

ririya · 2020-08-05T17:29:55Z

Thanks @alcinos, I'll try to follow your suggestions. I'm already getting good results, but I'm just wondering why it's so hard to get this working sometimes.

ririya · 2020-08-20T19:00:17Z

I was finally able to run it with EfficientNet. I think there was a problem with the imported ImageNet weights.
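The thread does not say what exactly was wrong with the weights. One common cause of silently unused pretrained weights is a mismatch between the parameter names or shapes in the checkpoint and those in the model. The sketch below shows one way to check for that; the helper `compare_keys` and all parameter names in it are made up for illustration.

```python
def compare_keys(model_shapes, checkpoint_shapes):
    """Report parameters that are missing, unexpected, or differently shaped."""
    model_keys, ckpt_keys = set(model_shapes), set(checkpoint_shapes)
    return {
        "missing": sorted(model_keys - ckpt_keys),
        "unexpected": sorted(ckpt_keys - model_keys),
        "shape_mismatch": sorted(
            k for k in model_keys & ckpt_keys
            if tuple(model_shapes[k]) != tuple(checkpoint_shapes[k])
        ),
    }


# Dummy name -> shape mappings; here the checkpoint carries a stray name prefix.
model_shapes = {"stem.weight": (32, 3, 3, 3), "head.weight": (1280, 320, 1, 1)}
ckpt_shapes = {"stem.weight": (32, 3, 3, 3), "module.head.weight": (1280, 320, 1, 1)}

report = compare_keys(model_shapes, ckpt_shapes)
print(report)
```

With PyTorch, the two mappings could be built as `{k: tuple(v.shape) for k, v in model.state_dict().items()}` and the same expression over the loaded checkpoint dictionary.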

However, ResNet101-DC5 still gives me the best results. It is now beating EfficientDet, although inference takes 30 ms longer. I've modified EfficientNet to include dilations, as they seem to be critical. Anxious for the results.

fmassa · 2020-08-21T08:05:19Z

@ririya keep us updated! We are doing some preliminary experiments with EfficientNets and they do seem to work fairly well with DETR.

munirfarzeen · 2020-12-14T02:52:43Z

Hi,
@ririya could you share the hyperparameters you use to train with EfficientNet as a backbone?
That would be a great help.

ririya · 2020-12-14T05:11:31Z

> Hi,
> @ririya could you share the hyperparameters you use to train with EfficientNet as a backbone?
> That would be a great help.

@likui01

The only thing I modified was the learning rate, set to 1e-5 for both the backbone and DETR, and I'm using 30 queries because my images don't have a lot of objects. One thing that helped with convergence was importing the trained transformer from one of the given models and replacing what I needed. I also tried a few different pretrained EfficientNet models; one of them did not work, so maybe there was some problem with its ImageNet weights. Hope this helps you.

munirfarzeen · 2020-12-14T13:09:27Z

@ririya thank you for your reply. I tried changing the learning rate as you suggested, but my network is still not learning, as you can see in the figure. I am using the mobilenet_v2 backbone from PyTorch.

ririya · 2020-12-14T23:26:38Z

@likui01 I haven't tried mobilenet_v2. Does your training converge using the provided Resnet50 pretrained models?

munirfarzeen · 2020-12-15T01:53:43Z

@ririya, yes, it does converge with ResNet50 using pretrained weights.

cyy21 mentioned this issue Nov 26, 2020

Is detr need a big batch_size and at least 10k dataset? #294

Closed
