
Training loss is high when I fine-tune the YOLOX-x model on my own dataset #83

Open
shaoming20798 opened this issue Jul 22, 2021 · 1 comment

Comments


shaoming20798 commented Jul 22, 2021

Dear author, I am using the YOLOX-x model to train on my own COCO-style dataset, which has 18 classes.
Here is my env:

NVIDIA A40, 8 GPUs
cuda: 11.2
cudnn: 8.0.5
pytorch: 1.7
torchvision: 0.8.0
apex: 0.1

My training command is as follows; I did not use fp16 for training:

python tools/train.py -n yolox-x -d 8 -b 8 -o -c /home/shaom/pretrained/yolox_x.pth.tar --opts data_num_workers 4 num_classes 18 output_dir /home/shaom/YOLOX/outputs max_epoch 100

Meanwhile, I changed basic_lr_per_img to 0.0001/8.0.
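For reference, the same overrides could be captured in a custom exp file instead of `--opts`. This is a hypothetical sketch (a file such as `exps/my_yolox_x.py` is an assumption, not something from my actual setup), using attribute names from `yolox/exp/yolox_base.py`:

```python
# Hypothetical exp file (e.g. exps/my_yolox_x.py) -- a sketch only.
# Attribute names follow yolox/exp/yolox_base.py; not runnable without YOLOX installed.
import os
from yolox.exp import Exp as MyExp


class Exp(MyExp):
    def __init__(self):
        super().__init__()
        # YOLOX-x model scaling factors
        self.depth = 1.33
        self.width = 1.25
        # My dataset / training overrides (same values as the --opts above)
        self.num_classes = 18
        self.data_num_workers = 4
        self.max_epoch = 100
        self.basic_lr_per_img = 0.0001 / 8.0
        # Name the experiment after this file, as the official exps do
        self.exp_name = os.path.split(os.path.realpath(__file__))[1].split(".")[0]
```

Such a file would be passed with `-f exps/my_yolox_x.py` instead of the `--opts` overrides.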
The training log is as follows; the total loss does not converge:

epoch: 1/100, iter: 10/3018, mem:39233Mb, iter_time:2.277s, data_time: 0.001s, total_loss: 4.7, iou_loss: 0.0, l1_loss: 0.0, conf_loss: 4.7, cls_loss: 0.0, lr: 4.392e-11, size: 640,
epoch: 1/100, iter: 20/3018, mem:39233Mb, iter_time: 2.782s, data_time: 0.001s, total_loss: 13.1, iou_loss: 2.5, l1_loss: 0.0, conf_loss: 7.5, cls_loss: 3.2, lr: 1.757e-10, size: 832, ETA: 8 days, 20:01:58
epoch: 1/100, iter: 30/3018, mem:39233Mb, iter_time: 2.176s, data_time: 0.001s, total_loss: 16.2, iou_loss: 3.7, l1_loss: 0.0, conf_loss: 10.2, cls_loss: 2.3, lr: 3.952e-10, size: 544,
epoch: 1/100, iter: 40/3018, mem:39233Mb, iter_time: 2.142s, data_time: 0.001s, total_loss: 14.2, iou_loss: 2.2, l1_loss: 0.0, conf_loss: 8.1, cls_loss: 3.9, lr: 7.027e-10, size: 736,
epoch: 1/100, iter: 50/3018, mem:39233Mb, iter_time: 0.834s, data_time: 0.001s, total_loss: 11.7, iou_loss: 1.6, l1_loss: 0.0, conf_loss: 6.3, cls_loss: 3.8, lr: 1.098e-09, size: 544,
epoch: 1/100, iter: 60/3018, mem:39233Mb, iter_time: 3.109s, data_time: 0.001s, total_loss: 13.5, iou_loss: 3.2, l1_loss: 0.0, conf_loss: 6.9, cls_loss: 3.4, lr: 1.581e-09, size: 672,
epoch: 1/100, iter: 70/3018, mem:39233Mb, iter_time: 2.119s, data_time: 0.001s, total_loss: 17.2, iou_loss: 2.1, l1_loss: 0.0, conf_loss: 10.2, cls_loss: 4.9, lr: 2.152e-09, size: 704,
epoch: 1/100, iter: 80/3018, mem:39233Mb, iter_time: 1.880s, data_time: 0.001s, total_loss: 13.0, iou_loss: 2.5, l1_loss: 0.0, conf_loss: 6.6, cls_loss: 4.0, lr: 2.811e-09, size: 800,
epoch: 1/100, iter: 90/3018, mem:39233Mb, iter_time: 0.976s, data_time: 0.001s, total_loss: 13.4, iou_loss: 3.4, l1_loss: 0.0, conf_loss: 7.6, cls_loss: 2.4, lr: 3.577e-09, size: 544,
...
epoch: 1/100, iter: 2050/3018, mem:39233Mb, iter_time: 0.969s, data_time: 0.001s, total_loss: 12.5, iou_loss: 3.6, l1_loss: 0.0, conf_loss: 6.2, cls_loss: 2.7, lr: 1.846e-06, size: 480,
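The very small learning rates in the log look consistent with YOLOX's quadratic warm-up starting from 0. Assuming the default 5 warm-up epochs (an assumption on my part, per `yolox/exp/yolox_base.py`), the logged lr values can be reproduced exactly from my settings, which suggests the flat loss in epoch 1 is warm-up behavior rather than divergence:

```python
def warmup_lr(iteration, iters_per_epoch=3018, warmup_epochs=5,
              basic_lr_per_img=0.0001 / 8.0, batch_size=8, warmup_lr_start=0.0):
    """Sketch of YOLOX's quadratic warm-up (the 'yoloxwarmcos' scheduler's warm-up phase)."""
    base_lr = basic_lr_per_img * batch_size          # 1.25e-5 * 8 = 1e-4 here
    warmup_total_iters = iters_per_epoch * warmup_epochs  # 3018 * 5 = 15090 iters
    return (base_lr - warmup_lr_start) * (iteration / warmup_total_iters) ** 2 + warmup_lr_start

print(f"{warmup_lr(10):.3e}")    # 4.392e-11, matching the logged lr at iter 10
print(f"{warmup_lr(2050):.3e}")  # 1.846e-06, matching the logged lr at iter 2050
```

If this is right, the network is barely updating for most of epoch 1 (lr ~1e-10 to 1e-6), so a high total_loss this early would be expected.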

@Joker316701882
Member

Please provide more details, for example your exp file, your full training log, etc.
