
No predictions from model #21

Open
normster opened this issue May 25, 2022 · 3 comments

@normster
Hi,

I'm trying to train a detection model with the plain ViT backbone on 8 GPUs (scaling down both the batch size and the lr by 4x) using the 100-epoch config. Training seems to progress nicely until evaluation, at which point I get the following log statements:

[05/25 13:32:36 fvcore.common.checkpoint]: Saving checkpoint to output/benchmarking_mask_rcnn_base_FPN_100ep_LSJ_mae/model_0005531.pth
[05/25 13:32:40 d2.data.datasets.coco]: Loaded 5000 images in COCO format from datasets/coco/annotations/instances_val2017.json
[05/25 13:32:41 d2.data.dataset_mapper]: [DatasetMapper] Augmentations used in inference: [ResizeShortestEdge(short_edge_length=(1024, 1024), max_size=1024), FixedSizeCrop(crop_size=[1024, 1024])]
[05/25 13:32:41 d2.data.common]: Serializing 5000 elements to byte tensors and concatenating them all ...
[05/25 13:32:41 d2.data.common]: Serialized dataset takes 19.10 MiB
[05/25 13:32:41 d2.evaluation.evaluator]: Start inference on 625 images
[05/25 13:32:58 d2.evaluation.evaluator]: Inference done 11/625. 0.3023 s / img. ETA=0:03:10
[05/25 13:33:03 d2.evaluation.evaluator]: Inference done 28/625. 0.3014 s / img. ETA=0:03:03
[05/25 13:33:08 d2.evaluation.evaluator]: Inference done 45/625. 0.2988 s / img. ETA=0:02:56
[05/25 13:33:13 d2.evaluation.evaluator]: Inference done 61/625. 0.3007 s / img. ETA=0:02:53
[05/25 13:33:19 d2.evaluation.evaluator]: Inference done 78/625. 0.3015 s / img. ETA=0:02:48
[05/25 13:33:24 d2.evaluation.evaluator]: Inference done 94/625. 0.3023 s / img. ETA=0:02:44
[05/25 13:33:29 d2.evaluation.evaluator]: Inference done 110/625. 0.3029 s / img. ETA=0:02:39
[05/25 13:33:34 d2.evaluation.evaluator]: Inference done 127/625. 0.3021 s / img. ETA=0:02:34
[05/25 13:33:39 d2.evaluation.evaluator]: Inference done 144/625. 0.3020 s / img. ETA=0:02:28
[05/25 13:33:44 d2.evaluation.evaluator]: Inference done 161/625. 0.3016 s / img. ETA=0:02:23
[05/25 13:33:49 d2.evaluation.evaluator]: Inference done 177/625. 0.3020 s / img. ETA=0:02:18
[05/25 13:33:54 d2.evaluation.evaluator]: Inference done 193/625. 0.3030 s / img. ETA=0:02:14
[05/25 13:34:00 d2.evaluation.evaluator]: Inference done 210/625. 0.3029 s / img. ETA=0:02:08
[05/25 13:34:05 d2.evaluation.evaluator]: Inference done 226/625. 0.3032 s / img. ETA=0:02:04
[05/25 13:34:10 d2.evaluation.evaluator]: Inference done 242/625. 0.3033 s / img. ETA=0:01:59
[05/25 13:34:15 d2.evaluation.evaluator]: Inference done 259/625. 0.3029 s / img. ETA=0:01:53
[05/25 13:34:20 d2.evaluation.evaluator]: Inference done 275/625. 0.3031 s / img. ETA=0:01:48
[05/25 13:34:25 d2.evaluation.evaluator]: Inference done 292/625. 0.3029 s / img. ETA=0:01:43
[05/25 13:34:31 d2.evaluation.evaluator]: Inference done 309/625. 0.3028 s / img. ETA=0:01:38
[05/25 13:34:36 d2.evaluation.evaluator]: Inference done 326/625. 0.3027 s / img. ETA=0:01:32
[05/25 13:34:41 d2.evaluation.evaluator]: Inference done 342/625. 0.3028 s / img. ETA=0:01:27
[05/25 13:34:46 d2.evaluation.evaluator]: Inference done 359/625. 0.3026 s / img. ETA=0:01:22
[05/25 13:34:51 d2.evaluation.evaluator]: Inference done 376/625. 0.3022 s / img. ETA=0:01:17
[05/25 13:34:56 d2.evaluation.evaluator]: Inference done 393/625. 0.3021 s / img. ETA=0:01:11
[05/25 13:35:02 d2.evaluation.evaluator]: Inference done 410/625. 0.3022 s / img. ETA=0:01:06
[05/25 13:35:07 d2.evaluation.evaluator]: Inference done 426/625. 0.3024 s / img. ETA=0:01:01
[05/25 13:35:12 d2.evaluation.evaluator]: Inference done 443/625. 0.3018 s / img. ETA=0:00:56
[05/25 13:35:17 d2.evaluation.evaluator]: Inference done 460/625. 0.3018 s / img. ETA=0:00:51
[05/25 13:35:22 d2.evaluation.evaluator]: Inference done 477/625. 0.3016 s / img. ETA=0:00:45
[05/25 13:35:27 d2.evaluation.evaluator]: Inference done 493/625. 0.3020 s / img. ETA=0:00:40
[05/25 13:35:33 d2.evaluation.evaluator]: Inference done 510/625. 0.3020 s / img. ETA=0:00:35
[05/25 13:35:38 d2.evaluation.evaluator]: Inference done 527/625. 0.3019 s / img. ETA=0:00:30
[05/25 13:35:43 d2.evaluation.evaluator]: Inference done 543/625. 0.3022 s / img. ETA=0:00:25
[05/25 13:35:48 d2.evaluation.evaluator]: Inference done 560/625. 0.3021 s / img. ETA=0:00:20
[05/25 13:35:53 d2.evaluation.evaluator]: Inference done 577/625. 0.3018 s / img. ETA=0:00:14
[05/25 13:35:58 d2.evaluation.evaluator]: Inference done 593/625. 0.3019 s / img. ETA=0:00:09
[05/25 13:36:03 d2.evaluation.evaluator]: Inference done 610/625. 0.3018 s / img. ETA=0:00:04
[05/25 13:36:08 d2.evaluation.evaluator]: Total inference time: 0:03:12.073198 (0.309795 s / img per device, on 8 devices)
[05/25 13:36:08 d2.evaluation.evaluator]: Total inference pure compute time: 0:03:06 (0.301541 s / img per device, on 8 devices)
[05/25 13:36:08 d2.evaluation.coco_evaluation]: Preparing results for COCO format ...
[05/25 13:36:08 d2.evaluation.coco_evaluation]: Saving results to output/benchmarking_mask_rcnn_base_FPN_100ep_LSJ_mae/coco_instances_results.json
[05/25 13:36:08 d2.evaluation.coco_evaluation]: Evaluating predictions with unofficial COCO API...
WARNING [05/25 13:36:08 d2.evaluation.coco_evaluation]: No predictions from the model!
[05/25 13:36:08 d2.evaluation.testing]: copypaste: Task: bbox
[05/25 13:36:08 d2.evaluation.testing]: copypaste: AP,AP50,AP75,APs,APm,APl
[05/25 13:36:08 d2.evaluation.testing]: copypaste: nan,nan,nan,nan,nan,nan
[05/25 13:36:15 d2.utils.events]:  eta: 1 day, 16:30:37  iter: 5539  total_loss: 1.138  loss_cls: 0.2928  loss_box_reg: 0.2585  loss_mask: 0.3725  loss_rpn_cls: 0.06554  loss_rpn_loc: 0.1448  time: 0.8175  data_time: 0.0220  lr: 1.9955e-05  max_mem: 26732M
[05/25 13:36:31 d2.utils.events]:  eta: 1 day, 16:30:20  iter: 5559  total_loss: 1.207  loss_cls: 0.3106  loss_box_reg: 0.2719  loss_mask: 0.3847  loss_rpn_cls: 0.06758  loss_rpn_loc: 0.1353  time: 0.8175  data_time: 0.0225  lr: 1.9955e-05  max_mem: 26732M

Has anyone else seen this before? Training continues without any apparent problems after evaluation, so divergence doesn't seem to be the issue.

Thanks!

@Yuxin-CV (Member)

Hi, @normster. Thanks for your interest in our work.
I suggest scaling the batch size down by 4x and the lr down by only 2x.
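For reference, the two scaling heuristics in play here can be sketched as below. This is only an illustration: the base values (lr 8e-5, global batch size 64) are hypothetical placeholders, and the real ones come from the repo's 100-epoch config.

```python
import math

def scaled_lr(base_lr: float, base_bs: int, new_bs: int, rule: str = "linear") -> float:
    """Scale the learning rate when the global batch size changes.

    'linear' follows the linear scaling rule (lr proportional to batch size);
    'sqrt' scales lr with the square root of the batch-size ratio, which
    matches the "batch down 4x, lr down 2x" suggestion above.
    """
    ratio = new_bs / base_bs
    if rule == "linear":
        return base_lr * ratio
    if rule == "sqrt":
        return base_lr * math.sqrt(ratio)
    raise ValueError(f"unknown rule: {rule}")

# Hypothetical example: 64 -> 16 total batch size (8 GPUs instead of 32).
print(scaled_lr(8e-5, 64, 16, rule="linear"))  # lr down 4x -> 2e-05
print(scaled_lr(8e-5, 64, 16, rule="sqrt"))    # lr down 2x -> 4e-05
```

Neither rule is guaranteed to reproduce the paper's numbers at a smaller batch size; they are just common starting points for retuning.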

@Yuxin-CV (Member)

I also suggest aligning your environment with ours; please see SysCV/transfiner#17 (comment).

@normster (Author)

Which environment should I use? The environment in the comment you linked differs from the one suggested in this repo's README.

I don't think the torch/d2 versions are the cause: running evaluation on the downloaded weights does produce predictions, and the results are in line with the reported numbers.
