Hi,

I'm trying to train a detection model with the plain ViT backbone on 8 GPUs (scaling the batch size and learning rate down 4x) using the 100-epoch config. Training seems to progress nicely until evaluation, at which point I get the log statements below (a sketch of the overrides I used follows the log):
```
[05/25 13:32:36 fvcore.common.checkpoint]: Saving checkpoint to output/benchmarking_mask_rcnn_base_FPN_100ep_LSJ_mae/model_0005531.pth
[05/25 13:32:40 d2.data.datasets.coco]: Loaded 5000 images in COCO format from datasets/coco/annotations/instances_val2017.json
[05/25 13:32:41 d2.data.dataset_mapper]: [DatasetMapper] Augmentations used in inference: [ResizeShortestEdge(short_edge_length=(1024, 1024), max_size=1024), FixedSizeCrop(crop_size=[1024, 1024])]
[05/25 13:32:41 d2.data.common]: Serializing 5000 elements to byte tensors and concatenating them all ...
[05/25 13:32:41 d2.data.common]: Serialized dataset takes 19.10 MiB
[05/25 13:32:41 d2.evaluation.evaluator]: Start inference on 625 images
[05/25 13:32:58 d2.evaluation.evaluator]: Inference done 11/625. 0.3023 s / img. ETA=0:03:10
[05/25 13:33:03 d2.evaluation.evaluator]: Inference done 28/625. 0.3014 s / img. ETA=0:03:03
[05/25 13:33:08 d2.evaluation.evaluator]: Inference done 45/625. 0.2988 s / img. ETA=0:02:56
[05/25 13:33:13 d2.evaluation.evaluator]: Inference done 61/625. 0.3007 s / img. ETA=0:02:53
[05/25 13:33:19 d2.evaluation.evaluator]: Inference done 78/625. 0.3015 s / img. ETA=0:02:48
[05/25 13:33:24 d2.evaluation.evaluator]: Inference done 94/625. 0.3023 s / img. ETA=0:02:44
[05/25 13:33:29 d2.evaluation.evaluator]: Inference done 110/625. 0.3029 s / img. ETA=0:02:39
[05/25 13:33:34 d2.evaluation.evaluator]: Inference done 127/625. 0.3021 s / img. ETA=0:02:34
[05/25 13:33:39 d2.evaluation.evaluator]: Inference done 144/625. 0.3020 s / img. ETA=0:02:28
[05/25 13:33:44 d2.evaluation.evaluator]: Inference done 161/625. 0.3016 s / img. ETA=0:02:23
[05/25 13:33:49 d2.evaluation.evaluator]: Inference done 177/625. 0.3020 s / img. ETA=0:02:18
[05/25 13:33:54 d2.evaluation.evaluator]: Inference done 193/625. 0.3030 s / img. ETA=0:02:14
[05/25 13:34:00 d2.evaluation.evaluator]: Inference done 210/625. 0.3029 s / img. ETA=0:02:08
[05/25 13:34:05 d2.evaluation.evaluator]: Inference done 226/625. 0.3032 s / img. ETA=0:02:04
[05/25 13:34:10 d2.evaluation.evaluator]: Inference done 242/625. 0.3033 s / img. ETA=0:01:59
[05/25 13:34:15 d2.evaluation.evaluator]: Inference done 259/625. 0.3029 s / img. ETA=0:01:53
[05/25 13:34:20 d2.evaluation.evaluator]: Inference done 275/625. 0.3031 s / img. ETA=0:01:48
[05/25 13:34:25 d2.evaluation.evaluator]: Inference done 292/625. 0.3029 s / img. ETA=0:01:43
[05/25 13:34:31 d2.evaluation.evaluator]: Inference done 309/625. 0.3028 s / img. ETA=0:01:38
[05/25 13:34:36 d2.evaluation.evaluator]: Inference done 326/625. 0.3027 s / img. ETA=0:01:32
[05/25 13:34:41 d2.evaluation.evaluator]: Inference done 342/625. 0.3028 s / img. ETA=0:01:27
[05/25 13:34:46 d2.evaluation.evaluator]: Inference done 359/625. 0.3026 s / img. ETA=0:01:22
[05/25 13:34:51 d2.evaluation.evaluator]: Inference done 376/625. 0.3022 s / img. ETA=0:01:17
[05/25 13:34:56 d2.evaluation.evaluator]: Inference done 393/625. 0.3021 s / img. ETA=0:01:11
[05/25 13:35:02 d2.evaluation.evaluator]: Inference done 410/625. 0.3022 s / img. ETA=0:01:06
[05/25 13:35:07 d2.evaluation.evaluator]: Inference done 426/625. 0.3024 s / img. ETA=0:01:01
[05/25 13:35:12 d2.evaluation.evaluator]: Inference done 443/625. 0.3018 s / img. ETA=0:00:56
[05/25 13:35:17 d2.evaluation.evaluator]: Inference done 460/625. 0.3018 s / img. ETA=0:00:51
[05/25 13:35:22 d2.evaluation.evaluator]: Inference done 477/625. 0.3016 s / img. ETA=0:00:45
[05/25 13:35:27 d2.evaluation.evaluator]: Inference done 493/625. 0.3020 s / img. ETA=0:00:40
[05/25 13:35:33 d2.evaluation.evaluator]: Inference done 510/625. 0.3020 s / img. ETA=0:00:35
[05/25 13:35:38 d2.evaluation.evaluator]: Inference done 527/625. 0.3019 s / img. ETA=0:00:30
[05/25 13:35:43 d2.evaluation.evaluator]: Inference done 543/625. 0.3022 s / img. ETA=0:00:25
[05/25 13:35:48 d2.evaluation.evaluator]: Inference done 560/625. 0.3021 s / img. ETA=0:00:20
[05/25 13:35:53 d2.evaluation.evaluator]: Inference done 577/625. 0.3018 s / img. ETA=0:00:14
[05/25 13:35:58 d2.evaluation.evaluator]: Inference done 593/625. 0.3019 s / img. ETA=0:00:09
[05/25 13:36:03 d2.evaluation.evaluator]: Inference done 610/625. 0.3018 s / img. ETA=0:00:04
[05/25 13:36:08 d2.evaluation.evaluator]: Total inference time: 0:03:12.073198 (0.309795 s / img per device, on 8 devices)
[05/25 13:36:08 d2.evaluation.evaluator]: Total inference pure compute time: 0:03:06 (0.301541 s / img per device, on 8 devices)
[05/25 13:36:08 d2.evaluation.coco_evaluation]: Preparing results for COCO format ...
[05/25 13:36:08 d2.evaluation.coco_evaluation]: Saving results to output/benchmarking_mask_rcnn_base_FPN_100ep_LSJ_mae/coco_instances_results.json
[05/25 13:36:08 d2.evaluation.coco_evaluation]: Evaluating predictions with unofficial COCO API...
WARNING [05/25 13:36:08 d2.evaluation.coco_evaluation]: No predictions from the model!
[05/25 13:36:08 d2.evaluation.testing]: copypaste: Task: bbox
[05/25 13:36:08 d2.evaluation.testing]: copypaste: AP,AP50,AP75,APs,APm,APl
[05/25 13:36:08 d2.evaluation.testing]: copypaste: nan,nan,nan,nan,nan,nan
[05/25 13:36:15 d2.utils.events]: eta: 1 day, 16:30:37 iter: 5539 total_loss: 1.138 loss_cls: 0.2928 loss_box_reg: 0.2585 loss_mask: 0.3725 loss_rpn_cls: 0.06554 loss_rpn_loc: 0.1448 time: 0.8175 data_time: 0.0220 lr: 1.9955e-05 max_mem: 26732M
[05/25 13:36:31 d2.utils.events]: eta: 1 day, 16:30:20 iter: 5559 total_loss: 1.207 loss_cls: 0.3106 loss_box_reg: 0.2719 loss_mask: 0.3847 loss_rpn_cls: 0.06758 loss_rpn_loc: 0.1353 time: 0.8175 data_time: 0.0225 lr: 1.9955e-05 max_mem: 26732M
```
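For reference, this is roughly how I scaled the config down for 8 GPUs. It's only a sketch: the field names follow detectron2's lazy-config conventions and the config path is a placeholder, so double-check both against the actual 100ep config.

```python
# Sketch of the 8-GPU overrides (assumed lazy-config field names).
from detectron2.config import LazyConfig

cfg = LazyConfig.load("configs/mask_rcnn_base_FPN_100ep_LSJ_mae.py")  # placeholder path

# Linear scaling rule: 4x fewer GPUs -> 4x smaller global batch and base LR,
# and 4x more iterations so the run still covers 100 epochs.
cfg.dataloader.train.total_batch_size //= 4
cfg.optimizer.lr /= 4
cfg.train.max_iter *= 4
# (The LR-schedule milestones need the same 4x stretch; the exact field
# depends on the scheduler the config uses.)
```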
Has anyone else seen this before? Training continues without any apparent problems after eval, so it doesn't look like a divergence issue.
Thanks!
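Edit: to narrow this down, I plan to run the checkpoint that was just saved over a single val image and inspect the raw instances, to see whether the model genuinely emits nothing or predictions exist but get dropped later. A minimal sketch of what I have in mind, assuming standard detectron2 APIs (the config path is a placeholder; the image is just some val2017 file):

```python
import cv2
import torch
from detectron2.checkpoint import DetectionCheckpointer
from detectron2.config import LazyConfig, instantiate

cfg = LazyConfig.load("configs/mask_rcnn_base_FPN_100ep_LSJ_mae.py")  # placeholder path
model = instantiate(cfg.model).cuda().eval()
DetectionCheckpointer(model).load(
    "output/benchmarking_mask_rcnn_base_FPN_100ep_LSJ_mae/model_0005531.pth"
)

img = cv2.imread("datasets/coco/val2017/000000000139.jpg")  # any val image
h, w = img.shape[:2]
# Mirror the eval augmentations above: the model sees fixed 1024x1024 inputs.
img = cv2.resize(img, (1024, 1024)).astype("float32")
inputs = {
    "image": torch.as_tensor(img.transpose(2, 0, 1)),  # HWC -> CHW; check the expected BGR/RGB order
    "height": h,  # original size, used to rescale the outputs
    "width": w,
}
with torch.no_grad():
    instances = model([inputs])[0]["instances"]
print(len(instances), instances.scores[:5] if len(instances) else "no detections")
```

If this prints nonzero detections, the problem would be somewhere after the model forward (e.g. the distributed gather or the results dump) rather than in the model itself.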
What environment should I use? The environment in the comment you linked differs from the one suggested in this repo's README.
I don't think torch/d2 versions are the cause of this: running evaluation on downloaded weights does give predictions, and the results are in line with the reported numbers. Roughly what I ran for that check is below.
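A sketch of that eval-only check, following the usual detectron2 lazy-config evaluation loop (the config and weights paths are placeholders):

```python
from detectron2.checkpoint import DetectionCheckpointer
from detectron2.config import LazyConfig, instantiate
from detectron2.evaluation import inference_on_dataset

cfg = LazyConfig.load("configs/mask_rcnn_base_FPN_100ep_LSJ_mae.py")  # placeholder path
model = instantiate(cfg.model).cuda().eval()
DetectionCheckpointer(model).load("downloaded_weights.pth")  # placeholder

# Same loop the training script runs at eval time: test loader + COCO evaluator.
results = inference_on_dataset(
    model,
    instantiate(cfg.dataloader.test),
    instantiate(cfg.dataloader.evaluator),
)
print(results)
```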