
Training Error assert (boxes1[:, 2:] >= boxes1[:, :2]).all() #17

Open
Kyfafyd opened this issue Mar 24, 2022 · 2 comments

Kyfafyd commented Mar 24, 2022

Instructions To Reproduce the 🐛 Bug:

  1. what changes you made (git diff) or what code you wrote: no changes
  2. what exact command you run: python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --coco_path ../data/COCO2017 --output_dir output/conddetr_r50_epoch50
  3. what you observed (including full logs):
| distributed init (rank 2): env://
| distributed init (rank 0): env://
| distributed init (rank 4): env://
| distributed init (rank 3): env://
| distributed init (rank 5): env://
| distributed init (rank 1): env://
| distributed init (rank 7): env://
| distributed init (rank 6): env://
fatal: Not a git repository (or any parent up to mount point /research/d4)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
fatal: Not a git repository (or any parent up to mount point /research/d4)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
fatal: Not a git repository (or any parent up to mount point /research/d4)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
fatal: Not a git repository (or any parent up to mount point /research/d4)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
fatal: Not a git repository (or any parent up to mount point /research/d4)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
fatal: Not a git repository (or any parent up to mount point /research/d4)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
fatal: Not a git repository (or any parent up to mount point /research/d4)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
git:
  sha: N/A, status: clean, branch: N/A

fatal: Not a git repository (or any parent up to mount point /research/d4)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
Namespace(aux_loss=True, backbone='resnet50', batch_size=2, bbox_loss_coef=5, clip_max_norm=0.1, cls_loss_coef=2, coco_panoptic_path=None, coco_path='../data/COCO2017', dataset_file='coco', dec_layers=6, device='cuda', dice_loss_coef=1, dilation=False, dim_feedforward=2048, dist_backend='nccl', dist_url='env://', distributed=True, dropout=0.1, enc_layers=6, epochs=50, eval=False, focal_alpha=0.25, frozen_weights=None, giou_loss_coef=2, gpu=0, hidden_dim=256, lr=0.0001, lr_backbone=1e-05, lr_drop=40, mask_loss_coef=1, masks=False, nheads=8, num_queries=300, num_workers=2, output_dir='output/conddetr_r50_epoch50', position_embedding='sine', pre_norm=False, rank=0, remove_difficult=False, resume='', seed=42, set_cost_bbox=5, set_cost_class=2, set_cost_giou=2, start_epoch=0, weight_decay=0.0001, world_size=8)
number of params: 43196001
loading annotations into memory...
Done (t=20.78s)
creating index...
index created!
loading annotations into memory...
Done (t=0.56s)
creating index...
index created!
Start training
Epoch: [0]  [   0/7393]  eta: 7:05:21  lr: 0.000100  class_error: 85.57  loss: 45.1821 (45.1821)  loss_bbox: 3.7751 (3.7751)  loss_bbox_0: 3.7823 (3.7823)  loss_bbox_1: 3.7808 (3.7808)  loss_bbox_2: 3.7756 (3.7756)  loss_bbox_3: 3.7911 (3.7911)  loss_bbox_4: 3.7856 (3.7856)  loss_ce: 1.9574 (1.9574)  loss_ce_0: 2.0151 (2.0151)  loss_ce_1: 2.0196 (2.0196)  loss_ce_2: 2.1484 (2.1484)  loss_ce_3: 2.0683 (2.0683)  loss_ce_4: 2.0683 (2.0683)  loss_giou: 1.7011 (1.7011)  loss_giou_0: 1.7000 (1.7000)  loss_giou_1: 1.7040 (1.7040)  loss_giou_2: 1.7059 (1.7059)  loss_giou_3: 1.7022 (1.7022)  loss_giou_4: 1.7012 (1.7012)  cardinality_error_unscaled: 293.1250 (293.1250)  cardinality_error_0_unscaled: 293.1250 (293.1250)  cardinality_error_1_unscaled: 293.1250 (293.1250)  cardinality_error_2_unscaled: 281.9375 (281.9375)  cardinality_error_3_unscaled: 293.1250 (293.1250)  cardinality_error_4_unscaled: 293.1250 (293.1250)  class_error_unscaled: 85.5712 (85.5712)  loss_bbox_unscaled: 0.7550 (0.7550)  loss_bbox_0_unscaled: 0.7565 (0.7565)  loss_bbox_1_unscaled: 0.7562 (0.7562)  loss_bbox_2_unscaled: 0.7551 (0.7551)  loss_bbox_3_unscaled: 0.7582 (0.7582)  loss_bbox_4_unscaled: 0.7571 (0.7571)  loss_ce_unscaled: 0.9787 (0.9787)  loss_ce_0_unscaled: 1.0076 (1.0076)  loss_ce_1_unscaled: 1.0098 (1.0098)  loss_ce_2_unscaled: 1.0742 (1.0742)  loss_ce_3_unscaled: 1.0341 (1.0341)  loss_ce_4_unscaled: 1.0342 (1.0342)  loss_giou_unscaled: 0.8506 (0.8506)  loss_giou_0_unscaled: 0.8500 (0.8500)  loss_giou_1_unscaled: 0.8520 (0.8520)  loss_giou_2_unscaled: 0.8530 (0.8530)  loss_giou_3_unscaled: 0.8511 (0.8511)  loss_giou_4_unscaled: 0.8506 (0.8506)  time: 3.4521  data: 0.4687  max mem: 2932
Epoch: [0]  [ 100/7393]  eta: 1:17:39  lr: 0.000100  class_error: 85.74  loss: 28.2629 (33.7855)  loss_bbox: 1.5517 (2.3437)  loss_bbox_0: 1.5566 (2.3695)  loss_bbox_1: 1.5482 (2.3519)  loss_bbox_2: 1.5535 (2.3396)  loss_bbox_3: 1.5641 (2.3476)  loss_bbox_4: 1.5637 (2.3431)  loss_ce: 1.5467 (1.6584)  loss_ce_0: 1.5650 (1.6414)  loss_ce_1: 1.5443 (1.6461)  loss_ce_2: 1.5557 (1.6477)  loss_ce_3: 1.5392 (1.6545)  loss_ce_4: 1.5541 (1.6667)  loss_giou: 1.5534 (1.6289)  loss_giou_0: 1.5514 (1.6296)  loss_giou_1: 1.5541 (1.6292)  loss_giou_2: 1.5695 (1.6291)  loss_giou_3: 1.5526 (1.6289)  loss_giou_4: 1.5519 (1.6296)  cardinality_error_unscaled: 293.1875 (293.2420)  cardinality_error_0_unscaled: 293.1875 (293.2420)  cardinality_error_1_unscaled: 293.1875 (293.2420)  cardinality_error_2_unscaled: 293.1875 (293.1312)  cardinality_error_3_unscaled: 293.1875 (293.2420)  cardinality_error_4_unscaled: 293.1875 (293.1658)  class_error_unscaled: 75.6680 (75.4478)  loss_bbox_unscaled: 0.3103 (0.4687)  loss_bbox_0_unscaled: 0.3113 (0.4739)  loss_bbox_1_unscaled: 0.3096 (0.4704)  loss_bbox_2_unscaled: 0.3107 (0.4679)  loss_bbox_3_unscaled: 0.3128 (0.4695)  loss_bbox_4_unscaled: 0.3127 (0.4686)  loss_ce_unscaled: 0.7733 (0.8292)  loss_ce_0_unscaled: 0.7825 (0.8207)  loss_ce_1_unscaled: 0.7722 (0.8231)  loss_ce_2_unscaled: 0.7779 (0.8239)  loss_ce_3_unscaled: 0.7696 (0.8272)  loss_ce_4_unscaled: 0.7770 (0.8334)  loss_giou_unscaled: 0.7767 (0.8145)  loss_giou_0_unscaled: 0.7757 (0.8148)  loss_giou_1_unscaled: 0.7771 (0.8146)  loss_giou_2_unscaled: 0.7847 (0.8146)  loss_giou_3_unscaled: 0.7763 (0.8144)  loss_giou_4_unscaled: 0.7760 (0.8148)  time: 0.6098  data: 0.0105  max mem: 4353
Traceback (most recent call last):
  File "main.py", line 258, in <module>
    main(args)
  File "main.py", line 206, in main
    train_stats = train_one_epoch(
  File "/research/d4/gds/zwang21/ConditionalDETR/engine.py", line 41, in train_one_epoch
    loss_dict = criterion(outputs, targets)
  File "/research/d4/gds/zwang21/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/research/d4/gds/zwang21/ConditionalDETR/models/conditional_detr.py", line 254, in forward
    indices = self.matcher(outputs_without_aux, targets)
  File "/research/d4/gds/zwang21/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/research/d4/gds/zwang21/anaconda3/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/research/d4/gds/zwang21/ConditionalDETR/models/matcher.py", line 79, in forward
    cost_giou = -generalized_box_iou(box_cxcywh_to_xyxy(out_bbox), box_cxcywh_to_xyxy(tgt_bbox))
  File "/research/d4/gds/zwang21/ConditionalDETR/util/box_ops.py", line 59, in generalized_box_iou
    assert (boxes1[:, 2:] >= boxes1[:, :2]).all()
AssertionError
Traceback (most recent call last):
  File "/research/d4/gds/zwang21/anaconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/research/d4/gds/zwang21/anaconda3/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/research/d4/gds/zwang21/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/research/d4/gds/zwang21/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/research/d4/gds/zwang21/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/research/d4/gds/zwang21/anaconda3/bin/python', '-u', 'main.py', '--coco_path', '../data/COCO2017', '--output_dir', 'output/conddetr_r50_epoch50']' returned non-zero exit status 1.
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Killing subprocess 29668
Killing subprocess 29669
Killing subprocess 29670
Killing subprocess 29671
Killing subprocess 29672
Killing subprocess 29673
Killing subprocess 29674
Killing subprocess 29675
  4. please simplify the steps as much as possible so they do not require additional resources to run, such as a private dataset.

Expected behavior:

If there are no obvious errors in "what you observed" provided above, please tell us the expected behavior.

Environment:

Environment information (collected with python -m torch.utils.collect_env):

Collecting environment information...
PyTorch version: 1.8.0
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: CentOS Linux release 7.9.2009 (Core) (x86_64)
GCC version: (GCC) 11.2.0
Clang version: Could not collect
CMake version: version 2.8.12.2

Python version: 3.8 (64-bit runtime)
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.22.2
[pip3] numpydoc==1.1.0
[pip3] pytorch-ignite==0.2.0
[pip3] pytorch-metric-learning==0.9.99
[pip3] torch==1.8.0
[pip3] torchaudio==0.8.0a0+a751e1d
[pip3] torchfile==0.1.0
[pip3] torchsampler==0.1.1
[pip3] torchsummary==1.5.1
[pip3] torchvision==0.9.0
[conda] blas                      1.0                         mkl  
[conda] cudatoolkit               10.2.89              hfd86e86_1  
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] mkl                       2021.2.0           h06a4308_296  
[conda] mkl-service               2.3.0            py38h27cfd23_1  
[conda] mkl_fft                   1.3.0            py38h42c9631_2  
[conda] mkl_random                1.2.1            py38ha9443f7_2  
[conda] numpy                     1.22.2                   pypi_0    pypi
[conda] numpydoc                  1.1.0              pyhd3eb1b0_1  
[conda] pytorch                   1.8.0           py3.8_cuda10.2_cudnn7.6.5_0    pytorch
[conda] pytorch-ignite            0.2.0                    pypi_0    pypi
[conda] pytorch-metric-learning   0.9.99                   pypi_0    pypi
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] torch                     1.10.0                   pypi_0    pypi
[conda] torchaudio                0.8.0                      py38    pytorch
[conda] torchfile                 0.1.0                    pypi_0    pypi
[conda] torchsampler              0.1.1                    pypi_0    pypi
[conda] torchsummary              1.5.1                    pypi_0    pypi
[conda] torchvision               0.9.0                py38_cu102    pytorch
DeppMeng (Collaborator) commented Apr 1, 2022

Sorry, but we never encountered this error. It indicates that the predicted boxes have a negative width or height, which should not happen: the predicted (cx, cy, w, h) are fed through a sigmoid, so all w and h should be in the range [0, 1].
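For reference, a minimal sketch of the conversion that feeds the failing assert (mirroring the DETR-style box_cxcywh_to_xyxy in util/box_ops.py). With sigmoid outputs the width and height are non-negative, so the check can only trip if the predictions are non-finite, because any comparison against NaN evaluates to False:

```python
import torch

def box_cxcywh_to_xyxy(x):
    # (cx, cy, w, h) -> (x1, y1, x2, y2); x2 >= x1 and y2 >= y1 hold iff w, h >= 0
    x_c, y_c, w, h = x.unbind(-1)
    return torch.stack([x_c - 0.5 * w, y_c - 0.5 * h,
                        x_c + 0.5 * w, y_c + 0.5 * h], dim=-1)

# Sigmoid outputs lie in (0, 1), so widths/heights are non-negative and the
# assert in generalized_box_iou passes:
boxes = box_cxcywh_to_xyxy(torch.randn(4, 4).sigmoid())
print((boxes[:, 2:] >= boxes[:, :2]).all())   # tensor(True)

# If the predictions contain NaN (e.g. after the loss blows up), every
# comparison with NaN is False and the same assert fails:
bad = box_cxcywh_to_xyxy(torch.full((1, 4), float('nan')))
print((bad[:, 2:] >= bad[:, :2]).all())       # tensor(False)
```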

@RicePasteM

Encountered the same issue, and solved it by using only 4 GPUs. Maybe it's caused by AMP or an internal bug in distributed training.
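If someone wants to confirm that theory, here is a hypothetical guard (the helper name and placement are made up, not part of the repo) that could be dropped into matcher.py right before the cost_giou line to report non-finite predictions instead of dying on the assert:

```python
import torch

def check_finite_boxes(out_bbox):
    """Raise a descriptive error if any predicted box contains NaN/Inf."""
    if not torch.isfinite(out_bbox).all():
        n_bad = (~torch.isfinite(out_bbox)).any(dim=-1).sum().item()
        raise RuntimeError(
            f"{n_bad} predicted boxes are NaN/Inf; the loss has likely "
            "diverged (try a lower LR, fewer GPUs, or disabling AMP)."
        )
```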
