ZeroDivisionError: float division by zero #52

Open
xingkongliang opened this issue Dec 29, 2021 · 0 comments

Describe the bug

2021-12-29 08:46:30.550 Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5e-324
2021-12-29 08:46:30.551 Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5e-324
2021-12-29 08:46:31.863 Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0
2021-12-29 08:46:31.863 Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0
2021-12-29 08:46:31.864 Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0
2021-12-29 08:46:31.864 Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0
2021-12-29 08:46:31.864 Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0
2021-12-29 08:46:31.864 Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0
2021-12-29 08:46:31.864 Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0
2021-12-29 08:46:31.865 Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0

2021-12-29 08:46:39.965 File "./tools/train.py", line 188, in <module>
2021-12-29 08:46:39.965 main()
2021-12-29 08:46:39.965 File "./tools/train.py", line 177, in main
2021-12-29 08:46:39.965 train_detector(
2021-12-29 08:46:39.965 File "mmdet/apis/train.py", line 186, in train_detector
2021-12-29 08:46:39.965 runner.run(data_loaders, cfg.workflow)
2021-12-29 08:46:39.965 File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
2021-12-29 08:46:39.965 epoch_runner(data_loaders[i], **kwargs)
2021-12-29 08:46:39.965 File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 51, in train
2021-12-29 08:46:39.965 self.call_hook('after_train_iter')
2021-12-29 08:46:39.965 File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/base_runner.py", line 307, in call_hook
2021-12-29 08:46:39.966 getattr(hook, fn_name)(self)
2021-12-29 08:46:39.966 File "mmdet/utils/optimizer.py", line 26, in after_train_iter
2021-12-29 08:46:39.966 scaled_loss.backward()
2021-12-29 08:46:39.966 File "/opt/conda/lib/python3.8/contextlib.py", line 120, in __exit__
2021-12-29 08:46:39.966 next(self.gen)
2021-12-29 08:46:39.966 File "/opt/conda/lib/python3.8/site-packages/apex/amp/handle.py", line 123, in scale_loss
2021-12-29 08:46:39.966 optimizer._post_amp_backward(loss_scaler)
2021-12-29 08:46:39.966 File "/opt/conda/lib/python3.8/site-packages/apex/amp/_process_optimizer.py", line 249, in post_backward_no_master_weights
2021-12-29 08:46:39.966 post_backward_models_are_masters(scaler, params, stashed_grads)
2021-12-29 08:46:39.966 File "/opt/conda/lib/python3.8/site-packages/apex/amp/_process_optimizer.py", line 131, in post_backward_models_are_masters
2021-12-29 08:46:39.966 scaler.unscale_with_stashed(
2021-12-29 08:46:39.966 File "/opt/conda/lib/python3.8/site-packages/apex/amp/scaler.py", line 176, in unscale_with_stashed
2021-12-29 08:46:39.966 out_scale/grads_have_scale, # 1./scale,
2021-12-29 08:46:39.966 ZeroDivisionError: float division by zero
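For reference, the log and traceback above are consistent with the dynamic loss scale being halved on every overflow until it underflows: `5e-324` is the smallest positive double, and halving it once more gives exactly `0.0`, after which `out_scale/grads_have_scale` divides by zero. Below is a minimal pure-Python sketch of that arithmetic. It does not call apex; the starting scale of `2.0**16` and the helper names (`halvings_until_zero`, `unscale_factor`, `halve_with_floor`) are assumptions made for illustration only.

```python
def halvings_until_zero(scale: float) -> int:
    """Count how many times `scale` can be halved before it is exactly 0.0."""
    steps = 0
    while scale != 0.0:
        scale /= 2.0  # stand-in for what a dynamic scaler does on each overflow
        steps += 1
    return steps


def unscale_factor(out_scale: float, grads_have_scale: float) -> float:
    """Same expression shape as the failing line in the traceback."""
    return out_scale / grads_have_scale


def halve_with_floor(scale: float, floor: float = 1.0) -> float:
    """Hypothetical guard: never let the scale drop below `floor`."""
    return max(scale / 2.0, floor)


if __name__ == "__main__":
    # Assumed initial scale of 2**16; the real value depends on the config.
    print(halvings_until_zero(2.0 ** 16))

    # The smallest positive double underflows to zero when halved.
    print(5e-324 / 2.0)

    try:
        unscale_factor(1.0, 0.0)
    except ZeroDivisionError as exc:
        print("ZeroDivisionError:", exc)

    # With a floor, the scale stays usable no matter how many overflows occur.
    scale = 2.0 ** 16
    for _ in range(2000):
        scale = halve_with_floor(scale)
    print(scale)
```

If this is what is happening, the division error is only a symptom: over a thousand consecutive overflowing steps suggest the gradients are already inf/NaN (for example, a diverged loss), so the scale can never recover. As far as I know, `amp.initialize` in apex accepts a `min_loss_scale` argument that would act like the floor above, but that would only hide the crash, not fix the underlying divergence.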

Reproduction

  1. What command or script did you run?
    Training with configs/cbnet/htc_cbv2_swin_base_patch4_window7_mstrain_400-1400_giou_4conv1f_adamw_20e_coco.py

  2. Did you make any modifications on the code or config? Did you understand what you have modified?
    No

  3. What dataset did you use?
    COCO

Environment

2021-12-28 02:35:44,932 - mmdet - INFO - Environment info:
------------------------------------------------------------
sys.platform: linux
Python: 3.8.5 (default, Sep  4 2020, 07:30:14) [GCC 7.3.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: Tesla V100-SXM2-32GB
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.0_bu.TC445_37.28845127_0
GCC: gcc (GCC) 8.3.1 20191121 (Red Hat 8.3.1-5)
PyTorch: 1.7.0+cu110
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v1.6.0 (Git Hash 5ef631a030a6f73131c77892041042805a06064f)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.0
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80
  - CuDNN 8.0.4
  - Magma 2.5.2
  - Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, 

TorchVision: 0.8.1+cu110
OpenCV: 4.5.1
MMCV: 1.3.18
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.0
MMDetection: 2.14.0+
------------------------------------------------------------