
Assertion input_val >= zero && input_val <= one failed. #60
Open · YU2024DW opened this issue Sep 3, 2024 · 1 comment

YU2024DW commented Sep 3, 2024

Hi @magehrig,
When training the base RVT model, it suddenly throws the error below partway through the fourth epoch. I have tried several times and it happens every time; the tiny and small models train without problems. How can I resolve this?

Epoch 3: : 67190it [6:05:52, 3.06it/s, loss=2.15, v_num=m9eu]/opt/conda/conda-bld/pytorch_1678402412426/work/aten/src/ATen/native/cuda/Loss.cu:92: operator(): block: [0,0,0], thread: [32,0,0] Assertion input_val >= zero && input_val <= one failed.
[... the same assertion is repeated for threads [0,0,0] through [53,0,0] of block [0,0,0] ...]
Error executing job with overrides: ['model=rnndet', 'dataset=gen1', 'dataset.path=/media/pe/5fe0ba86-cd64-483b-bfc5-dd83088ea652/projects/zxy/RVT-master/data_dir', 'wandb.project_name=RVT', 'wandb.group_name=gen1', '+experiment/gen1=base.yaml', 'hardware.gpus=1', 'batch_size.train=8', 'batch_size.eval=8', 'hardware.num_workers.train=6', 'hardware.num_workers.eval=2']
Traceback (most recent call last):
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1098, in _run
results = self._run_stage()
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1177, in _run_stage
self._run_train()
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1200, in _run_train
self.fit_loop.run()
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 214, in advance
batch_output = self.batch_loop.run(kwargs)
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
outputs = self.optimizer_loop.run(optimizers, kwargs)
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 200, in advance
result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 247, in _run_optimization
self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure)
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 357, in _optimizer_step
self.trainer._call_lightning_module_hook(
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1342, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/core/module.py", line 1661, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py", line 169, in step
step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 234, in optimizer_step
return self.precision_plugin.optimizer_step(
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/native_amp.py", line 85, in optimizer_step
closure_result = closure()
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 147, in call
self._result = self.closure(*args, **kwargs)
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 133, in closure
step_output = self._step_fn()
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 406, in _training_step
training_step_output = self.trainer._call_strategy_hook("training_step", *kwargs.values())
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1480, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 378, in training_step
return self.model.training_step(*args, **kwargs)
File "/media/pe/5fe0ba86-cd64-483b-bfc5-dd83088ea652/projects/zxy/RVT-new/RVT-master/modules/detection.py", line 166, in training_step
predictions, losses = self.mdl.forward_detect(backbone_features=selected_backbone_features,
File "/media/pe/5fe0ba86-cd64-483b-bfc5-dd83088ea652/projects/zxy/RVT-new/RVT-master/models/detection/yolox_extension/models/detector.py", line 53, in forward_detect
outputs, losses = self.yolox_head(fpn_features, targets)
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/media/pe/5fe0ba86-cd64-483b-bfc5-dd83088ea652/projects/zxy/RVT-new/RVT-master/models/detection/yolox/models/yolo_head.py", line 220, in forward
losses = self.get_losses(
File "/media/pe/5fe0ba86-cd64-483b-bfc5-dd83088ea652/projects/zxy/RVT-new/RVT-master/models/detection/yolox/models/yolo_head.py", line 345, in get_losses
) = self.get_assignments( # noqa
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/media/pe/5fe0ba86-cd64-483b-bfc5-dd83088ea652/projects/zxy/RVT-new/RVT-master/models/detection/yolox/models/yolo_head.py", line 526, in get_assignments
) = self.simota_matching(cost, pair_wise_ious, gt_classes, num_gt, fg_mask)
File "/media/pe/5fe0ba86-cd64-483b-bfc5-dd83088ea652/projects/zxy/RVT-new/RVT-master/models/detection/yolox/models/yolo_head.py", line 581, in simota_matching
_, pos_idx = torch.topk(
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/media/pe/5fe0ba86-cd64-483b-bfc5-dd83088ea652/projects/zxy/RVT-new/RVT-master/train.py", line 139, in main
trainer.fit(model=module, ckpt_path=ckpt_path, datamodule=data_module)
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
call._call_and_handle_interrupt(
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 63, in _call_and_handle_interrupt
trainer._teardown()
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1161, in _teardown
self.strategy.teardown()
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 492, in teardown
_optimizers_to_device(self.optimizers, torch.device("cpu"))
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/lightning_lite/utilities/optimizer.py", line 28, in _optimizers_to_device
_optimizer_to_device(opt, device)
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/lightning_lite/utilities/optimizer.py", line 34, in _optimizer_to_device
optimizer.state[p] = apply_to_collection(v, Tensor, move_data_to_device, device)
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/lightning_utilities/core/apply_func.py", line 70, in apply_to_collection
return {k: function(v, *args, **kwargs) for k, v in data.items()}
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/lightning_utilities/core/apply_func.py", line 70, in
return {k: function(v, *args, **kwargs) for k, v in data.items()}
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/lightning_lite/utilities/apply_func.py", line 101, in move_data_to_device
return apply_to_collection(batch, dtype=_TransferableDataType, function=batch_to)
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/lightning_utilities/core/apply_func.py", line 64, in apply_to_collection
return function(data, *args, **kwargs)
File "/home/pe/anaconda3/envs/rvt/lib/python3.9/site-packages/lightning_lite/utilities/apply_func.py", line 95, in batch_to
data_output = data.to(device, **kwargs)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Looking forward to your reply, thank you very much.
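
For context: the failing check at ATen/native/cuda/Loss.cu:92 is the input range check inside PyTorch's binary_cross_entropy CUDA kernel, which requires every input value to lie in [0, 1]. A NaN or out-of-range prediction reaching that call (for example after a numerical overflow under mixed precision) trips the assertion; the torch.topk in simota_matching is likely only where the asynchronously reported error surfaces. Running once with CUDA_LAUNCH_BLOCKING=1 should point at the real call site. Below is a minimal sketch of the failure mode and a temporary guard that could be placed in front of the suspect binary_cross_entropy call; the helper name is illustrative and not part of RVT:

import torch
import torch.nn.functional as F

def check_bce_input(x: torch.Tensor, name: str = "bce_input") -> None:
    # Temporary debugging guard: raise a readable Python error instead of a
    # device-side assert when values are NaN/Inf or outside [0, 1].
    bad = ~torch.isfinite(x) | (x < 0) | (x > 1)
    if bad.any():
        raise RuntimeError(f"{name}: {int(bad.sum())} values outside [0, 1]")

# Reproduces the same device-side assertion (kept commented out on purpose,
# because it would poison the CUDA context for the rest of the process):
# preds = torch.tensor([0.3, float("nan"), 1.2], device="cuda")
# targets = torch.zeros(3, device="cuda")
# F.binary_cross_entropy(preds, targets)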

magehrig (Contributor) commented:
I have never encountered this. Can you provide the list of installed packages?
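
(For reference, such a list can be produced from the activated rvt environment with, e.g., pip freeze > packages.txt or conda list --export > packages.txt.)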
