RuntimeError: Exception thrown from user defined Python function in dataset. #370

Open · waangyuhai opened this issue Oct 12, 2024 · 6 comments

@waangyuhai

RuntimeError: Exception thrown from user defined Python function in dataset.

@Hardy-Chung commented Oct 14, 2024

I get the same error.

Problem description

MindYOLO reports this error with the official SHWD dataset; training on the coco2017 dataset fails with a similar error.

Environment

Linux
RTX 4090 GPU
CUDA 11.6
MindSpore 2.3.1
MindYOLO 0.4.0

Error log

Traceback (most recent call last):
  File "/data/zhongzhijia/code/mindspore/mindyolo/train.py", line 320, in <module>
    train(args)
  File "/data/zhongzhijia/code/mindspore/mindyolo/train.py", line 275, in train
    trainer.train(
  File "/data/zhongzhijia/code/mindspore/mindyolo/mindyolo/utils/trainer_factory.py", line 153, in train
    for i, data in enumerate(loader):
  File "/root/miniconda3/envs/mindspore/lib/python3.9/site-packages/mindspore/dataset/engine/iterators.py", line 152, in __next__
    data = self._get_next()
  File "/root/miniconda3/envs/mindspore/lib/python3.9/site-packages/mindspore/dataset/engine/iterators.py", line 277, in _get_next
    raise err
  File "/root/miniconda3/envs/mindspore/lib/python3.9/site-packages/mindspore/dataset/engine/iterators.py", line 260, in _get_next
    return {k: self._transform_md_to_output(t) for k, t in self._iterator.GetNextAsMap().items()}
RuntimeError: Exception thrown from user defined Python function in dataset.

------------------------------------------------------------------
- Python Call Stack:
------------------------------------------------------------------
Traceback (most recent call last):
  File "/root/miniconda3/envs/mindspore/lib/python3.9/site-packages/mindspore/dataset/engine/datasets_user_defined.py", line 312, in process
    result.reraise()
  File "/root/miniconda3/envs/mindspore/lib/python3.9/site-packages/mindspore/dataset/core/py_util_helpers.py", line 65, in reraise
    raise self.except_type(err_msg)
ValueError: Traceback (most recent call last):
  File "/root/miniconda3/envs/mindspore/lib/python3.9/site-packages/mindspore/dataset/engine/datasets_user_defined.py", line 539, in _generator_worker_loop
    result = dataset[idx]
  File "/data/zhongzhijia/code/mindspore/mindyolo/mindyolo/data/dataset.py", line 313, in __getitem__
    sample = getattr(self, func_name)(sample, **_trans)
  File "/data/zhongzhijia/code/mindspore/mindyolo/mindyolo/data/dataset.py", line 1076, in label_pad
    cls_pad[:min(nL, padding_size)] = cls[:min(nL, padding_size)]
ValueError: could not broadcast input array from shape (15,) into shape (15,1)

------------------------------------------------------------------
- Dataset Pipeline Error Message:
------------------------------------------------------------------
[ERROR] Execute user Python code failed, check 'Python Call Stack' above.

------------------------------------------------------------------
- C++ Call Stack: (For framework developers)
------------------------------------------------------------------
mindspore/ccsrc/minddata/dataset/engine/datasetops/source/generator_op.cc(261).
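
The key line is the broadcast failure at the end of the Python call stack: cls is a 1-D array of shape (15,) while cls_pad expects rows of shape (15, 1). A minimal numpy repro (nL and padding_size are example values inferred from the traceback shapes, not MindYOLO's actual defaults):

    import numpy as np

    nL, padding_size = 15, 160          # example values; padding_size is hypothetical
    cls = np.zeros(nL, dtype=np.float32)                        # shape (15,)  -- 1-D
    cls_pad = np.full((padding_size, 1), -1, dtype=np.float32)  # shape (160, 1)
    cls_pad[:min(nL, padding_size)] = cls[:min(nL, padding_size)]
    # ValueError: could not broadcast input array from shape (15,) into shape (15,1)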

Training command

python train.py --config /data/zhongzhijia/code/mindspore/mindyolo/configs/shwd_yolov8n.yaml --device_target=GPU

shwd_yolov8n.yaml

__BASE__: [
  './yolov8/yolov8n.yaml',
]

per_batch_size: 4 # batch size per device; total batch size = per_batch_size * device_num
img_size: 640 # image sizes
weight: /data/zhongzhijia/code/mindspore/mindyolo/configs/yolov8-n_500e_mAP372-cc07f5bd.ckpt
strict_load: False # whether to strictly load the ckpt parameters; defaults to True. If False and the number of classes differs, the weights of the final classifier layer are dropped
log_interval: 10 # print the loss once every log_interval iterations

data:
  dataset_name: shwd
  train_set: /data/zhongzhijia/data/SHWD/train.txt # actual training data path
  val_set: /data/zhongzhijia/data/SHWD/val.txt
  test_set: /data/zhongzhijia/data/SHWD/val.txt
  nc: 2 # number of classes
  # class names
  names: [ 'person',  'hat' ] # name of each class

optimizer:
  lr_init: 0.001  # initial learning rate

@XinhaoLuo666

It looks like MindYOLO hits a dimension error when loading the dataset. I fixed this error by adding a dimension to cls in the label_pad function in dataset.py (the marked line below):

    cls, bboxes = sample['cls'], sample['bboxes']
    cls = np.expand_dims(cls, axis=1)  # add a dimension here: (nL,) -> (nL, 1)
    cls_pad = np.full((padding_size, 1), padding_value, dtype=np.float32)
    bboxes_pad = np.full((padding_size, 4), padding_value, dtype=np.float32)
    nL = len(bboxes)
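
For context, a sketch of how the whole padding step could look with that fix in place (the real label_pad in mindyolo/data/dataset.py may have a different signature and defaults; this only shows where the extra axis matters):

    import numpy as np

    def label_pad(self, sample, padding_size=160, padding_value=-1):
        """Pad per-image labels to a fixed size (sketch; defaults are assumptions)."""
        cls, bboxes = sample['cls'], sample['bboxes']
        if cls.ndim == 1:
            cls = np.expand_dims(cls, axis=1)  # (nL,) -> (nL, 1), matches cls_pad's columns
        cls_pad = np.full((padding_size, 1), padding_value, dtype=np.float32)
        bboxes_pad = np.full((padding_size, 4), padding_value, dtype=np.float32)
        nL = len(bboxes)
        cls_pad[:min(nL, padding_size)] = cls[:min(nL, padding_size)]
        bboxes_pad[:min(nL, padding_size)] = bboxes[:min(nL, padding_size)]
        sample['cls'], sample['bboxes'] = cls_pad, bboxes_pad
        return sample

The ndim guard makes the fix a no-op if the augmentation pipeline already returns cls with shape (nL, 1).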

@waangyuhai (Author)

Thanks for the help, I'll give it a try.

@Hardy-Chung

> It looks like MindYOLO hits a dimension error when loading the dataset. I fixed this error by adding a dimension to cls in the label_pad function in dataset.py (fix quoted above).

That gets one small step further, but training now fails:

2024-10-18 13:44:08,594 [INFO] Epoch 2/500, Step 570/6064, step time: 35.22 ms
2024-10-18 13:44:08,934 [INFO] Epoch 2/500, Step 580/6064, imgsize (640, 640), loss: 11.6492, lbox: 4.8665, lcls: 3.8684, dfl: 2.9143, cur_lr: 0.06384294480085373
2024-10-18 13:44:08,936 [INFO] Epoch 2/500, Step 580/6064, step time: 34.12 ms
2024-10-18 13:44:09,292 [INFO] Epoch 2/500, Step 590/6064, imgsize (640, 640), loss: 12.3432, lbox: 4.9079, lcls: 4.5909, dfl: 2.8444, cur_lr: 0.0637885257601738
2024-10-18 13:44:09,293 [INFO] Epoch 2/500, Step 590/6064, step time: 35.64 ms
[ERROR] RUNTIME_FRAMEWORK(68638,7fea4cbaa700,python):2024-10-18-13:44:09.606.187 [mindspore/ccsrc/runtime/graph_scheduler/actor/actor_common.cc:327] WaitRuntimePipelineFinish] Wait runtime pipeline finish and an error occurred: 
----------------------------------------------------
- cuDNN Error:
----------------------------------------------------
Kernel launch failed | Error Number: 4 CUDNN_STATUS_INTERNAL_ERROR

----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/plugin/device/gpu/kernel/nn/batch_norm_grad_gpu_kernel.cc:207 LaunchKernel

Traceback (most recent call last):
  File "/data/zhongzhijia/code/mindspore/mindyolo/train.py", line 320, in <module>
    train(args)
  File "/data/zhongzhijia/code/mindspore/mindyolo/train.py", line 275, in train
    trainer.train(
  File "/data/zhongzhijia/code/mindspore/mindyolo/mindyolo/utils/trainer_factory.py", line 170, in train
    run_context.loss, run_context.lr = self.train_step(imgs, labels, segments,
  File "/data/zhongzhijia/code/mindspore/mindyolo/mindyolo/utils/trainer_factory.py", line 366, in train_step
    loss, loss_item, _, grads_finite = self.train_step_fn(imgs, labels, True)
  File "/root/miniconda3/envs/mindspore/lib/python3.9/site-packages/mindspore/common/api.py", line 941, in staging_specialize
    out = _MindsporeFunctionExecutor(func, hash_obj, dyn_args, process_obj, jit_config)(*args, **kwargs)
  File "/root/miniconda3/envs/mindspore/lib/python3.9/site-packages/mindspore/common/api.py", line 185, in wrapper
    results = fn(*arg, **kwargs)
  File "/root/miniconda3/envs/mindspore/lib/python3.9/site-packages/mindspore/common/api.py", line 572, in __call__
    output = self._graph_executor(tuple(new_inputs), phase)
RuntimeError: 
----------------------------------------------------
- cuDNN Error:
----------------------------------------------------
Kernel launch failed | Error Number: 4 CUDNN_STATUS_INTERNAL_ERROR

----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/plugin/device/gpu/kernel/nn/batch_norm_grad_gpu_kernel.cc:207 LaunchKernel

[ERROR] DEVICE(68638,7fec40543740,python):2024-10-18-13:45:14.929.981 [mindspore/ccsrc/plugin/device/gpu/hal/device/cuda_driver.cc:209] SyncStream] cudaStreamSynchronize failed, ret[710], device-side assert triggered
[ERROR] DEVICE(68638,7fec40543740,python):2024-10-18-13:45:14.930.026 [mindspore/ccsrc/plugin/device/gpu/hal/device/cuda_driver.cc:211] SyncStream] The kernel name and backtrace in log might be incorrect, since CUDA error might be asynchronously reported at some other function call. Please exporting CUDA_LAUNCH_BLOCKING=1 for more accurate error positioning.
[ERROR] ME(68638,7fec40543740,python):2024-10-18-13:45:14.930.045 [mindspore/ccsrc/runtime/hardware/device_context_manager.cc:490] WaitTaskFinishOnDevice] SyncStream failed
[ERROR] DEVICE(68638,7fec40543740,python):2024-10-18-13:45:15.020.568 [mindspore/ccsrc/plugin/device/gpu/hal/device/cuda_driver.cc:200] DestroyStream] cudaStreamDestroy failed, ret[710], device-side assert triggered
[ERROR] DEVICE(68638,7fec40543740,python):2024-10-18-13:45:15.020.595 [mindspore/ccsrc/plugin/device/gpu/hal/device/gpu_device_manager.cc:70] ReleaseDevice] Op Error: Failed to destroy CUDA stream. | Error Number: 0
[ERROR] DEVICE(68638,7fec40543740,python):2024-10-18-13:45:15.021.051 [mindspore/ccsrc/plugin/device/gpu/hal/device/gpu_device_manager.cc:77] ReleaseDevice] cuDNN Error: Failed to destroy cuDNN handle | Error Number: 4 CUDNN_STATUS_INTERNAL_ERROR
[ERROR] DEVICE(68638,7fec40543740,python):2024-10-18-13:45:15.022.161 [mindspore/ccsrc/plugin/device/gpu/hal/device/cuda_driver.cc:55] FreeDeviceMem] cudaFree failed, ret[710], device-side assert triggered
[ERROR] PRE_ACT(68638,7fec40543740,python):2024-10-18-13:45:15.022.169 [mindspore/ccsrc/backend/common/mem_reuse/mem_dynamic_allocator.cc:937] operator()] Free device memory[0x7fea02000000] error.
[ERROR] DEVICE(68638,7fec40543740,python):2024-10-18-13:45:15.022.184 [mindspore/ccsrc/plugin/device/gpu/hal/device/cuda_driver.cc:55] FreeDeviceMem] cudaFree failed, ret[710], device-side assert triggered
[ERROR] PRE_ACT(68638,7fec40543740,python):2024-10-18-13:45:15.022.189 [mindspore/ccsrc/backend/common/mem_reuse/mem_dynamic_allocator.cc:937] operator()] Free device memory[0x7fe93c000000] error.

Is this running out of memory? I have already set per_batch_size to 1.
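
As the SyncStream log above itself suggests, the CUDA error can be reported asynchronously, so the named kernel (batch_norm_grad) may not be the real culprit; rerunning with CUDA_LAUNCH_BLOCKING=1 gives a more accurate error position:

    CUDA_LAUNCH_BLOCKING=1 python train.py --config /data/zhongzhijia/code/mindspore/mindyolo/configs/shwd_yolov8n.yaml --device_target=GPU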

@WongGawa (Collaborator) commented Oct 18, 2024

@Hardy-Chung @waangyuhai Thanks for the feedback. Could you tell us which version of the albumentations library you are using?
It is a known issue that albumentations 1.4.18 produces output that differs from earlier versions; a fix has been merged via PR #376.
Alternatively, we recommend installing albumentations==1.3.1, or any version from 1.4.0 to 1.4.4.
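
To check the installed version, one quick way:

    import albumentations
    print(albumentations.__version__)  # 1.4.18 is the release known to behave differently

and pip install albumentations==1.3.1 pins one of the recommended versions.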

@Hardy-Chung commented Oct 18, 2024

> Thanks for the feedback. Could you tell us which version of the albumentations library you are using? (quoted above)

It is indeed 1.4.18 @WongGawa
