RuntimeError: Exception thrown from user defined Python function in dataset. #370

Open · waangyuhai opened this issue Oct 12, 2024 · 6 comments

@waangyuhai

RuntimeError: Exception thrown from user defined Python function in dataset.

@Hardy-Chung commented Oct 14, 2024

I get the same error.

Problem description

MindYOLO reports this error with the official SHWD dataset; training on the coco2017 dataset fails with a similar error.

Environment

Linux
RTX 4090 GPU
CUDA 11.6
MindSpore 2.3.1
MindYOLO 0.4.0

Error log

Traceback (most recent call last):
  File "/data/zhongzhijia/code/mindspore/mindyolo/train.py", line 320, in <module>
    train(args)
  File "/data/zhongzhijia/code/mindspore/mindyolo/train.py", line 275, in train
    trainer.train(
  File "/data/zhongzhijia/code/mindspore/mindyolo/mindyolo/utils/trainer_factory.py", line 153, in train
    for i, data in enumerate(loader):
  File "/root/miniconda3/envs/mindspore/lib/python3.9/site-packages/mindspore/dataset/engine/iterators.py", line 152, in __next__
    data = self._get_next()
  File "/root/miniconda3/envs/mindspore/lib/python3.9/site-packages/mindspore/dataset/engine/iterators.py", line 277, in _get_next
    raise err
  File "/root/miniconda3/envs/mindspore/lib/python3.9/site-packages/mindspore/dataset/engine/iterators.py", line 260, in _get_next
    return {k: self._transform_md_to_output(t) for k, t in self._iterator.GetNextAsMap().items()}
RuntimeError: Exception thrown from user defined Python function in dataset.

------------------------------------------------------------------
- Python Call Stack:
------------------------------------------------------------------
Traceback (most recent call last):
  File "/root/miniconda3/envs/mindspore/lib/python3.9/site-packages/mindspore/dataset/engine/datasets_user_defined.py", line 312, in process
    result.reraise()
  File "/root/miniconda3/envs/mindspore/lib/python3.9/site-packages/mindspore/dataset/core/py_util_helpers.py", line 65, in reraise
    raise self.except_type(err_msg)
ValueError: Traceback (most recent call last):
  File "/root/miniconda3/envs/mindspore/lib/python3.9/site-packages/mindspore/dataset/engine/datasets_user_defined.py", line 539, in _generator_worker_loop
    result = dataset[idx]
  File "/data/zhongzhijia/code/mindspore/mindyolo/mindyolo/data/dataset.py", line 313, in __getitem__
    sample = getattr(self, func_name)(sample, **_trans)
  File "/data/zhongzhijia/code/mindspore/mindyolo/mindyolo/data/dataset.py", line 1076, in label_pad
    cls_pad[:min(nL, padding_size)] = cls[:min(nL, padding_size)]
ValueError: could not broadcast input array from shape (15,) into shape (15,1)

------------------------------------------------------------------
- Dataset Pipeline Error Message:
------------------------------------------------------------------
[ERROR] Execute user Python code failed, check 'Python Call Stack' above.

------------------------------------------------------------------
- C++ Call Stack: (For framework developers)
------------------------------------------------------------------
mindspore/ccsrc/minddata/dataset/engine/datasetops/source/generator_op.cc(261).
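
The key line is the broadcast failure at the end of the Python call stack: cls is a 1-D array of shape (15,) while cls_pad expects rows of shape (15, 1). A minimal numpy repro (nL and padding_size are example values inferred from the traceback shapes, not MindYOLO's actual defaults):

    import numpy as np

    nL, padding_size = 15, 160          # example values; padding_size is hypothetical
    cls = np.zeros(nL, dtype=np.float32)                        # shape (15,)  -- 1-D
    cls_pad = np.full((padding_size, 1), -1, dtype=np.float32)  # shape (160, 1)
    cls_pad[:min(nL, padding_size)] = cls[:min(nL, padding_size)]
    # ValueError: could not broadcast input array from shape (15,) into shape (15,1)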

Training command

python train.py --config /data/zhongzhijia/code/mindspore/mindyolo/configs/shwd_yolov8n.yaml --device_target=GPU

shwd_yolov8n.yaml

__BASE__: [
  './yolov8/yolov8n.yaml',
]

per_batch_size: 4 # batch size per device; total batch size = per_batch_size * device_num
img_size: 640 # image sizes
weight: /data/zhongzhijia/code/mindspore/mindyolo/configs/yolov8-n_500e_mAP372-cc07f5bd.ckpt
strict_load: False # whether to strictly load the ckpt parameters; defaults to True. If False and the number of classes differs, the weights of the final classifier layer are dropped
log_interval: 10 # print the loss once every log_interval iterations

data:
  dataset_name: shwd
  train_set: /data/zhongzhijia/data/SHWD/train.txt # actual training data path
  val_set: /data/zhongzhijia/data/SHWD/val.txt
  test_set: /data/zhongzhijia/data/SHWD/val.txt
  nc: 2 # number of classes
  # class names
  names: [ 'person',  'hat' ] # name of each class

optimizer:
  lr_init: 0.001  # initial learning rate

@XinhaoLuo666

It looks like MindYOLO hits a dimension error when loading the dataset. I fixed this error by adding a dimension to cls in the label_pad function in dataset.py (the marked line below):

    cls, bboxes = sample['cls'], sample['bboxes']
    cls = np.expand_dims(cls, axis=1)  # add a dimension here: (nL,) -> (nL, 1)
    cls_pad = np.full((padding_size, 1), padding_value, dtype=np.float32)
    bboxes_pad = np.full((padding_size, 4), padding_value, dtype=np.float32)
    nL = len(bboxes)
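
For context, a sketch of how the whole padding step could look with that fix in place (the real label_pad in mindyolo/data/dataset.py may have a different signature and defaults; this only shows where the extra axis matters):

    import numpy as np

    def label_pad(self, sample, padding_size=160, padding_value=-1):
        """Pad per-image labels to a fixed size (sketch; defaults are assumptions)."""
        cls, bboxes = sample['cls'], sample['bboxes']
        if cls.ndim == 1:
            cls = np.expand_dims(cls, axis=1)  # (nL,) -> (nL, 1), matches cls_pad's columns
        cls_pad = np.full((padding_size, 1), padding_value, dtype=np.float32)
        bboxes_pad = np.full((padding_size, 4), padding_value, dtype=np.float32)
        nL = len(bboxes)
        cls_pad[:min(nL, padding_size)] = cls[:min(nL, padding_size)]
        bboxes_pad[:min(nL, padding_size)] = bboxes[:min(nL, padding_size)]
        sample['cls'], sample['bboxes'] = cls_pad, bboxes_pad
        return sample

The ndim guard makes the fix a no-op if the augmentation pipeline already returns cls with shape (nL, 1).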

@waangyuhai (Author)

Thanks for the help, I'll give it a try.

@Hardy-Chung

> It looks like MindYOLO hits a dimension error when loading the dataset. I fixed this error by adding a dimension to cls in the label_pad function in dataset.py (fix quoted above).

That gets one small step further, but training now fails:

2024-10-18 13:44:08,594 [INFO] Epoch 2/500, Step 570/6064, step time: 35.22 ms
2024-10-18 13:44:08,934 [INFO] Epoch 2/500, Step 580/6064, imgsize (640, 640), loss: 11.6492, lbox: 4.8665, lcls: 3.8684, dfl: 2.9143, cur_lr: 0.06384294480085373
2024-10-18 13:44:08,936 [INFO] Epoch 2/500, Step 580/6064, step time: 34.12 ms
2024-10-18 13:44:09,292 [INFO] Epoch 2/500, Step 590/6064, imgsize (640, 640), loss: 12.3432, lbox: 4.9079, lcls: 4.5909, dfl: 2.8444, cur_lr: 0.0637885257601738
2024-10-18 13:44:09,293 [INFO] Epoch 2/500, Step 590/6064, step time: 35.64 ms
[ERROR] RUNTIME_FRAMEWORK(68638,7fea4cbaa700,python):2024-10-18-13:44:09.606.187 [mindspore/ccsrc/runtime/graph_scheduler/actor/actor_common.cc:327] WaitRuntimePipelineFinish] Wait runtime pipeline finish and an error occurred: 
----------------------------------------------------
- cuDNN Error:
----------------------------------------------------
Kernel launch failed | Error Number: 4 CUDNN_STATUS_INTERNAL_ERROR

----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/plugin/device/gpu/kernel/nn/batch_norm_grad_gpu_kernel.cc:207 LaunchKernel

Traceback (most recent call last):
  File "/data/zhongzhijia/code/mindspore/mindyolo/train.py", line 320, in <module>
    train(args)
  File "/data/zhongzhijia/code/mindspore/mindyolo/train.py", line 275, in train
    trainer.train(
  File "/data/zhongzhijia/code/mindspore/mindyolo/mindyolo/utils/trainer_factory.py", line 170, in train
    run_context.loss, run_context.lr = self.train_step(imgs, labels, segments,
  File "/data/zhongzhijia/code/mindspore/mindyolo/mindyolo/utils/trainer_factory.py", line 366, in train_step
    loss, loss_item, _, grads_finite = self.train_step_fn(imgs, labels, True)
  File "/root/miniconda3/envs/mindspore/lib/python3.9/site-packages/mindspore/common/api.py", line 941, in staging_specialize
    out = _MindsporeFunctionExecutor(func, hash_obj, dyn_args, process_obj, jit_config)(*args, **kwargs)
  File "/root/miniconda3/envs/mindspore/lib/python3.9/site-packages/mindspore/common/api.py", line 185, in wrapper
    results = fn(*arg, **kwargs)
  File "/root/miniconda3/envs/mindspore/lib/python3.9/site-packages/mindspore/common/api.py", line 572, in __call__
    output = self._graph_executor(tuple(new_inputs), phase)
RuntimeError: 
----------------------------------------------------
- cuDNN Error:
----------------------------------------------------
Kernel launch failed | Error Number: 4 CUDNN_STATUS_INTERNAL_ERROR

----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/plugin/device/gpu/kernel/nn/batch_norm_grad_gpu_kernel.cc:207 LaunchKernel

[ERROR] DEVICE(68638,7fec40543740,python):2024-10-18-13:45:14.929.981 [mindspore/ccsrc/plugin/device/gpu/hal/device/cuda_driver.cc:209] SyncStream] cudaStreamSynchronize failed, ret[710], device-side assert triggered
[ERROR] DEVICE(68638,7fec40543740,python):2024-10-18-13:45:14.930.026 [mindspore/ccsrc/plugin/device/gpu/hal/device/cuda_driver.cc:211] SyncStream] The kernel name and backtrace in log might be incorrect, since CUDA error might be asynchronously reported at some other function call. Please exporting CUDA_LAUNCH_BLOCKING=1 for more accurate error positioning.
[ERROR] ME(68638,7fec40543740,python):2024-10-18-13:45:14.930.045 [mindspore/ccsrc/runtime/hardware/device_context_manager.cc:490] WaitTaskFinishOnDevice] SyncStream failed
[ERROR] DEVICE(68638,7fec40543740,python):2024-10-18-13:45:15.020.568 [mindspore/ccsrc/plugin/device/gpu/hal/device/cuda_driver.cc:200] DestroyStream] cudaStreamDestroy failed, ret[710], device-side assert triggered
[ERROR] DEVICE(68638,7fec40543740,python):2024-10-18-13:45:15.020.595 [mindspore/ccsrc/plugin/device/gpu/hal/device/gpu_device_manager.cc:70] ReleaseDevice] Op Error: Failed to destroy CUDA stream. | Error Number: 0
[ERROR] DEVICE(68638,7fec40543740,python):2024-10-18-13:45:15.021.051 [mindspore/ccsrc/plugin/device/gpu/hal/device/gpu_device_manager.cc:77] ReleaseDevice] cuDNN Error: Failed to destroy cuDNN handle | Error Number: 4 CUDNN_STATUS_INTERNAL_ERROR
[ERROR] DEVICE(68638,7fec40543740,python):2024-10-18-13:45:15.022.161 [mindspore/ccsrc/plugin/device/gpu/hal/device/cuda_driver.cc:55] FreeDeviceMem] cudaFree failed, ret[710], device-side assert triggered
[ERROR] PRE_ACT(68638,7fec40543740,python):2024-10-18-13:45:15.022.169 [mindspore/ccsrc/backend/common/mem_reuse/mem_dynamic_allocator.cc:937] operator()] Free device memory[0x7fea02000000] error.
[ERROR] DEVICE(68638,7fec40543740,python):2024-10-18-13:45:15.022.184 [mindspore/ccsrc/plugin/device/gpu/hal/device/cuda_driver.cc:55] FreeDeviceMem] cudaFree failed, ret[710], device-side assert triggered
[ERROR] PRE_ACT(68638,7fec40543740,python):2024-10-18-13:45:15.022.189 [mindspore/ccsrc/backend/common/mem_reuse/mem_dynamic_allocator.cc:937] operator()] Free device memory[0x7fe93c000000] error.

Is this running out of memory? I have already set per_batch_size to 1.
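
As the SyncStream log above itself suggests, the CUDA error can be reported asynchronously, so the named kernel (batch_norm_grad) may not be the real culprit; rerunning with CUDA_LAUNCH_BLOCKING=1 gives a more accurate error position:

    CUDA_LAUNCH_BLOCKING=1 python train.py --config /data/zhongzhijia/code/mindspore/mindyolo/configs/shwd_yolov8n.yaml --device_target=GPU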

@WongGawa (Collaborator) commented Oct 18, 2024

@Hardy-Chung @waangyuhai Thanks for the feedback. Could you tell us which version of the albumentations library you are using?
It is a known issue that albumentations 1.4.18 produces output that differs from earlier versions; a fix has been merged via PR #376.
Alternatively, we recommend installing albumentations==1.3.1, or any version from 1.4.0 to 1.4.4.
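
To check the installed version, one quick way:

    import albumentations
    print(albumentations.__version__)  # 1.4.18 is the release known to behave differently

and pip install albumentations==1.3.1 pins one of the recommended versions.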

@Hardy-Chung commented Oct 18, 2024

> Thanks for the feedback. Could you tell us which version of the albumentations library you are using? (quoted above)

It is indeed 1.4.18 @WongGawa
