
The console keeps showing task dispatching for roughly 8 minutes, then stops #1893

Open
stevenchendy opened this issue Dec 26, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@stevenchendy

Describe the bug (Mandatory)
The console keeps showing task dispatching for roughly 8 minutes, and then suddenly reports: [ERROR] ME(130371:281473296949264,MainProcess):2024-12-26-14:22:26.839.923 [mindspore/parallel/cluster/process_entity/_api.py:268] Worker process 130673 exit with exception.

  • Hardware Environment (Ascend/GPU/CPU):

[screenshot: hardware environment]

To Reproduce (Mandatory)
Steps to reproduce the behavior:

  1. pip install --upgrade mindspore
  2. pip install https://repo.mindspore.cn/mindspore-lab/mindnlp/newest/any/mindnlp-0.4.1-py3-none-any.whl
  3. pip uninstall mindformers
  4. git clone https://openi.pcl.ac.cn/lvyufeng/mindnlp
     cd mindnlp
     bash scripts/build_and_reinstall.sh
  5. cd llm/parallel/bert_imdb_finetune_dp
     bash bert_imdb_finetune_npu_mindnlp_trainer.sh
  6. pip install datasets --upgrade -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
  7. pip uninstall soundfile
Expected behavior (Mandatory)
Training should start, but it never does; the job just stays stuck at task dispatching.

Screenshots / Logs (Mandatory)
[three screenshots attached]

Additional context (Optional)
Sometimes one of the cards does succeed.
[screenshot attached]

worker_1 (1).log
worker_0 (1).log

@stevenchendy stevenchendy added the bug Something isn't working label Dec 26, 2024
@jiamn

jiamn commented Dec 28, 2024

Rather than opening a new issue, I'm attaching my case here. I'm on Guiyang-1 and followed the same steps above.
Running python bert_imdb_finetune_cpu_mindnlp_trainer_npus_same.py trains successfully, and I can see one NPU in use,
but with bash bert_imdb_finetune_npu_mindnlp_trainer.sh neither of the two NPUs gets launched.

Looking at the logs, scheduler.log contains a timeout:
RuntimeError: The total number of timed out node is 1. Timed out node list is: [const vector]{0}, worker 0 is the first one timed out, please check its log.

worker_1.log contains a TypeError:
[INFO] GE(152245,python):2024-12-27-16:48:31.612.184 [model_v2_executor_builder.cc:179][EVENT]153739 Build:[GEPERFTRACE] The time cost of ModelV2ExecutorBuilderBuild::All is [1478] micro second.
Traceback (most recent call last):
  File "/home/ma-user/work/mindnlp/llm/parallel/bert_imdb_finetune_dp/bert_imdb_finetune_cpu_mindnlp_trainer_npus_same.py", line 83, in <module>
    main()
  File "/home/ma-user/work/mindnlp/llm/parallel/bert_imdb_finetune_dp/bert_imdb_finetune_cpu_mindnlp_trainer_npus_same.py", line 80, in main
    trainer.train()
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindnlp/engine/trainer/base.py", line 781, in train
    return inner_training_loop(
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindnlp/engine/trainer/base.py", line 1133, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindnlp/engine/trainer/base.py", line 1425, in training_step
    self.update_gradient_by_distributed_type(model)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindnlp/engine/trainer/base.py", line 1390, in update_gradient_by_distributed_type
    new_grads_mean = all_reduce(parameter.grad) / rank_size
TypeError: unsupported operand type(s) for /: 'tuple' and 'int'
[INFO] HCCP(152245,python):2024-12-27-16:48:31.668.579 [ra_host.c:846]tid:153739,ra_socket_batch_connect(846) : Input parameters: [0]th, phy_id[1], local_ip[1.0.0.0], remote_ip[0.0.0.0], tag:
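
The failing line divides the return value of all_reduce by rank_size, so the all_reduce in use here is evidently returning a tuple rather than a bare Tensor (the newer mindspore.communication.comm_func.all_reduce returns a (tensor, handle) pair). A minimal sketch of a version-tolerant mean reduce, assuming that API; mean_all_reduce is a hypothetical helper, not mindnlp code:

import mindspore
from mindspore.communication import get_group_size
from mindspore.communication.comm_func import all_reduce

def mean_all_reduce(grad: mindspore.Tensor) -> mindspore.Tensor:
    # Newer comm_func.all_reduce returns a (tensor, handle) tuple,
    # while older code paths returned a bare Tensor; accept both.
    reduced = all_reduce(grad)
    if isinstance(reduced, tuple):
        reduced = reduced[0]
    # Divide the summed gradient by the number of ranks to get the mean.
    return reduced / get_group_size()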

Near the start of worker_0.log there are many errors that look like dynamic-library problems:
[mindspore/ccsrc/utils/dlopen_macro.h:163] DlsymAscend] Dynamically load symbol aclmdlBundleUnload failed, result = /usr/local/Ascend/ascend-toolkit/latest/lib64/libascendcl.so: undefined symbol: aclmdlBundleUnload
[WARNING] GE_ADPT(152233,ffffa5c57010,python):2024-12-27-16:47:36.238.837 [mindspore/ccsrc/utils/dlopen_macro.h:163] DlsymAscend] Dynamically load symbol aclrtGetMemUceInfo failed, result = /usr/local/Ascend/ascend-toolkit/latest/lib64/libascendcl.so: undefined symbol: aclrtGetMemUceInfo
[WARNING] GE_ADPT(152233,ffffa5c57010,python):2024-12-27-16:47:36.238.857 [mindspore/ccsrc/utils/dlopen_macro.h:163] DlsymAscend] Dynamically load symbol aclrtDeviceTaskAbort failed, result = /usr/local/Ascend/ascend-toolkit/latest/lib64/libascendcl.so: undefined symbol: aclrtDeviceTaskAbort
[WARNING] GE_ADPT(152233,ffffa5c57010,python):2024-12-27-16:47:36.238.875

Could this be a compatibility problem between the system software and the hardware?

Logs attached:
scheduler.log
worker_0.log
worker_1.log

@lvyufeng
Collaborator

lvyufeng commented Jan 2, 2025

Looks like a CANN version mismatch.
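
One quick way to check, assuming a standard wheel install, is MindSpore's built-in self-check, which runs a small op on the installed backend and reports whether the package and the local CANN/driver stack work together:

import mindspore

# Prints the MindSpore version and verifies that a simple
# computation runs on the installed (Ascend) backend.
mindspore.run_check()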
