
The console keeps showing task dispatching for roughly 8 minutes, then stops #1893

Open
stevenchendy opened this issue Dec 26, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@stevenchendy

Describe the bug (Mandatory)
The console keeps showing task dispatching for roughly 8 minutes, and then suddenly reports: [ERROR] ME(130371:281473296949264,MainProcess):2024-12-26-14:22:26.839.923 [mindspore/parallel/cluster/process_entity/_api.py:268] Worker process 130673 exit with exception.

  • Hardware Environment (Ascend/GPU/CPU):

[screenshot: hardware environment]

To Reproduce (Mandatory)
Steps to reproduce the behavior:

  1. pip install --upgrade mindspore
  2. pip install https://repo.mindspore.cn/mindspore-lab/mindnlp/newest/any/mindnlp-0.4.1-py3-none-any.whl
  3. pip uninstall mindformers
  4. git clone https://openi.pcl.ac.cn/lvyufeng/mindnlp
     cd mindnlp
     bash scripts/build_and_reinstall.sh
  5. cd llm/parallel/bert_imdb_finetune_dp
     bash bert_imdb_finetune_npu_mindnlp_trainer.sh
  6. pip install datasets --upgrade -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
  7. pip uninstall soundfile
Expected behavior (Mandatory)
Training should start, but it never does; the job just stays stuck at task dispatching.

Screenshots / Logs (Mandatory)
[three screenshots attached]

Additional context (Optional)
Sometimes one of the cards does succeed.
[screenshot attached]

worker_1 (1).log
worker_0 (1).log

@stevenchendy stevenchendy added the bug Something isn't working label Dec 26, 2024
@jiamn

jiamn commented Dec 28, 2024

Rather than opening a new issue, I'm attaching my case here. I'm on Guiyang-1 and followed the same steps above.
Running python bert_imdb_finetune_cpu_mindnlp_trainer_npus_same.py trains successfully, and I can see one NPU in use,
but with bash bert_imdb_finetune_npu_mindnlp_trainer.sh neither of the two NPUs gets launched.

Looking at the logs, scheduler.log contains a timeout:
RuntimeError: The total number of timed out node is 1. Timed out node list is: [const vector]{0}, worker 0 is the first one timed out, please check its log.

worker_1.log contains a TypeError:
[INFO] GE(152245,python):2024-12-27-16:48:31.612.184 [model_v2_executor_builder.cc:179][EVENT]153739 Build:[GEPERFTRACE] The time cost of ModelV2ExecutorBuilderBuild::All is [1478] micro second.
Traceback (most recent call last):
  File "/home/ma-user/work/mindnlp/llm/parallel/bert_imdb_finetune_dp/bert_imdb_finetune_cpu_mindnlp_trainer_npus_same.py", line 83, in <module>
    main()
  File "/home/ma-user/work/mindnlp/llm/parallel/bert_imdb_finetune_dp/bert_imdb_finetune_cpu_mindnlp_trainer_npus_same.py", line 80, in main
    trainer.train()
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindnlp/engine/trainer/base.py", line 781, in train
    return inner_training_loop(
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindnlp/engine/trainer/base.py", line 1133, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindnlp/engine/trainer/base.py", line 1425, in training_step
    self.update_gradient_by_distributed_type(model)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindnlp/engine/trainer/base.py", line 1390, in update_gradient_by_distributed_type
    new_grads_mean = all_reduce(parameter.grad) / rank_size
TypeError: unsupported operand type(s) for /: 'tuple' and 'int'
[INFO] HCCP(152245,python):2024-12-27-16:48:31.668.579 [ra_host.c:846]tid:153739,ra_socket_batch_connect(846) : Input parameters: [0]th, phy_id[1], local_ip[1.0.0.0], remote_ip[0.0.0.0], tag:
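
The failing line divides the return value of all_reduce by rank_size, so the all_reduce in use here is evidently returning a tuple rather than a bare Tensor (the newer mindspore.communication.comm_func.all_reduce returns a (tensor, handle) pair). A minimal sketch of a version-tolerant mean reduce, assuming that API; mean_all_reduce is a hypothetical helper, not mindnlp code:

import mindspore
from mindspore.communication import get_group_size
from mindspore.communication.comm_func import all_reduce

def mean_all_reduce(grad: mindspore.Tensor) -> mindspore.Tensor:
    # Newer comm_func.all_reduce returns a (tensor, handle) tuple,
    # while older code paths returned a bare Tensor; accept both.
    reduced = all_reduce(grad)
    if isinstance(reduced, tuple):
        reduced = reduced[0]
    # Divide the summed gradient by the number of ranks to get the mean.
    return reduced / get_group_size()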

Near the start of worker_0.log there are many errors that look like dynamic-library problems:
[mindspore/ccsrc/utils/dlopen_macro.h:163] DlsymAscend] Dynamically load symbol aclmdlBundleUnload failed, result = /usr/local/Ascend/ascend-toolkit/latest/lib64/libascendcl.so: undefined symbol: aclmdlBundleUnload
[WARNING] GE_ADPT(152233,ffffa5c57010,python):2024-12-27-16:47:36.238.837 [mindspore/ccsrc/utils/dlopen_macro.h:163] DlsymAscend] Dynamically load symbol aclrtGetMemUceInfo failed, result = /usr/local/Ascend/ascend-toolkit/latest/lib64/libascendcl.so: undefined symbol: aclrtGetMemUceInfo
[WARNING] GE_ADPT(152233,ffffa5c57010,python):2024-12-27-16:47:36.238.857 [mindspore/ccsrc/utils/dlopen_macro.h:163] DlsymAscend] Dynamically load symbol aclrtDeviceTaskAbort failed, result = /usr/local/Ascend/ascend-toolkit/latest/lib64/libascendcl.so: undefined symbol: aclrtDeviceTaskAbort
[WARNING] GE_ADPT(152233,ffffa5c57010,python):2024-12-27-16:47:36.238.875

Could this be a compatibility problem between the system software and the hardware?

Logs attached:
scheduler.log
worker_0.log
worker_1.log

@lvyufeng
Collaborator

lvyufeng commented Jan 2, 2025

Looks like a CANN version mismatch.
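
One quick way to check, assuming a standard wheel install, is MindSpore's built-in self-check, which runs a small op on the installed backend and reports whether the package and the local CANN/driver stack work together:

import mindspore

# Prints the MindSpore version and verifies that a simple
# computation runs on the installed (Ascend) backend.
mindspore.run_check()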
