The prompt keeps showing task distribution, then stops after roughly 8 minutes #1893
Comments
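Rather than opening a new issue, I am appending here. I am using Guiyang-1 (贵阳一) and followed the steps above. Looking at the logs: scheduler.log contains a timeout, worker_1.log contains a TypeError, and worker_0.log has many errors near the start that appear to be related to dynamic libraries. Could this be a compatibility problem between the system software and the hardware?
It looks like the CANN version does not match.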
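The log symptoms mentioned in the comments above (a timeout in scheduler.log, a TypeError in worker_1.log, dynamic-library errors in worker_0.log) can be located with a small scan. The following is a minimal sketch, not part of MindNLP or MindSpore: the function names and the regular expressions are assumptions chosen for illustration, and the exact wording of the real log messages may differ.

```python
import re
from pathlib import Path

# Assumed patterns; adjust them to the actual messages in your logs.
PATTERNS = {
    "timeout": re.compile(r"time(d)?\s*out", re.IGNORECASE),
    "type_error": re.compile(r"TypeError"),
    "dynamic_library": re.compile(
        r"\.so\b|shared object|undefined symbol", re.IGNORECASE
    ),
}


def scan_log(path):
    """Return (line_number, label, text) for every line matching a pattern."""
    hits = []
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    for number, line in enumerate(text.splitlines(), start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((number, label, line.strip()))
    return hits


def scan_directory(log_dir="."):
    """Scan every *.log file in log_dir; keep only files with matches."""
    results = {}
    for path in sorted(Path(log_dir).glob("*.log")):
        hits = scan_log(path)
        if hits:
            results[path.name] = hits
    return results


if __name__ == "__main__":
    # Run from the directory that holds scheduler.log and worker_*.log.
    for name, hits in scan_directory().items():
        print(f"== {name}")
        for number, label, line in hits:
            print(f"{number:>6} [{label}] {line[:200]}")
```

Posting the first few matching lines from each file, rather than only the file names, usually makes it easier for maintainers to tell a version mismatch apart from a script error.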
Describe the bug (Mandatory)
The prompt keeps showing task distribution; after roughly 8 minutes it suddenly reports: [ERROR] ME(130371:281473296949264,MainProcess):2024-12-26-14:22:26.839.923 [mindspore/parallel/cluster/process_entity/_api.py:268] Worker process 130673 exit with exception.
Hardware environment (Ascend/GPU/CPU):
To Reproduce (Mandatory)
Steps to reproduce the behavior:
cd mindnlp
bash scripts/build_and_reinstall.sh
cd llm
cd parallel
cd bert_imdb_finetune_dp
bash bert_imdb_finetune_npu_mindnlp_trainer.sh
pip install datasets --upgrade -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
pip uninstall soundfile
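Because the steps above upgrade `datasets` and uninstall `soundfile`, it can help to record which package versions actually ended up installed before rerunning the script. The sketch below uses only the standard library; the function name is hypothetical and the package list is an assumption based on the steps in this issue. It reports Python package versions only and does not detect the CANN version.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed distribution names, taken from the packages mentioned in this issue.
PACKAGES = ("mindspore", "mindnlp", "datasets", "soundfile")


def installed_versions(names=PACKAGES):
    """Map each package name to its installed version, or None if absent."""
    report = {}
    for name in names:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = None
    return report


if __name__ == "__main__":
    for name, found in installed_versions().items():
        print(f"{name}: {found if found is not None else 'not installed'}")
```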
Expected behavior (Mandatory)
Training should run, but instead the job hangs and keeps showing task distribution.
Screenshots / Logs (Mandatory)
Additional context (Optional)
Sometimes one of the cards succeeds.
worker_1 (1).log
worker_0 (1).log