[kunlunxin] transformer model, fix running error #337

Merged
merged 6 commits into from
Dec 15, 2023

Conversation

chenrui9312
Contributor

xacc args
install dllogger
compatibility with newer numpy
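
The description does not say which NumPy incompatibility was hit. One common cause, assumed here purely for illustration, is older training code that uses the aliases `np.float`, `np.int`, or `np.bool`, which were deprecated in NumPy 1.20 and removed in 1.24. A minimal sketch of a shim for that case follows; `restore_numpy_aliases` is a hypothetical helper, not code from this PR.

```python
# Hypothetical compatibility shim (an assumption, not the PR's actual fix).
# NumPy 1.24 removed aliases such as np.float and np.int; legacy code that
# still references them fails with AttributeError.
_REMOVED_ALIASES = {
    "float": float,
    "int": int,
    "bool": bool,
    "object": object,
    "complex": complex,
    "str": str,
}


def restore_numpy_aliases(module):
    """Re-add any missing alias to `module`; return the names that were added."""
    restored = []
    for name, builtin in _REMOVED_ALIASES.items():
        if not hasattr(module, name):
            setattr(module, name, builtin)
            restored.append(name)
    return restored
```

It would be called once, before importing the legacy model code, as `import numpy as np; restore_numpy_aliases(np)`. Replacing the aliases in the source (for example `np.float` with `float`) is generally the cleaner fix; a shim like this only helps when the offending code lives in a third-party dependency.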

@yuzhou03
Contributor

yuzhou03 commented Nov 23, 2023

Not yet announced in the main group chat. -- 【Announced】

@yuzhou03 yuzhou03 changed the title transformer model, fix running error 【昆仑芯】transformer model, fix running error Nov 24, 2023
@yuzhou03 yuzhou03 changed the title 【昆仑芯】transformer model, fix running error [kunlunxin] transformer model, fix running error Nov 24, 2023
@yuzhou03
Contributor

yuzhou03 commented Nov 24, 2023

Please add a run example to test_conf.py.

@yuzhou03
Contributor

yuzhou03 commented Nov 24, 2023

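Please submit the Kunlunxin 1x1 and 2x8 configs and training logs.

Requirements for vendor-submitted training logs

  1. 1x8: run to convergence or near convergence (if the final acc shows a gap, please check with BAAI (智源) whether the gap is acceptable).
  2. 1x1 and 2x8: run for about 2 hours. The loss must decrease normally and the loss curve must be broadly consistent with the standard case. The finished_info output must include the performance data (p_whole, p_train, p_core). Running to convergence is not required.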

@yuzhou03
Contributor

1x8 【OK】

image

rebuild new image
image

@yuzhou03
Contributor

1x1 【OK】
image

rebuild new image
image

@chenrui9312 chenrui9312 force-pushed the main branch 2 times, most recently from d6dc710 to d095ea4 Compare December 6, 2023 09:07
@chenrui9312
Contributor Author

chenrui9312 commented Dec 7, 2023

transformer xpu1x1 log: node-006:/data/chenrui/workspace/FlagPerf/training/result/run20231206001719
transformer xpu2x8 log: node-006:/data/chenrui/workspace/FlagPerf/training/result/run20231206145428

【Archived】

@chenrui9312
Contributor Author

chenrui9312 commented Dec 7, 2023

transformer xpu1x8 log: node-006:/data/chenrui/workspace/FlagPerf/training/result/run20231207104129

【Archived】

@yuzhou03
Contributor

yuzhou03 commented Dec 7, 2023

Please update the case README.

@yuzhou03
Contributor

yuzhou03 commented Dec 7, 2023

Please resolve the code conflicts.

@yuzhou03
Contributor

yuzhou03 commented Dec 8, 2023

2x8 training startup 【OK】

image

image

image

After 45 minutes of training, a Kernel Exception occurred. This was resolved after updating to commit a6b3f6f93db80c9c406b16cc633645f3714f901a 【OK】
image

image

@yuzhou03
Contributor

After the code was re-pushed: 1x8 rebuild image & start container 【OK】
image

image

@yuzhou03
Contributor

yuzhou03 commented Dec 14, 2023

After the code was re-pushed: 1x1 rebuild image & start container 【OK】
image
image

@yuzhou03 yuzhou03 merged commit 25d4d36 into FlagOpen:main Dec 15, 2023
1 check failed