fp16 support issue #41

Open

XUWeijiang opened this issue Oct 9, 2023 · 1 comment

XUWeijiang commented Oct 9, 2023

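Since I only have V100 machines on hand right now, I tried training with fp16 (bf16 is rather slow).

However, I found that with fp16 no training actually seems to happen: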

Megatron-LLaMA/megatron/optimizer/optimizer.py

Line 433 in 25306de

if found_inf_flag:

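This check always evaluates to True, meaning inf/nan is found on every step, so training cannot make progress.

I have run the same dataset with bf16 and did not hit this problem. I also tried setting --initial-loss-scale to a fairly small value, and that did not help either.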
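To make the symptom concrete, here is a minimal sketch of dynamic loss scaling with a skip-on-overflow check. It is not the Megatron-LLaMA implementation: `DynamicLossScaler`, `to_fp16_range`, and `step` are hypothetical names, the backoff and growth parameters are assumed defaults, and fp16 overflow is only emulated in pure Python by clamping to the largest finite float16 value (65504).

```python
import math

FP16_MAX = 65504.0  # largest finite float16 value


def to_fp16_range(x):
    """Emulate fp16 overflow: magnitudes above FP16_MAX become +/-inf."""
    if math.isnan(x):
        return x
    if abs(x) > FP16_MAX:
        return math.copysign(math.inf, x)
    return x


class DynamicLossScaler:
    """Hypothetical scaler: halve the scale on overflow, grow it after a run of good steps."""

    def __init__(self, initial_scale=2.0 ** 16, min_scale=1.0,
                 backoff_factor=0.5, growth_factor=2.0, growth_interval=1000):
        self.scale = initial_scale
        self.min_scale = min_scale
        self.backoff_factor = backoff_factor
        self.growth_factor = growth_factor
        self.growth_interval = growth_interval
        self.good_steps = 0

    def update(self, found_inf):
        if found_inf:
            # Overflow: shrink the scale (never below min_scale) and restart the counter.
            self.scale = max(self.scale * self.backoff_factor, self.min_scale)
            self.good_steps = 0
        else:
            self.good_steps += 1
            if self.good_steps % self.growth_interval == 0:
                self.scale *= self.growth_factor


def step(unscaled_grads, scaler):
    """Return True if the optimizer update would be applied, False if it is skipped."""
    scaled = [to_fp16_range(g * scaler.scale) for g in unscaled_grads]
    found_inf_flag = any(math.isinf(g) or math.isnan(g) for g in scaled)
    scaler.update(found_inf_flag)
    if found_inf_flag:
        return False  # skip the parameter update for this iteration
    return True
```

In this sketch, finite gradients such as `[1.5, -3.0]` are skipped only until the scale has backed off far enough for the scaled values to fit in fp16. If a NaN is already present before scaling (for example `[float("nan"), 1.0]`), every step is skipped no matter how small the scale becomes. That would be consistent with lowering --initial-loss-scale not helping, though it is only one possible explanation.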

Collaborator

li-yi-dong commented Oct 10, 2023

Sorry, fp16 has not been validated much on our side. We will look into it soon.
