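Hello, I am trying to reproduce the model's training process. I have tried both A6000 and H100 GPUs, and in both cases training suddenly runs out of GPU memory after a few updates. Did you run into a similar problem during development? Thank you!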
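@LILIXIYA Hello. Training on the A6000 and H100 does work, but it may need multiple cards, preferably a machine with two cards, or four or more. If GPU memory is still insufficient, you can try LoRA or the ZeRO-3 parallel strategy to reduce memory use, and you can also reduce batch_size.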
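
As a rough illustration of the ZeRO-3 and smaller batch_size suggestions, here is a minimal sketch that builds a DeepSpeed-style config dictionary. It lowers the per-GPU micro-batch and makes up the difference with gradient accumulation, so the effective global batch size stays the same. The helper name `build_zero3_config` and the numbers (global batch 64, 4 GPUs, micro-batch 2) are assumptions for illustration only; this repository's actual training script and config files may use different values or options.

```python
import json


def build_zero3_config(global_batch_size, num_gpus, micro_batch_per_gpu):
    """Hypothetical helper: build a DeepSpeed-style ZeRO-3 config dict.

    DeepSpeed expects:
    train_batch_size == micro_batch_per_gpu * grad_accum_steps * num_gpus
    """
    samples_per_step = num_gpus * micro_batch_per_gpu
    if global_batch_size % samples_per_step != 0:
        raise ValueError(
            "global_batch_size must be divisible by num_gpus * micro_batch_per_gpu"
        )
    grad_accum_steps = global_batch_size // samples_per_step
    return {
        "train_batch_size": global_batch_size,
        # A smaller micro-batch lowers peak activation memory on each GPU.
        "train_micro_batch_size_per_gpu": micro_batch_per_gpu,
        "gradient_accumulation_steps": grad_accum_steps,
        "bf16": {"enabled": True},
        # Stage 3 shards parameters, gradients, and optimizer states across GPUs.
        "zero_optimization": {"stage": 3},
    }


# Illustrative values only (assumed, not taken from this repository).
config = build_zero3_config(global_batch_size=64, num_gpus=4, micro_batch_per_gpu=2)
print(json.dumps(config, indent=2))
```

The printed JSON could be saved to a file and passed to a DeepSpeed-enabled training script, assuming the script accepts a DeepSpeed config file. If memory is still exhausted, lowering `micro_batch_per_gpu` to 1 doubles the accumulation steps while leaving the global batch size unchanged.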