GPU memory problem during training #36

Open
LILIXIYA opened this issue Jul 29, 2024 · 1 comment

@LILIXIYA

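Hello, I am trying to reproduce the model's training process. I have tried on both A6000 and H100 GPUs, and in both cases training suddenly runs out of GPU memory after a few updates. Did you run into a similar problem during development? Thank you!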

@jymChen
Contributor

jymChen commented Aug 12, 2024

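@LILIXIYA Hello, training does work on A6000 and H100, but it may need multiple GPUs; a machine with two, or preferably four or more, cards is best. If GPU memory is still insufficient, you can try LoRA or the ZeRO-3 parallel strategy to reduce memory usage, and you can also reduce batch_size.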
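
For anyone who wants to try the ZeRO-3 and smaller batch_size suggestions together, here is a minimal sketch of how such a configuration could be built. It assumes the training script accepts a DeepSpeed-style JSON config, which this thread does not confirm. The helper name `build_zero3_config` and the example numbers (global batch of 64, 4 GPUs) are hypothetical. The idea is to shrink the per-GPU micro batch and make up the difference with gradient accumulation, so the effective batch size stays the same.

```python
import json


def build_zero3_config(global_batch_size: int, micro_batch_size: int, num_gpus: int) -> dict:
    """Build a DeepSpeed-style ZeRO stage 3 config dict.

    Hypothetical helper for illustration only. It keeps the global batch
    size fixed while the per-GPU micro batch is reduced to save memory.
    """
    samples_per_step = micro_batch_size * num_gpus
    if global_batch_size % samples_per_step != 0:
        raise ValueError(
            "global_batch_size must be divisible by micro_batch_size * num_gpus"
        )
    # Fewer samples per GPU per forward pass means more accumulation steps.
    accumulation_steps = global_batch_size // samples_per_step
    return {
        "train_batch_size": global_batch_size,
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "gradient_accumulation_steps": accumulation_steps,
        # Assumption: bf16 is acceptable for this model; both A6000 and H100 support it.
        "bf16": {"enabled": True},
        # Stage 3 partitions parameters, gradients and optimizer states across GPUs.
        "zero_optimization": {
            "stage": 3,
            "overlap_comm": True,
            "stage3_gather_16bit_weights_on_model_save": True,
        },
    }


if __name__ == "__main__":
    # Example: 4 GPUs, micro batch of 2 per GPU, effective batch of 64.
    print(json.dumps(build_zero3_config(64, 2, 4), indent=2))
```

Saving the printed JSON to a file and pointing the launcher at it is the usual pattern, but the exact flag depends on how this repository's training script is wired up, so check its arguments first.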
