OOM after training completes #2
Comments
Which model are you training, and with which method? Does the error occur when the model is saved after training finishes?
ChatGLM, using lora_deepspeed on 4x A10 (24G). The OOM is reported as soon as training finishes. The save-model part of the source code is commented out.
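You probably have not set up the DeepSpeed config; you need to run accelerate config. In addition, DeepSpeed hits a bug at run time that requires a small change to its source code. For the exact steps, see this blog post: https://www.cnblogs.com/jiangxinyang/p/17330352.html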
OK, thanks a lot for the guidance!
Following your guidance, I set up the DeepSpeed config, but running bash run.sh now fails with a new error: Error: Incorrect padding(base64)
This looks like an encoding problem in the training data. Try reading the data with UTF-8 encoding.
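A minimal sketch of reading training data with an explicit UTF-8 encoding, in case it helps rule that cause out. It assumes the data is a JSON-lines file, which the thread does not confirm; `load_jsonl_utf8` is a hypothetical helper and not part of this repository.

```python
import json


def load_jsonl_utf8(path):
    """Read a JSON-lines file, decoding it explicitly as UTF-8.

    Assumption: one JSON object per line. Adjust for the real data format.
    """
    records = []
    # "utf-8-sig" decodes UTF-8 and also drops a leading BOM if one is present.
    with open(path, "r", encoding="utf-8-sig") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as e:
                # Report which line is malformed instead of failing silently.
                raise ValueError(f"{path}:{line_no}: invalid JSON") from e
    return records
```

If the file is not valid UTF-8, `open` raises `UnicodeDecodeError` while reading, which points at an encoding problem rather than a problem in the training code.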
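So far it does not look like a UTF-8 encoding problem. The error appears only after running accelerate config. Without accelerate config, training runs normally, but saving after training finishes hits OOM. I am not sure whether this is related to the accelerate version.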
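For reference, "Incorrect padding" is the message Python's `binascii` module raises when a base64 string is missing its trailing `=` padding. The sketch below only reproduces that message with the standard library; where in the accelerate or DeepSpeed setup the base64 decoding happens is not known from this thread. `try_b64decode` and `pad_b64` are hypothetical helpers.

```python
import base64
import binascii


def try_b64decode(text):
    """Return the decoded bytes, or the error message if decoding fails."""
    try:
        return base64.b64decode(text)
    except binascii.Error as e:
        return str(e)


def pad_b64(text):
    """Append '=' characters until the length is a multiple of 4."""
    return text + "=" * (-len(text) % 4)


# "aGVsbG8=" is the base64 encoding of b"hello"; drop the "=" to break it.
broken = "aGVsbG8"
print(try_b64decode(broken))           # decoding fails with a padding error
print(try_b64decode(pad_b64(broken)))  # decodes once the padding is restored
```

If the error comes from a truncated or hand-edited base64 value, restoring the padding will not recover the lost data; it only shows what the message means.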
The model reports OOM as soon as training finishes. How can this be fixed?
OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 22.06 GiB total capacity;
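One approach sometimes used to lower memory pressure when saving a LoRA-trained model is to write out only the adapter weights, moved to CPU, instead of the full model. The sketch below is not a confirmed fix for this issue and is not this repository's code. It assumes that adapter parameter names contain the substring `lora_`, and `lora_only_state_dict` is a hypothetical helper.

```python
def lora_only_state_dict(state_dict, marker="lora_"):
    """Return a new dict holding only adapter entries, moved to CPU if possible.

    Assumption: adapter parameters are identified by `marker` in their name.
    """
    filtered = {}
    for name, value in state_dict.items():
        if marker not in name:
            continue  # skip the frozen base-model weights
        # Tensor-like objects expose .detach() and .cpu(); others are kept as-is.
        if hasattr(value, "detach") and hasattr(value, "cpu"):
            value = value.detach().cpu()
        filtered[name] = value
    return filtered
```

The result could then be written with, for example, `torch.save(filtered, path)`. Whether this avoids the OOM depends on where the allocation actually happens, which the truncated traceback above does not show.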