Replies: 1 comment
- Use a smaller batch size (the easiest and most obvious fix).
- Use a different activation checkpointing configuration (try setting the `contiguous_memory_optimization` parameter to `false` to see if that helps; see the sketch below).
- Use a different optimizer (some optimizers, such as AdamW, use more memory than others).
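As a rough illustration of these suggestions, here is a minimal sketch of the relevant parts of a DeepSpeed config expressed as a Python dict passed to `deepspeed.initialize`. The model, batch sizes, learning rate, and other values are placeholders, not taken from the original post, and the `activation_checkpointing` section only takes effect if the model uses DeepSpeed's activation checkpointing API.

```python
import deepspeed
import torch

# Illustrative sketch only; all values are placeholders, not a known-good
# configuration for any particular model.
ds_config = {
    # A smaller per-GPU micro-batch size directly reduces activation memory;
    # gradient accumulation can keep the effective batch size the same.
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 16,
    # Disabling contiguous_memory_optimization avoids pre-allocating a large
    # contiguous buffer for checkpointed activations, which can help when
    # GPU memory is tight or fragmented.
    "activation_checkpointing": {
        "partition_activations": True,
        "contiguous_memory_optimization": False,
        "cpu_checkpointing": False,
    },
    # Optimizer choice also affects memory: AdamW keeps two extra states per
    # parameter, while plain SGD keeps at most one.
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": 1e-4},
    },
}

# Placeholder model so the sketch is self-contained; run this under the
# DeepSpeed launcher (e.g. `deepspeed train.py`) so distributed init works.
model = torch.nn.Linear(16, 16)

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```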
Here is my config:
I'm using LoRA (https://arxiv.org/abs/2106.09685), so the gradients and optimizer states shouldn't take much memory, since the number of trainable parameters is very small. But I still get OOM during the forward pass, even though I'm already using ZeRO-3 with offload. Are there ways to reduce forward-pass memory usage? Thanks.
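For context (the actual config was not captured above), a ZeRO-3 setup with parameter and optimizer offload typically looks roughly like the following. This is an illustrative sketch of that kind of configuration, not the poster's config; all values, including the stage-3 memory budgets, are placeholder assumptions.

```python
# Illustrative sketch of a ZeRO-3 config with CPU offload, NOT the poster's
# actual config. All values are placeholders.
zero3_offload_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 3,                        # partition params, grads, and optimizer states
        "offload_param": {                 # keep partitioned parameters on CPU
            "device": "cpu",
            "pin_memory": True,
        },
        "offload_optimizer": {             # keep optimizer states on CPU
            "device": "cpu",
            "pin_memory": True,
        },
        # Smaller live-parameter and prefetch budgets trade speed for less GPU
        # memory held during the forward pass.
        "stage3_max_live_parameters": 1e8,
        "stage3_prefetch_bucket_size": 1e7,
    },
    "bf16": {"enabled": True},
}
```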