GRPO split generations into multiple training batches #3017
Comments
Let's say you have 8 GPUs; in the limit you can have
That's not quite it. It's `per_device_train_batch_size * num_devices` that must be a multiple of `num_generations`. While I understand the motivation, I think it's not straightforward to implement.
Ah yes, sorry, I forgot about the number of devices. Though this doesn't change much, right? We just amend my statement to
Is it complicated because currently the `prepare_inputs` method does both the generation and the score calculation, and then the inputs are passed straight to the `compute_loss` method by the `Trainer` superclass? I can see how it could cause more issues than it is worth, having to fiddle with the core pipeline just for one trainer. I just thought I would bring it up because I noticed how much smoother training seemed when I was able to raise the number of generations with smaller models, and this seemed to be the big bottleneck to that.
Yes that's correct
You can actually increase the number of generations quite high. For example, if you have 8 GPUs that can each handle 4 generations, you can use up to 32 generations per prompt.
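For concreteness, one way that setup might look (a sketch; the exact arguments and launch details are assumptions, not from the thread):

```python
from trl import GRPOConfig

# Launched across 8 GPUs (e.g. with `accelerate launch`), the global generation
# batch is 8 * 4 = 32 completions per step, so up to 32 generations per prompt fit.
config = GRPOConfig(
    output_dir="grpo-out",            # hypothetical path
    per_device_train_batch_size=4,
    num_generations=32,
)
```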
Ok, I understand, thanks for your prompt responses. Unfortunately I am mostly interested in using this on my personal GPU, so I am not using multi-GPU clusters. Thanks for your time; I am happy for the issue to be closed since it is not deemed feasible.
With 1 GPU, the best you can do is to set
To have an effective batch size of 128.
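One combination that reaches that number on a single GPU (illustrative only; the exact values the maintainer intended are not shown above):

```python
from trl import GRPOConfig

# 8 completions per prompt, accumulated over 16 steps: 8 * 16 = 128 effective batch.
config = GRPOConfig(
    output_dir="grpo-out",            # hypothetical path
    per_device_train_batch_size=8,
    gradient_accumulation_steps=16,
    num_generations=8,
)
```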
I understand this, but it doesn't solve the issue of the loss function being an estimate based on a sample size of 8. In the GRPO loss formulation, the expectation we estimate is conditional on the input prompt, as are the advantage calculations, and just increasing the gradient accumulation to 16 gives us 16 high-variance estimates of the expectation rather than one low-variance estimate. I hope this makes sense. As I said before, I can see why this is deemed not worth it, since most large-scale use cases can probably afford to just add more GPUs. I had just hoped it would be an easy adjustment that would allow us hobbyists to stick closer to the theory of the paper.
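To put a number on that variance argument (a standard sample-mean fact, not something stated in the thread): if the per-prompt estimate is a mean over $G$ completions whose rewards have variance $\sigma^2$, then

$$\operatorname{Var}\!\left(\frac{1}{G}\sum_{i=1}^{G} r_i \,\middle|\, \text{prompt}\right) = \frac{\sigma^2}{G},$$

so the standard error shrinks as $1/\sqrt{G}$: going from $G=8$ to $G=32$ halves it, whereas extra gradient-accumulation steps add more prompts but leave each per-prompt estimate at sample size $G$.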
Then you should increase |
In fact, this is tricky, as it would involve sampling, generating and calculating the advantage for the whole batch, then iterating somehow over the batch. It's not impossible, but it adds an implementation complexity that I don't think is justified.
Forgive my naivety, but would it not be as simple as overriding the
to something like
I have added comments starting with
Sorry, I am not trying to be a pain. As I said previously, I am happy for you to close this if it is just a no-go. Just thought I would offer the suggestion in case it helped.
It might work, but that's the complexity I want to avoid. Forking the repo might be the best option here. Or subclass |
Ok, I am happy to do that. I won't bog you down anymore on this. |
Actually, being restricted in the minibatch size by the number of trajectories is very limiting.
If I understand correctly, `per_device_train_batch_size` is an integer, which means a single GPU must be able to handle the backward pass. An H100 has roughly 80 GB of memory, and I hit a GPU OOM with the Qwen2-7B model. If I'm right, this is quite a constraint, as bigger models cannot be run.
Hi @JamesBowerXanda, I ran into a similar thing to what you had and needed a larger generation batch size. I've implemented something which you can run using this. As mentioned above, I overrode `training_step` within `GRPOTrainer` for this to work.
You can find it here.
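For readers landing here, a minimal sketch of the general idea (this is not the linked implementation; the `GRPOTrainer` internals, input keys, and the extra `loss_micro_batch_size` argument are assumptions):

```python
import torch
from trl import GRPOTrainer


class SplitGRPOTrainer(GRPOTrainer):
    """Backprop each generation batch in smaller micro-batches (illustrative only)."""

    def __init__(self, *args, loss_micro_batch_size=2, **kwargs):
        super().__init__(*args, **kwargs)
        self.loss_micro_batch_size = loss_micro_batch_size  # hypothetical knob

    def training_step(self, model, inputs, num_items_in_batch=None):
        model.train()
        # In GRPOTrainer, _prepare_inputs runs generation and reward scoring, so
        # advantages are still computed over the full group of completions.
        inputs = self._prepare_inputs(inputs)

        batch_size = next(v.shape[0] for v in inputs.values() if torch.is_tensor(v))
        total_loss = torch.zeros((), device=self.accelerator.device)

        for start in range(0, batch_size, self.loss_micro_batch_size):
            end = min(start + self.loss_micro_batch_size, batch_size)
            micro = {k: (v[start:end] if torch.is_tensor(v) else v) for k, v in inputs.items()}
            # Scale so the sum over micro-batches matches the mean over the full batch.
            loss = self.compute_loss(model, micro) * (end - start) / batch_size
            self.accelerator.backward(loss)
            total_loss += loss.detach()

        return total_loss  # gradient-accumulation scaling omitted for brevity
```

Generation and advantage computation still happen once for the whole batch; only the loss and backward pass are chunked, which keeps the estimator faithful to the paper while capping activation memory.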
Feature request
In the GRPO training it would be useful if you could split the generations into smaller batches for the gradient calculations similar to how we split batches into multiple gradient calculations with gradient_accumulation_steps.
I am imagining the config to work something like this:
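Something along these lines (illustrative only; this reflects the request, not an existing behaviour of `GRPOConfig`):

```python
from trl import GRPOConfig

# num_generations (16) exceeds per_device_train_batch_size (4); the requested
# behaviour would only require 4 * 8 = 32 to be a multiple of 16.
config = GRPOConfig(
    output_dir="grpo-out",            # hypothetical path
    num_generations=16,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
)
```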
with the condition that `per_device_train_batch_size * gradient_accumulation_steps` is a multiple of `num_generations`.

Motivation
In the GRPO algorithm the loss calculation (ignoring the KL part) is an estimate of an expectation under the current model's distribution. This estimate will have very high variance if we limit the sample size (number of generations) to small numbers, giving us a poor estimate of the expectation and therefore making training less stable.
Currently `per_device_train_batch_size` must be a multiple of `num_generations`, which can severely limit how large you can make `num_generations` before hitting OOM, particularly in resource-constrained environments working with long context windows. This seems like an unnecessary restriction, since nothing in the algorithm stops us from splitting the gradient calculation of a generation batch into multiple smaller batches.

Your contribution
I don't think I would be able to create the PR myself unfortunately.