[GRPO] Reduce the number of steps where the loss stays at 0, accelerate training #2869

Open · zhangsheng377 wants to merge 1 commit into main

Conversation


@zhangsheng377 commented Feb 15, 2025

#2703

Because the advantages are standardized, they sum to 0 in the loss calculation, so the loss reduces to loss = β·KL. And in the first step, the policy model and the reference model are identical, so KL = 0, which means the first loss must be exactly 0.
We believe that the main reason GRPO can still start training from an initial loss of 0, besides still having gradients, is GPU floating-point error.
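For concreteness, here is a sketch of why the step-1 loss value is zero, written against the per-token GRPO loss quoted later in this thread (sg denotes detach/stop-gradient, T the number of tokens, Â the standardized advantage):

$$
\mathcal{L} \;=\; -\frac{1}{T}\sum_{t}\Big[\underbrace{e^{\log\pi_\theta - \mathrm{sg}(\log\pi_\theta)}}_{=\,1}\,\hat{A} \;-\; \beta\,\mathrm{KL}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big)\Big] \;\approx\; -\,\bar{\hat{A}} \;+\; \beta\,\overline{\mathrm{KL}} \;=\; \beta\,\overline{\mathrm{KL}}
$$

since the standardized advantages average to (approximately) zero; and at step 1 the policy equals the reference model, so the KL term vanishes too and the loss value is 0.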

So our acceleration approach is to add a small perturbation to the loss, amplifying the model changes that in the first few steps otherwise come only from computational error. Once proper training begins, the added perturbation is too small to matter and can be ignored, or it acts like a light form of dropout that may improve the model's generalization.
Specifically, we add the perturbation after the rewards are standardized, because the key reason the first step has a loss of 0 is that the standardized rewards sum to exactly 0.

With this change, the loss receives a perturbation with mean 0 and variance 1e-4. We observed that the initial KL is larger than 1e-4, so after the first few accelerated steps the 1e-4 perturbation quickly becomes negligible.
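A minimal sketch of the idea, assuming the perturbation is Gaussian noise with standard deviation 1e-2 (variance 1e-4) added right after reward standardization; the function and variable names are illustrative, not the exact PR diff:

```python
import torch

def standardized_advantages_with_noise(rewards: torch.Tensor, noise_std: float = 1e-2) -> torch.Tensor:
    """Standardize group rewards into advantages, then add small Gaussian noise.

    Sketch of the proposal: the noise (mean 0, variance noise_std**2 = 1e-4) breaks the
    exact cancellation that makes the step-1 loss value 0; once the KL term grows past
    ~1e-4, the perturbation is negligible by comparison.
    """
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Perturb *after* standardization, so the advantages no longer sum to exactly zero.
    return advantages + noise_std * torch.randn_like(advantages)
```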

Verification result:
[figure: training loss curves with and without the perturbation]

With the above change, the loss starts to move away from 0 after about 15 training steps;
without it, more than 100 steps are needed to reach the same level.

@XZ-X (Contributor) commented Feb 15, 2025

It's an interesting and inspiring finding! But I'm not sure I fully understand one technical point in the analysis: a loss value of zero doesn't mean the model isn't learning, does it?

There might be some negative advantages, meaning that reducing the coefficient on those terms would push the loss below zero. That is, the optimization would still try to reduce the probability of the tokens with negative advantages.

Specifically, I agree with you that the expression torch.exp(per_token_logps - per_token_logps.detach()) * advantages.unsqueeze(1) always evaluates to advantages.unsqueeze(1). From my understanding, the gradient is not zero though (due to the detach operation)?
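A toy check of this point (not the trainer's full loss, which also includes the KL term; the tensor values here are made up):

```python
import torch

# The expression's *value* reduces to the advantages, but its *gradient* w.r.t. the
# log-probs does not vanish: detach() blocks gradient flow only through the second
# copy of per_token_logps, so d(exp(x - sg(x)))/dx = 1 at x = sg(x).
per_token_logps = torch.tensor([[-1.2, -0.7, -2.3]], requires_grad=True)
advantages = torch.tensor([0.5])

ratio = torch.exp(per_token_logps - per_token_logps.detach())   # exactly 1 everywhere
loss = -(ratio * advantages.unsqueeze(1)).mean()

print(loss.item())           # -0.5: just the (negated) advantage, independent of the model
loss.backward()
print(per_token_logps.grad)  # non-zero: -advantage / num_tokens for every token
```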

@qgallouedec (Member) commented

@XZ-X is right: the model doesn't learn thanks to GPU error. It learns because, even though the loss is zero, its gradient isn't zero.

@zhangsheng377 (Author) commented

Well, yes, the gradient is not zero when the loss is 0. So in that case, if I want to speed up the startup phase, I just need to multiply the advantages by a factor? And to avoid affecting the actual training, would I need to limit the number of steps where that factor takes effect?
Wait, can I just turn off the warmup?
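Purely as an illustration of the "multiply the advantages by a factor for the first steps" idea raised above (the names boost and boost_steps are hypothetical, not anything in TRL):

```python
import torch

def boost_advantages(advantages: torch.Tensor, global_step: int,
                     boost: float = 10.0, boost_steps: int = 15) -> torch.Tensor:
    """Hypothetical sketch: scale the standardized advantages during the first few steps only.

    Scaling the advantages scales the policy gradient by the same factor, while the
    step-1 loss *value* stays 0 (the scaled advantages still average to zero).
    """
    if global_step < boost_steps:
        return advantages * boost
    return advantages
```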

@qgallouedec (Member) commented

Are the proposed changes backed up by a study and results that describe and demonstrate the added value of the associated trick?
