[GRPO] Reduce steps where loss starts to remain at 0, accelerate training #2869
+4 −0
#2703
Because the advantages are standardized, they sum to 0 in the loss calculation, so the loss reduces to loss = β·KL. In the first step the actor model and the ref model are identical, so KL = 0, which means the first loss must be exactly 0. We believe the main reason GRPO can still start training from an initial loss of 0, apart from the gradients not being zero, is GPU floating-point error.
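A minimal sketch of this mechanism (simplified per-sequence, not the actual trainer code; the β value and shapes are illustrative):

```python
import torch

# Rewards for one group of sampled completions (illustrative values).
rewards = torch.tensor([1.0, 0.2, 0.7, 0.4])

# GRPO-style standardization within the group.
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
print(advantages.sum())  # ~0, up to floating-point error

# First step: actor == ref, so the probability ratio is 1 and KL is 0.
ratio = torch.ones_like(advantages)   # pi_theta / pi_old = 1
kl = torch.zeros_like(advantages)     # KL(pi_theta || pi_ref) = 0
beta = 0.04                           # illustrative KL coefficient

loss = -(ratio * advantages).mean() + beta * kl.mean()
print(loss)  # 0, up to GPU floating-point error
```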
Our acceleration approach is therefore to add a small perturbation to the loss, amplifying the model changes that in the first few steps would otherwise come only from computational error. Once training proper begins, the added perturbation is too small to matter and is effectively ignored; it can also be viewed as a mild noise injection that may improve the model's generalization.

Specifically, we add the perturbation after the rewards are standardized, because the root cause of the zero first-step loss is that the standardized rewards sum to exactly 0.
With the perturbation, the loss receives additive noise with mean 0 and variance 1e-4 (i.e. standard deviation 1e-2). We observed that the initial KL is larger than 1e-4, so after the first few accelerated steps the perturbation at the 1e-4 scale quickly becomes negligible.
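A rough sketch of where such a perturbation could sit in the advantage computation, assuming a zero-mean Gaussian perturbation (the function and variable names are illustrative, not the exact diff in this PR):

```python
import torch

def compute_advantages(rewards: torch.Tensor,
                       perturb_std: float = 1e-2) -> torch.Tensor:
    """Standardize rewards within a group, then add a tiny perturbation.

    perturb_std = 1e-2 corresponds to the variance of 1e-4 described above.
    """
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Zero-mean noise keeps the expected advantage unchanged but breaks
    # the exact cancellation that makes the first-step loss 0.
    advantages = advantages + torch.randn_like(advantages) * perturb_std
    return advantages
```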
Verification result:
With the above improvement, the loss starts moving away from 0 after about 15 training steps;
without it, more than 100 steps are needed to reach the same level.