[GRPO] Reduce steps where loss starts to remain at 0, accelerate training #2869
+4 −0
#2703
Because the advantages are standardized, they sum to 0 in the loss calculation, so the loss reduces to loss = β·KL. In the first step the actor model and the ref model are identical, so KL = 0, which means the first loss must be exactly 0. We believe the main reason GRPO can still start training from an initial loss of 0, apart from the gradients not being zero, is GPU floating-point error.
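A minimal sketch of this mechanism (simplified per-sequence, not the actual trainer code; the β value and shapes are illustrative):

```python
import torch

# Rewards for one group of sampled completions (illustrative values).
rewards = torch.tensor([1.0, 0.2, 0.7, 0.4])

# GRPO-style standardization within the group.
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
print(advantages.sum())  # ~0, up to floating-point error

# First step: actor == ref, so the probability ratio is 1 and KL is 0.
ratio = torch.ones_like(advantages)   # pi_theta / pi_old = 1
kl = torch.zeros_like(advantages)     # KL(pi_theta || pi_ref) = 0
beta = 0.04                           # illustrative KL coefficient

loss = -(ratio * advantages).mean() + beta * kl.mean()
print(loss)  # 0, up to GPU floating-point error
```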
Our acceleration approach is therefore to add a small perturbation to the loss, amplifying the model changes that in the first few steps would otherwise come only from computational error. Once training proper begins, the added perturbation is too small to matter and is effectively ignored; it can also be viewed as a mild noise injection that may improve the model's generalization.

Specifically, we add the perturbation after the rewards are standardized, because the root cause of the zero first-step loss is that the standardized rewards sum to exactly 0.
With the perturbation, the loss receives additive noise with mean 0 and variance 1e-4 (i.e. standard deviation 1e-2). We observed that the initial KL is larger than 1e-4, so after the first few accelerated steps the perturbation at the 1e-4 scale quickly becomes negligible.
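A rough sketch of where such a perturbation could sit in the advantage computation, assuming a zero-mean Gaussian perturbation (the function and variable names are illustrative, not the exact diff in this PR):

```python
import torch

def compute_advantages(rewards: torch.Tensor,
                       perturb_std: float = 1e-2) -> torch.Tensor:
    """Standardize rewards within a group, then add a tiny perturbation.

    perturb_std = 1e-2 corresponds to the variance of 1e-4 described above.
    """
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Zero-mean noise keeps the expected advantage unchanged but breaks
    # the exact cancellation that makes the first-step loss 0.
    advantages = advantages + torch.randn_like(advantages) * perturb_std
    return advantages
```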
Verification result:
With the above improvement, the loss starts moving away from 0 after about 15 training steps;
without it, more than 100 steps are needed to reach the same level.