(fix): fix values response mask in dp critic. #50

Merged
1 commit merged into volcengine:main on Dec 16, 2024

Conversation

@PanAndy (Contributor) commented Dec 15, 2024

The values need to be multiplied by the response mask; otherwise, the subsequent advantage (adv) computation will be incorrect. This is similar to the approach used in megatron_critic:
https://github.com/PanAndy/verl/blob/681dc00b78b90cda4288535ca9a6c1ec7a4b7c98/verl/trainer/ppo/critic/megatron_critic.py#L92
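For reference, the change amounts to multiplying the critic's value predictions by the response mask right after they are computed. A minimal sketch, assuming illustrative tensor names and shapes (this is not the exact code in dp_critic.py):

import torch

# Illustrative tensors only, not the actual dp_critic.py code:
values = torch.tensor([[0.5, 0.4, 0.3, 0.2]])         # per-token value predictions
response_mask = torch.tensor([[1.0, 1.0, 0.0, 0.0]])  # 0 on padding after the EOS token

values = values * response_mask  # zero out predictions on masked positions before GAE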

@PeterSH6 (Collaborator)

Hi @PanAndy, thanks for your contribution!

We did mask out the values, but only after the advantages were computed (see compute_gae_advantage_return).
I think our implementation is mathematically equivalent to yours (applying the mask after the value computation).
Do you see any difference in training after your commit?

@PanAndy (Contributor, Author) commented Dec 16, 2024

Thanks for your reply. However, applying the mask directly to the values (values = values * eos_mask) before the advantage computation is not the same as applying it afterwards. The key issue is the order of computation: advantages are accumulated in reverse, from the end of the sequence toward the beginning, so any non-zero value predictions between the eos_token and gen_len leak into the accumulation for the valid tokens if they are not masked out first.

If we only apply advantages = verl_F.masked_whiten(advantages, eos_mask) without first masking the values, the computed advantages still carry contributions from those unmasked values, which makes the advantage estimates inconsistent.

To keep the advantage calculation mathematically consistent, we should do what megatron_critic does and mask out the values before computing the advantages:

import torch
# verl_F refers to verl.utils.torch_functional, which provides masked_whiten.
import verl.utils.torch_functional as verl_F


def compute_gae_advantage_return(token_level_rewards: torch.Tensor, values: torch.Tensor, eos_mask: torch.Tensor,
                                 gamma: torch.Tensor, lam: torch.Tensor):
    with torch.no_grad():
        lastgaelam = 0
        advantages_reversed = []
        gen_len = token_level_rewards.shape[-1]

        # TODO: Apply the mask to values here to ensure consistency,
        # as is done in megatron_critic:
        # values = values * eos_mask

        # GAE: accumulate the TD residuals in reverse, from the last token to the first.
        for t in reversed(range(gen_len)):
            nextvalues = values[:, t + 1] if t < gen_len - 1 else 0.0
            delta = token_level_rewards[:, t] + gamma * nextvalues - values[:, t]
            lastgaelam = delta + gamma * lam * lastgaelam
            advantages_reversed.append(lastgaelam)
        advantages = torch.stack(advantages_reversed[::-1], dim=1)

        returns = advantages + values
        # Whitening uses only the unmasked (response) positions.
        advantages = verl_F.masked_whiten(advantages, eos_mask)
    return advantages, returns

This adjustment ensures that the advantages are calculated based solely on the relevant values, leading to more accurate training outcomes.
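To make the difference concrete, here is a small self-contained numeric check (not part of the PR; the gae helper, the gamma/lam choice, and the toy tensors are arbitrary illustrative assumptions). It runs the same reverse accumulation once with the values masked first and once with the mask applied only afterwards:

import torch

def gae(rewards, values, mask, gamma=1.0, lam=0.95, mask_values_first=False):
    """Toy GAE mirroring the loop above, to compare the two masking orders."""
    if mask_values_first:
        values = values * mask
    gen_len = rewards.shape[-1]
    lastgaelam, advantages_reversed = 0.0, []
    for t in reversed(range(gen_len)):
        nextvalues = values[:, t + 1] if t < gen_len - 1 else 0.0
        delta = rewards[:, t] + gamma * nextvalues - values[:, t]
        lastgaelam = delta + gamma * lam * lastgaelam
        advantages_reversed.append(lastgaelam)
    return torch.stack(advantages_reversed[::-1], dim=1)

# One sequence of length 4; the last two positions are padding (mask == 0),
# but the critic still predicts non-zero values there.
rewards = torch.tensor([[0.0, 1.0, 0.0, 0.0]])
values  = torch.tensor([[0.5, 0.4, 0.3, 0.2]])
mask    = torch.tensor([[1.0, 1.0, 0.0, 0.0]])

adv_late  = gae(rewards, values, mask) * mask                          # mask applied only afterwards
adv_early = gae(rewards, values, mask, mask_values_first=True) * mask  # values masked first
print(adv_late)   # approx. [[0.4933, 0.6245, 0.0000, 0.0000]]
print(adv_early)  # approx. [[0.4700, 0.6000, 0.0000, 0.0000]]

In this toy example (lam = 0.95), the two orderings disagree on the valid positions, which is exactly the inconsistency described above.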

@PanAndy (Contributor, Author) commented Dec 16, 2024

@PeterSH6

@PeterSH6 (Collaborator)
LGTM!
We have re-examined this issue. You're right that not masking out the values in advance is not mathematically equivalent (although it could yield a similar policy gradient direction).
It's quirky that we implemented this in megatron_critic.py but forgot it in dp_critic.py...
Thanks for your detailed examination; I'll approve it to run CI first.

@PeterSH6 merged commit c7534db into volcengine:main on Dec 16, 2024
2 checks passed