(fix): fix values response mask in dp critic. #50

Merged
1 commit merged into volcengine:main on Dec 16, 2024

Conversation

@PanAndy (Contributor) commented Dec 15, 2024

The values need to be multiplied by the response mask; otherwise, the subsequent advantage (adv) computation will be incorrect. This is similar to the approach used in megatron_critic:
https://github.com/PanAndy/verl/blob/681dc00b78b90cda4288535ca9a6c1ec7a4b7c98/verl/trainer/ppo/critic/megatron_critic.py#L92
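For reference, the change amounts to multiplying the critic's value predictions by the response mask right after they are computed. A minimal sketch, assuming illustrative tensor names and shapes (this is not the exact code in dp_critic.py):

import torch

# Illustrative tensors only, not the actual dp_critic.py code:
values = torch.tensor([[0.5, 0.4, 0.3, 0.2]])         # per-token value predictions
response_mask = torch.tensor([[1.0, 1.0, 0.0, 0.0]])  # 0 on padding after the EOS token

values = values * response_mask  # zero out predictions on masked positions before GAE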

@PeterSH6 (Collaborator)

Hi @PanAndy, thanks for your contribution!

We did mask out the values, but only after the advantages were computed (see compute_gae_advantage_return).
I think our implementation is mathematically equivalent to yours (applying the mask after the value computation).
Do you see any difference in training after your commit?

@PanAndy (Contributor, Author) commented Dec 16, 2024

Thanks for your reply. However, applying the mask directly to the values (values = values * eos_mask) before the advantage computation is not the same as applying it afterwards. The key issue is the order of computation: advantages are accumulated in reverse, from the end of the sequence toward the beginning, so any non-zero value predictions between the eos_token and gen_len leak into the accumulation for the valid tokens if they are not masked out first.

If we only apply advantages = verl_F.masked_whiten(advantages, eos_mask) without first masking the values, the computed advantages still carry contributions from those unmasked values, which makes the advantage estimates inconsistent.

To keep the advantage calculation mathematically consistent, we should do what megatron_critic does and mask out the values before computing the advantages:

import torch
# verl_F refers to verl.utils.torch_functional, which provides masked_whiten.
import verl.utils.torch_functional as verl_F


def compute_gae_advantage_return(token_level_rewards: torch.Tensor, values: torch.Tensor, eos_mask: torch.Tensor,
                                 gamma: torch.Tensor, lam: torch.Tensor):
    with torch.no_grad():
        lastgaelam = 0
        advantages_reversed = []
        gen_len = token_level_rewards.shape[-1]

        # TODO: Apply the mask to values here to ensure consistency,
        # as is done in megatron_critic:
        # values = values * eos_mask

        # GAE: accumulate the TD residuals in reverse, from the last token to the first.
        for t in reversed(range(gen_len)):
            nextvalues = values[:, t + 1] if t < gen_len - 1 else 0.0
            delta = token_level_rewards[:, t] + gamma * nextvalues - values[:, t]
            lastgaelam = delta + gamma * lam * lastgaelam
            advantages_reversed.append(lastgaelam)
        advantages = torch.stack(advantages_reversed[::-1], dim=1)

        returns = advantages + values
        # Whitening uses only the unmasked (response) positions.
        advantages = verl_F.masked_whiten(advantages, eos_mask)
    return advantages, returns

This adjustment ensures that the advantages are calculated based solely on the relevant values, leading to more accurate training outcomes.
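To make the difference concrete, here is a small self-contained numeric check (not part of the PR; the gae helper, the gamma/lam choice, and the toy tensors are arbitrary illustrative assumptions). It runs the same reverse accumulation once with the values masked first and once with the mask applied only afterwards:

import torch

def gae(rewards, values, mask, gamma=1.0, lam=0.95, mask_values_first=False):
    """Toy GAE mirroring the loop above, to compare the two masking orders."""
    if mask_values_first:
        values = values * mask
    gen_len = rewards.shape[-1]
    lastgaelam, advantages_reversed = 0.0, []
    for t in reversed(range(gen_len)):
        nextvalues = values[:, t + 1] if t < gen_len - 1 else 0.0
        delta = rewards[:, t] + gamma * nextvalues - values[:, t]
        lastgaelam = delta + gamma * lam * lastgaelam
        advantages_reversed.append(lastgaelam)
    return torch.stack(advantages_reversed[::-1], dim=1)

# One sequence of length 4; the last two positions are padding (mask == 0),
# but the critic still predicts non-zero values there.
rewards = torch.tensor([[0.0, 1.0, 0.0, 0.0]])
values  = torch.tensor([[0.5, 0.4, 0.3, 0.2]])
mask    = torch.tensor([[1.0, 1.0, 0.0, 0.0]])

adv_late  = gae(rewards, values, mask) * mask                          # mask applied only afterwards
adv_early = gae(rewards, values, mask, mask_values_first=True) * mask  # values masked first
print(adv_late)   # approx. [[0.4933, 0.6245, 0.0000, 0.0000]]
print(adv_early)  # approx. [[0.4700, 0.6000, 0.0000, 0.0000]]

In this toy example (lam = 0.95), the two orderings disagree on the valid positions, which is exactly the inconsistency described above.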

@PanAndy (Contributor, Author) commented Dec 16, 2024

@PeterSH6

@PeterSH6 (Collaborator)
LGTM!
We have re-examined this issue. You're right that not masking out the values in advance is not mathematically equivalent (although it could yield a similar policy gradient direction).
It's quirky that we implemented this in megatron_critic.py but forgot it in dp_critic.py...
Thanks for your detailed examination; I'll approve it to run CI first.

@PeterSH6 merged commit c7534db into volcengine:main on Dec 16, 2024
2 checks passed