First, SAC is designed for continuous action spaces, whereas NLP tasks involve discrete token outputs (though a discrete variant of SAC does exist).
Second, SAC lacks a mechanism that keeps the policy from deviating too far from the initial model. PPO, on the other hand, explicitly limits policy updates, which is crucial in RLHF to maintain alignment and preserve the pretrained model's capabilities. Without such a constraint, SAC could yield a policy that either drifts too far from the original model or stays overly close to it.
Finally, SAC's entropy maximization encourages broad exploration, which may be counterproductive in RLHF. Fine-tuning typically involves domain-specific data, and excessive exploration could lead to unaligned or undesirable behaviors. This mechanism might also cause the model to unlearn pretrained knowledge.
That said, these points are speculative and based on intuition. I'd be eager to see papers or results that either confirm or challenge these hypotheses.
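To make the second and third points concrete, here is a small PyTorch sketch (illustrative only, not TRL code) contrasting the entropy bonus that SAC maximizes with the KL-to-reference penalty commonly used in RLHF objectives; the tensors and coefficients are placeholders.

```python
import torch
import torch.nn.functional as F

# Toy per-token logits from the trained policy and a frozen reference model,
# shape (batch, seq_len, vocab_size); values are random placeholders.
policy_logits = torch.randn(2, 8, 100)
ref_logits = torch.randn(2, 8, 100)

log_pi = F.log_softmax(policy_logits, dim=-1)
pi = log_pi.exp()
ref_log_pi = F.log_softmax(ref_logits, dim=-1)

# SAC-style term: maximize the entropy of the token distribution,
# pushing the policy toward broad exploration.
entropy = -(pi * log_pi).sum(dim=-1).mean()

# RLHF-style term: penalize KL divergence from the frozen reference policy,
# keeping the fine-tuned model close to its pretrained behaviour.
kl_to_ref = (pi * (log_pi - ref_log_pi)).sum(dim=-1).mean()

alpha, beta = 0.01, 0.1  # illustrative coefficients
# SAC adds `+ alpha * entropy` to its objective, whereas RLHF objectives
# typically subtract `beta * kl_to_ref` instead.
```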
The RLOO trainer also lacks PPO's clipping mechanism, which keeps the policy from deviating too far from the previous policy. It turns out that, for RLHF on pretrained language models, that clipping step is not necessary.
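For reference, the clipping step in question is the standard PPO clipped surrogate objective; a generic PyTorch sketch (not TRL's implementation) is below.

```python
import torch

def ppo_clipped_loss(logp_new, logp_old, advantages, clip_range=0.2):
    """Standard PPO clipped surrogate loss (negated for gradient descent)."""
    ratio = torch.exp(logp_new - logp_old)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantages
    # Taking the elementwise minimum removes any incentive to push the ratio
    # outside [1 - clip_range, 1 + clip_range]; a REINFORCE-style update such
    # as RLOO's simply uses -(logp_new * advantages).mean() with no such bound.
    return -torch.min(unclipped, clipped).mean()
```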
If you are referring to the reference policy, I don't see why a KL divergence term against a reference policy couldn't be added to the SAC loss function.
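A minimal sketch of how that could look for a discrete SAC actor loss (the function, names, and coefficients below are hypothetical, not an existing TRL or SAC API):

```python
import torch
import torch.nn.functional as F

def sac_actor_loss_with_ref_kl(policy_logits, q_values, ref_logits,
                               alpha=0.01, beta=0.1):
    """Discrete SAC actor loss plus a KL penalty to a frozen reference policy.

    policy_logits, ref_logits: (batch, vocab) token logits
    q_values:                  (batch, vocab) critic estimates per token
    alpha: SAC entropy temperature; beta: weight of the reference-KL penalty
    """
    log_pi = F.log_softmax(policy_logits, dim=-1)
    pi = log_pi.exp()
    ref_log_pi = F.log_softmax(ref_logits, dim=-1)

    # Standard discrete SAC actor objective: E_pi[alpha * log_pi - Q]
    sac_term = (pi * (alpha * log_pi - q_values)).sum(dim=-1)

    # Added term keeping the policy close to the reference model, analogous
    # to the KL penalty in PPO-based RLHF objectives.
    kl_to_ref = (pi * (log_pi - ref_log_pi)).sum(dim=-1)

    return (sac_term + beta * kl_to_ref).mean()
```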
Mode collapse and loss of diversity are common problems for aligned models, so if SAC makes a different trade-off by encouraging exploration, that could be useful.
Method description
SAC is often competitive with or superior to PPO. Why isn't it offered as a trl trainer?
Open source status
Provide useful links for the implementation
No response