Why isn't Soft Actor-Critic (SAC) Available for RLHF? #2465

AMindToThink opened this issue Dec 11, 2024 · 3 comments

@AMindToThink

Method description

SAC is often competitive with or superior to PPO. Why isn't it offered as a TRL trainer?

Open source status

  • The method implementation is available
  • The model weights are available
  • The training datasets are available

Provide useful links for the implementation

No response

@qgallouedec
Member

First, SAC is designed for continuous action spaces, whereas NLP tasks involve discrete token outputs. (A discrete variant of SAC exists though.)
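
For concreteness, here is a rough sketch of what such a discrete-variant actor loss could look like over a token vocabulary. This is purely illustrative; the function and tensor names are assumptions, not TRL APIs.

```python
# Illustrative only: discrete-SAC-style actor loss over a token vocabulary.
# `logits` come from the policy LM head, `q_values` from a hypothetical soft
# Q-network; both have shape (batch, vocab_size).
import torch
import torch.nn.functional as F

def discrete_sac_actor_loss(logits, q_values, alpha=0.2):
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    # E_{a ~ pi}[alpha * log pi(a|s) - Q(s, a)]: minimizing this pushes mass
    # toward high-Q tokens, while the alpha (entropy) term keeps the
    # distribution broad.
    return (probs * (alpha * log_probs - q_values)).sum(dim=-1).mean()
```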

Second, SAC lacks a mechanism to constrain the policy from deviating too far from the initial model. PPO, on the other hand, explicitly limits policy updates, which is crucial in RLHF to maintain alignment and preserve the pretrained model’s capabilities. Without such constraints, SAC could result in a policy that either drifts excessively or remains overly similar to the original model.
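
For comparison, PPO's constraint comes from its clipped surrogate objective. A minimal sketch with assumed tensor names (not TRL's actual trainer code):

```python
# Illustrative only: PPO's clipped surrogate objective.
# logprobs_new, logprobs_old and advantages are per-token tensors of the
# same shape.
import torch

def ppo_clipped_loss(logprobs_new, logprobs_old, advantages, clip_range=0.2):
    ratio = torch.exp(logprobs_new - logprobs_old)  # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantages
    # Pessimistic minimum: moving the ratio outside [1 - eps, 1 + eps] yields
    # no extra gain, which bounds each policy update.
    return -torch.min(unclipped, clipped).mean()
```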

Finally, SAC's entropy maximization encourages broad exploration, which may be counterproductive in RLHF. Finetuning typically involves domain-specific data, and excessive exploration could lead to unaligned or undesirable behaviors. This mechanism might inadvertently encourage unlearning of pretrained knowledge.

That said, these points are speculative and based on intuition. I'd be eager to see papers or results that either confirm or challenge these hypotheses.

@qgallouedec qgallouedec self-assigned this Dec 12, 2024
@qgallouedec qgallouedec added the ❓ question Seeking clarification or more information label Dec 12, 2024
@AMindToThink
Author

Thank you for the response.

The RLOO trainer also lacks PPO's clipping mechanism that constrains the policy from deviating too far from the previous policy. It turns out that for RLHF on pretrained language models, that clipping step is not necessary.
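
For the record, here is a rough sketch of the leave-one-out baseline that RLOO is built on (assumed names, not TRL's implementation); note that nothing in it clips the policy update:

```python
# Illustrative only: REINFORCE leave-one-out (RLOO) advantages for one prompt.
# `rewards` holds the scalar rewards of k sampled completions.
import torch

def rloo_advantages(rewards):
    k = rewards.shape[0]
    # Baseline for each sample is the mean reward of the other k - 1 samples.
    baseline = (rewards.sum() - rewards) / (k - 1)
    return rewards - baseline
```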

If you are referring to the reference policy, I don't see why a KL divergence term with a reference policy cannot be included into the SAC loss function.
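
For example, something like the following hypothetical sketch (made-up names, not an existing TRL trainer) would add a per-state KL(pi || pi_ref) penalty to a discrete-SAC actor loss:

```python
# Hypothetical sketch: discrete-SAC actor loss plus a KL penalty against a
# frozen reference policy, analogous to the KL term used in RLHF. Not TRL code.
import torch
import torch.nn.functional as F

def sac_actor_loss_with_kl(logits, ref_logits, q_values, alpha=0.2, beta=0.1):
    log_probs = F.log_softmax(logits, dim=-1)
    ref_log_probs = F.log_softmax(ref_logits, dim=-1)
    probs = log_probs.exp()
    # Standard discrete-SAC term: E_pi[alpha * log pi - Q].
    sac_term = (probs * (alpha * log_probs - q_values)).sum(dim=-1)
    # KL(pi || pi_ref) per state, penalizing drift from the reference model.
    kl_term = (probs * (log_probs - ref_log_probs)).sum(dim=-1)
    return (sac_term + beta * kl_term).mean()
```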

Mode collapse and loss of variety are common problems for aligned models, so if SAC makes a different tradeoff, encouraging exploration, then that could be useful.

@qgallouedec
Member

> lacks PPO's clipping mechanism that constrains the policy from deviating too far from the previous policy

There is a KL term though

> I don't see why a KL divergence term with a reference policy cannot be included into the SAC loss function.

I guess you can; it's just that, in its classic formulation, the SAC objective doesn't contain such a term.
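
For reference, the classic SAC objective (Haarnoja et al., 2018) only maximizes entropy-regularized return, with nothing tying the policy to a reference model:

$$J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[ r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]$$

So a KL-to-reference term would be an RLHF-specific addition rather than part of SAC itself.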

> encouraging exploration, then that could be useful

I'm not so sure about that. But if you manage to produce or find results that shed more light on this, please share them.
