First, SAC is designed for continuous action spaces, whereas NLP tasks involve discrete token outputs (though a discrete variant of SAC does exist).
Second, SAC lacks a mechanism that keeps the policy from deviating too far from the initial model. PPO, on the other hand, explicitly limits policy updates, which is crucial in RLHF to maintain alignment and preserve the pretrained model's capabilities. Without such a constraint, SAC could yield a policy that either drifts too far from the original model or stays overly close to it.
Finally, SAC's entropy maximization encourages broad exploration, which may be counterproductive in RLHF. Fine-tuning typically involves domain-specific data, and excessive exploration could lead to unaligned or undesirable behaviors. This mechanism might also cause the model to unlearn pretrained knowledge.
That said, these points are speculative and based on intuition. I'd be eager to see papers or results that either confirm or challenge these hypotheses.
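To make the second and third points concrete, here is a small PyTorch sketch (illustrative only, not TRL code) contrasting the entropy bonus that SAC maximizes with the KL-to-reference penalty commonly used in RLHF objectives; the tensors and coefficients are placeholders.

```python
import torch
import torch.nn.functional as F

# Toy per-token logits from the trained policy and a frozen reference model,
# shape (batch, seq_len, vocab_size); values are random placeholders.
policy_logits = torch.randn(2, 8, 100)
ref_logits = torch.randn(2, 8, 100)

log_pi = F.log_softmax(policy_logits, dim=-1)
pi = log_pi.exp()
ref_log_pi = F.log_softmax(ref_logits, dim=-1)

# SAC-style term: maximize the entropy of the token distribution,
# pushing the policy toward broad exploration.
entropy = -(pi * log_pi).sum(dim=-1).mean()

# RLHF-style term: penalize KL divergence from the frozen reference policy,
# keeping the fine-tuned model close to its pretrained behaviour.
kl_to_ref = (pi * (log_pi - ref_log_pi)).sum(dim=-1).mean()

alpha, beta = 0.01, 0.1  # illustrative coefficients
# SAC adds `+ alpha * entropy` to its objective, whereas RLHF objectives
# typically subtract `beta * kl_to_ref` instead.
```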
The RLOO trainer also lacks PPO's clipping mechanism, which keeps the policy from deviating too far from the previous policy. It turns out that, for RLHF on pretrained language models, that clipping step is not necessary.
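For reference, the clipping step in question is the standard PPO clipped surrogate objective; a generic PyTorch sketch (not TRL's implementation) is below.

```python
import torch

def ppo_clipped_loss(logp_new, logp_old, advantages, clip_range=0.2):
    """Standard PPO clipped surrogate loss (negated for gradient descent)."""
    ratio = torch.exp(logp_new - logp_old)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantages
    # Taking the elementwise minimum removes any incentive to push the ratio
    # outside [1 - clip_range, 1 + clip_range]; a REINFORCE-style update such
    # as RLOO's simply uses -(logp_new * advantages).mean() with no such bound.
    return -torch.min(unclipped, clipped).mean()
```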
If you are referring to the reference policy, I don't see why a KL divergence term against a reference policy couldn't be added to the SAC loss function.
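A minimal sketch of how that could look for a discrete SAC actor loss (the function, names, and coefficients below are hypothetical, not an existing TRL or SAC API):

```python
import torch
import torch.nn.functional as F

def sac_actor_loss_with_ref_kl(policy_logits, q_values, ref_logits,
                               alpha=0.01, beta=0.1):
    """Discrete SAC actor loss plus a KL penalty to a frozen reference policy.

    policy_logits, ref_logits: (batch, vocab) token logits
    q_values:                  (batch, vocab) critic estimates per token
    alpha: SAC entropy temperature; beta: weight of the reference-KL penalty
    """
    log_pi = F.log_softmax(policy_logits, dim=-1)
    pi = log_pi.exp()
    ref_log_pi = F.log_softmax(ref_logits, dim=-1)

    # Standard discrete SAC actor objective: E_pi[alpha * log_pi - Q]
    sac_term = (pi * (alpha * log_pi - q_values)).sum(dim=-1)

    # Added term keeping the policy close to the reference model, analogous
    # to the KL penalty in PPO-based RLHF objectives.
    kl_to_ref = (pi * (log_pi - ref_log_pi)).sum(dim=-1)

    return (sac_term + beta * kl_to_ref).mean()
```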
Mode collapse and loss of diversity are common problems for aligned models, so if SAC makes a different trade-off by encouraging exploration, that could be useful.
Method description
SAC is often competitive with or superior to PPO. Why isn't it offered as a trl trainer?
Open source status
Provide useful links for the implementation
No response