This repository contains the code for our NeurIPS 2024 poster paper: **Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer**.
In this study, we explore two approaches to incorporating the supervised fine-tuning (SFT) loss as a regularizer:
- Cumulative SFT Loss: This method sums the per-token SFT loss over all unmasked tokens in the chosen response.
- Average SFT Loss: This method averages the per-token SFT loss across all unmasked tokens in the chosen response.
We implement the cumulative loss on top of the Alignment Handbook codebase and the average loss on top of the OpenRLHF codebase.
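To make the distinction between the two variants concrete, here is a minimal PyTorch sketch. The function name, tensor layout, and mask convention are illustrative assumptions for exposition, not the actual API of either codebase:

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor,
             labels: torch.Tensor,
             mask: torch.Tensor,
             reduction: str = "sum") -> torch.Tensor:
    """Per-sequence SFT loss on the chosen response (illustrative sketch).

    logits: (batch, seq_len, vocab)  model outputs
    labels: (batch, seq_len)         target token ids
    mask:   (batch, seq_len)         1 for response tokens, 0 for prompt/padding
    reduction: "sum" -> cumulative SFT loss, "mean" -> average SFT loss
    """
    # Token-level negative log-likelihood, with no reduction yet.
    nll = F.cross_entropy(
        logits.transpose(1, 2),  # (batch, vocab, seq_len), as cross_entropy expects
        labels,
        reduction="none",
    )
    masked_nll = nll * mask  # zero out prompt and padding positions

    if reduction == "sum":
        # Cumulative variant: total NLL over unmasked tokens.
        return masked_nll.sum(dim=-1)
    # Average variant: NLL per unmasked token; clamp avoids division by zero.
    return masked_nll.sum(dim=-1) / mask.sum(dim=-1).clamp(min=1)
```

The only difference between the two regularizers is the final reduction: the cumulative variant scales with response length, while the average variant is length-normalized.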