Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment (SAPO)

[Teaser figure]

This repository provides the official PyTorch implementation for the following paper:

Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment [arXiv]
Yueqin Yin*, Zhendong Wang*, Yujia Xie, Weizhu Chen, and Mingyuan Zhou
(* denotes equal contribution)
The University of Texas at Austin, Microsoft Azure AI

Abstract: Traditional language model alignment methods, such as Direct Preference Optimization (DPO), are limited by their dependence on static, pre-collected paired preference data, which hampers their adaptability and practical applicability. To overcome this limitation, we introduce Self-Augmented Preference Optimization (SAPO), an effective and scalable training paradigm that does not require existing paired data. Building on the self-play concept, which autonomously generates negative responses, we further incorporate an off-policy learning pipeline to enhance data exploration and exploitation. Specifically, we employ an Exponential Moving Average (EMA) model in conjunction with a replay buffer to enable dynamic updates of response segments, effectively integrating real-time feedback with insights from historical data. Our comprehensive evaluations of the LLaMA3-8B and Mistral-7B models across benchmarks, including the Open LLM Leaderboard, IFEval, AlpacaEval 2.0, and MT-Bench, demonstrate that SAPO matches or surpasses established offline contrastive baselines, such as DPO and Odds Ratio Preference Optimization, and outperforms offline self-play methods like SPIN.
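As described above, SAPO pairs supervised responses (treated as chosen) with self-generated segments (treated as rejected), refreshed off-policy through an EMA copy of the policy and a replay buffer. The sketch below illustrates these two ingredients only; it is not the repository's implementation, and names such as ReplayBuffer, ema_update, and the decay value are assumptions.

    # Illustrative sketch (not the repo's code) of the two off-policy ingredients
    # described in the abstract: an EMA copy of the policy that generates response
    # segments, and a replay buffer mixing fresh and historical preference pairs.
    # All names and the decay value are hypothetical.
    import random
    import torch

    class ReplayBuffer:
        """Fixed-size buffer of (prompt, chosen, rejected) preference triples."""
        def __init__(self, capacity=10_000):
            self.capacity = capacity
            self.data = []

        def push(self, triple):
            if len(self.data) >= self.capacity:
                self.data.pop(0)  # drop the oldest entry first
            self.data.append(triple)

        def sample(self, batch_size):
            return random.sample(self.data, min(batch_size, len(self.data)))

    @torch.no_grad()
    def ema_update(ema_model, policy_model, decay=0.992):
        """Move the EMA model's weights toward the current policy's weights."""
        for ema_p, p in zip(ema_model.parameters(), policy_model.parameters()):
            ema_p.mul_(decay).add_(p, alpha=1.0 - decay)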

Installation

  1. Clone this repo:
    git clone https://github.com/yinyueqin/SAPO.git
    cd SAPO
  2. Install dependencies: a suitable Anaconda environment named sapo can be created and activated with:
    conda create -n sapo python=3.10
    conda activate sapo
    python -m pip install .
    python -m pip install flash-attn --no-build-isolation
    

Training

For detailed training commands, please refer to scripts/sapo_dpo.sh and scripts/sapo_orpo.sh.
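The two scripts presumably correspond to DPO- and ORPO-style contrastive losses applied to each self-augmented (chosen, rejected) pair. As a point of reference, the snippet below is a minimal sketch of the standard DPO objective, not the repository's exact implementation; the function name, the beta value, and the assumption that the EMA model acts as the reference model are illustrative.

    # Minimal sketch of a standard DPO-style loss over (chosen, rejected) pairs.
    # Inputs are per-example summed log-probabilities of each response under the
    # policy and under a reference model (assumed here to be the EMA model).
    import torch.nn.functional as F

    def dpo_loss(policy_logps_chosen, policy_logps_rejected,
                 ref_logps_chosen, ref_logps_rejected, beta=0.1):
        # Log-ratios of policy to reference for the chosen and rejected responses.
        chosen_ratio = policy_logps_chosen - ref_logps_chosen
        rejected_ratio = policy_logps_rejected - ref_logps_rejected
        # DPO: maximize the margin between chosen and rejected log-ratios.
        return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()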

Citation

If you find this work useful for your research, please consider citing our paper:

@article{yin2024self,
  title={Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment},
  author={Yin, Yueqin and Wang, Zhendong and Xie, Yujia and Chen, Weizhu and Zhou, Mingyuan},
  journal={arXiv preprint arXiv:2405.20830},
  year={2024}
}

Acknowledgement

This repo is built upon SPIN and TRL. We thank the authors for their excellent work.
