This repository provides the code needed to replicate the experiments described in our NeurIPS 2024 paper, Stepwise Alignment for Constrained Language Model Policy Optimization. In these experiments, we used TRL to implement the alignment methods DPO and KTO, and mergekit for model merging. The evaluation question lists `asset/helpful_problem.json` and `asset/safety_problem.json` were sourced from the alpaca_eval dataset and safe-rlhf, respectively.
First, set up a virtual environment and install the required packages. We recommend using Python 3.9 or newer.
```sh
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
```
You'll need to set environment variables for mlflow (optional, for experiment tracking), Amazon S3 (optional, for logging artifacts), and OpenAI (required, for evaluations). Fill in your authentication details in `script/set_envar.sh` and then run:

```sh
sh script/set_envar.sh
```
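Before launching long-running jobs, it can be useful to confirm that the relevant credentials are visible to the process. The snippet below is a small, hypothetical check (not part of the repository); apart from `OPENAI_API_KEY`, the variable names are assumptions based on common mlflow and AWS conventions.

```python
import os

# OPENAI_API_KEY is required for the GPT-4 evaluation step; the others are
# optional and only matter if you use mlflow tracking and S3 artifact logging.
REQUIRED = ["OPENAI_API_KEY"]
OPTIONAL = ["MLFLOW_TRACKING_URI", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"]

missing = [name for name in REQUIRED if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing required environment variables: {missing}")

for name in OPTIONAL:
    if not os.environ.get(name):
        print(f"Note: optional variable {name} is not set.")
```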
Next, prepare the training datasets for DPO and KTO from PKU-Alignment/PKU-SafeRLHF-30K:

```sh
python -m src.util prepare_all_datasets
```
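For intuition, the sketch below shows one way DPO-style preference pairs can be derived from PKU-SafeRLHF-30K. It is an illustration under assumed field names (`prompt`, `response_0`, `response_1`, `better_response_id`), not the repository's actual preprocessing code.

```python
from datasets import load_dataset

# Illustrative only: build (prompt, chosen, rejected) pairs from the helpfulness
# annotations; a safety split could be built analogously from `safer_response_id`.
# The field names are assumptions about the dataset schema.
ds = load_dataset("PKU-Alignment/PKU-SafeRLHF-30K", split="train")

def to_dpo_pair(example):
    better = example["better_response_id"]
    return {
        "prompt": example["prompt"],
        "chosen": example[f"response_{better}"],
        "rejected": example[f"response_{1 - better}"],
    }

dpo_helpful = ds.map(to_dpo_pair, remove_columns=ds.column_names)
print(dpo_helpful[0])
```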
- Align for Helpfulness (DPO):

  ```sh
  sh script/train/pku_helpful\(dpo\)_safety_train.sh helpful_dpo
  ```

- Align for Safety (DPO or KTO):

  ```sh
  sh script/train/pku_helpful\(dpo\)_safety_train.sh safety_dpo
  sh script/train/pku_helpful\(dpo\)_safety_train.sh safety_kto
  ```

Successful training saves the models in `output/30K_helpful_dpo_safety`.
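The training scripts above build on TRL's implementation of DPO. As a rough illustration only, here is a minimal, hypothetical sketch of a DPO alignment step with TRL; the model name, dataset path, and hyperparameters are placeholders, and the exact `DPOTrainer` arguments depend on your TRL version.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Placeholder base model and dataset; the repository's scripts configure these
# via accelerate and the shell arguments shown above.
model_name = "your-base-model"  # hypothetical
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Expects columns "prompt", "chosen", and "rejected".
train_dataset = load_dataset("json", data_files="dpo_helpful.json", split="train")

args = DPOConfig(
    output_dir="output/dpo_sketch",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    beta=0.1,  # strength of the regularization toward the reference policy
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # older TRL versions use `tokenizer=` instead
)
trainer.train()
```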
- Generate Responses for Evaluation:

  ```sh
  sh script/evaluate/pku_helpful\(dpo\)_safety_eval.sh generate
  ```

- Evaluate with GPT-4:

  ```sh
  sh script/evaluate/pku_helpful\(dpo\)_safety_eval.sh evaluate_base
  ```

Generation and evaluation results will be saved in `output/eval`.
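For context, GPT-4-based evaluation of this kind typically asks the model to compare two responses to the same question. The snippet below is a simplified, hypothetical pairwise judge built with the OpenAI Python client; the prompt wording and verdict parsing are illustrative and may differ from what the repository's scripts do.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge(question: str, response_a: str, response_b: str) -> str:
    """Ask GPT-4 which response is better; returns the raw verdict text."""
    prompt = (
        "You are comparing two responses to the same question.\n"
        f"Question: {question}\n\n"
        f"Response A: {response_a}\n\n"
        f"Response B: {response_b}\n\n"
        "Answer with exactly 'A' or 'B' to indicate the better response."
    )
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return completion.choices[0].message.content.strip()

print(judge("How do I stay safe online?", "Use strong, unique passwords...", "Just click everything."))
```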
Follow the same steps as in Experiment 1, replacing `helpful\(dpo\)` with `helpful\(kto\)` in the commands. Successful training saves the models in `output/pku_helpful_kto_safety`, and generation and evaluation results will be saved in `output/eval`.
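KTO training in TRL follows a similar pattern to DPO but consumes unpaired examples labeled as desirable or undesirable. The following is a minimal, hypothetical sketch; the model name, file path, and hyperparameters are placeholders, and the exact `KTOTrainer` arguments depend on your TRL version.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import KTOConfig, KTOTrainer

model_name = "your-base-model"  # hypothetical placeholder
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Expects columns "prompt", "completion", and a boolean "label"
# (True = desirable, False = undesirable).
train_dataset = load_dataset("json", data_files="kto_safety.json", split="train")

args = KTOConfig(
    output_dir="output/kto_sketch",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    desirable_weight=1.0,
    undesirable_weight=1.0,
)

trainer = KTOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # older TRL versions use `tokenizer=` instead
)
trainer.train()
```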
This experiment reverses the order of alignment:

- Align for Safety (DPO):

  ```sh
  sh script/train/pku_safety\(dpo\)_helpful_train.sh safety_dpo
  ```

- Align for Helpfulness (DPO):

  ```sh
  sh script/train/pku_safety\(dpo\)_helpful_train.sh helpful_dpo
  ```

Successful training saves the models in `output/pku_safety_dpo_helpful`.
- Generate Responses for Evaluation:

  ```sh
  sh script/evaluate/pku_safety\(dpo\)_helpful_eval.sh generate
  ```

- Evaluate with GPT-4:

  ```sh
  sh script/evaluate/pku_safety\(dpo\)_helpful_eval.sh evaluate_base
  ```

Generation and evaluation results will be saved in `output/eval`.
- Create Merged Models:

  ```sh
  sh script/merge/pku_helpful\(dpo\)_safety_merge_create.sh
  ```

  The merged models are saved in `pku_helpful_dpo_safety_merge`.

- Generate Responses for Evaluation:

  ```sh
  sh script/evaluate/pku_helpful\(dpo\)_safety_merge_eval.sh generate
  ```

- Evaluate with GPT-4:

  ```sh
  sh script/evaluate/pku_helpful\(dpo\)_safety_merge_eval.sh evaluate_base
  ```

Generation and evaluation results will be saved in `output/eval`.
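The merge step relies on mergekit, which is typically driven by a YAML configuration. The snippet below is a hypothetical illustration of a linear merge between two aligned checkpoints, wrapped in a small Python script that calls the `mergekit-yaml` command line tool; the model paths, weights, and merge method are placeholders rather than the repository's actual configuration.

```python
import subprocess
from pathlib import Path

# Hypothetical linear merge of two aligned checkpoints; mergekit reads a YAML
# config and writes the merged model to the given output directory.
config = """\
models:
  - model: path/to/helpful_aligned_model    # placeholder checkpoint path
    parameters:
      weight: 0.5
  - model: path/to/safety_aligned_model     # placeholder checkpoint path
    parameters:
      weight: 0.5
merge_method: linear
dtype: bfloat16
"""

Path("merge_config.yml").write_text(config)
subprocess.run(["mergekit-yaml", "merge_config.yml", "merged_model"], check=True)
```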
To compute the win rates (from the GPT-4 evaluation above) and plot the main figure in our paper:

```sh
sh script/evaluate/plot_win_rates.sh
```

The summary of win rates and the corresponding plot will be saved in `output/eval`. Finally, you should obtain a plot similar to the one below.
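For a rough sense of what the win-rate summary involves, the sketch below aggregates pairwise GPT-4 verdicts into per-model win rates. The input path and JSON schema are assumptions made for illustration, not the format actually produced by the repository's scripts.

```python
import json
from collections import defaultdict

# Hypothetical input: a list of {"model": ..., "verdict": "win" | "loss" | "tie"}
with open("output/eval/judgments.json") as f:  # assumed path and schema
    judgments = json.load(f)

wins = defaultdict(float)
totals = defaultdict(int)
for record in judgments:
    totals[record["model"]] += 1
    if record["verdict"] == "win":
        wins[record["model"]] += 1.0
    elif record["verdict"] == "tie":
        wins[record["model"]] += 0.5  # count ties as half a win

for model in sorted(totals):
    rate = wins[model] / totals[model]
    print(f"{model}: win rate vs. baseline = {rate:.1%}")
```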
Note:

- To compute the Elo scores, replace `evaluate_base` with `evaluate_full` in the GPT-4 evaluation step. Note that this change causes the evaluation to compare every pair of models in each experiment, which may significantly increase the cost of using the GPT-4 API. After the evaluation is complete, run the following command to obtain the scores and plots (an illustrative Elo-update sketch follows after these notes):

  ```sh
  sh script/evaluate/plot_elo_scores.sh
  ```
- Our scripts assume the experiments will be run on a machine equipped with 8 NVIDIA A100-80G GPUs. If your setup differs, you may need to adjust the accelerate configurations in `config/train` and then modify `per_device_train_batch_size` or `gradient_accumulation_steps`.
- Please ensure all script paths and filenames match your directory structure. If you encounter any issues with the commands, verify that the script names and paths are accurate.
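As mentioned in the note on Elo scores above, here is a minimal, illustrative sketch of how Elo ratings can be derived from pairwise comparison results; the K-factor, initial rating, and match format are arbitrary choices for the example, not the repository's exact procedure.

```python
# Hypothetical Elo update over pairwise match results.
# Each match is (model_a, model_b, score_a), where score_a is 1.0 for a win
# by model_a, 0.0 for a loss, and 0.5 for a tie.
def elo_ratings(matches, k=32.0, initial=1000.0):
    ratings = {}
    for a, b, score_a in matches:
        ra = ratings.setdefault(a, initial)
        rb = ratings.setdefault(b, initial)
        expected_a = 1.0 / (1.0 + 10.0 ** ((rb - ra) / 400.0))
        ratings[a] = ra + k * (score_a - expected_a)
        ratings[b] = rb + k * ((1.0 - score_a) - (1.0 - expected_a))
    return ratings

matches = [("safety_dpo", "helpful_dpo", 1.0), ("helpful_dpo", "safety_kto", 0.5)]
print(elo_ratings(matches))
```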
If SACPO or this repository is useful in your research, please use the following BibTeX entry:
```bibtex
@inproceedings{
  wachi2024stepwise,
  title={Stepwise Alignment for Constrained Language Model Policy Optimization},
  author={Wachi, Akifumi and Tran, Thien Q. and Sato, Rei and Tanabe, Takumi and Akimoto, Youhei},
  booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
  year={2024},
}
```
Additionally, this repository contains third-party software. Refer to NOTICE.md for more details and follow the terms and conditions of their use.