# Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts
This repository provides the official PyTorch implementation for the following paper:
**Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts** [[arXiv](https://arxiv.org/abs/2402.10958)]
Yueqin Yin*, Zhendong Wang*, Yi Gu, Hai Huang, Weizhu Chen and Mingyuan Zhou
(* denotes equal contribution)
The University of Texas at Austin, Microsoft Azure AI, Google
**Abstract:** In the field of large language models (LLMs), aligning models with the diverse preferences of users is a critical challenge. Direct Preference Optimization (DPO) has played a key role in this area. It works by using pairs of preferences derived from the same prompts, and it functions without needing an additional reward model. However, DPO does not fully reflect the complex nature of human learning, which often involves understanding contrasting responses to not only identical but also similar questions. To overcome this shortfall, we propose Relative Preference Optimization (RPO). RPO is designed to discern between more and less preferred responses derived from both identical and related prompts. It introduces a contrastive weighting mechanism, enabling the tuning of LLMs using a broader range of preference data, including both paired and unpaired sets. This approach expands the learning capabilities of the model, allowing it to leverage insights from a more varied set of prompts. Through empirical tests, including dialogue and summarization tasks, and evaluations using the AlpacaEval2.0 leaderboard, RPO has demonstrated a superior ability to align LLMs with user preferences and to improve their adaptability during the training process.
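At its core, RPO generalizes the DPO objective from same-prompt preference pairs to cross-prompt contrasts, re-weighting each contrast by how semantically close the two prompts are. The sketch below only conveys that idea: the function name, argument layout, and the exact cosine-similarity/softmax weighting are illustrative assumptions, not the implementation in `train.py`.

```python
# Minimal sketch of RPO's cross-prompt contrastive weighting (illustrative only;
# not the implementation in this repository).
import torch
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer


def rpo_loss_sketch(chosen_logratios, rejected_logratios,
                    chosen_prompts, rejected_prompts,
                    beta=0.1, distance_temperature=0.5,
                    embedder_name="all-MiniLM-L6-v2"):
    """chosen_logratios / rejected_logratios: (B,) tensors of log pi_theta/pi_ref
    for the preferred / dispreferred responses; *_prompts: lists of B strings."""
    embedder = SentenceTransformer(embedder_name)
    e_c = embedder.encode(chosen_prompts, convert_to_tensor=True)    # (B, d)
    e_r = embedder.encode(rejected_prompts, convert_to_tensor=True)  # (B, d)

    # Cosine similarity between every chosen-prompt / rejected-prompt pair,
    # turned into contrast weights by a temperature-scaled softmax: contrasts
    # between responses to closely related prompts contribute more.
    sim = F.normalize(e_c, dim=-1) @ F.normalize(e_r, dim=-1).T      # (B, B)
    weights = torch.softmax(sim / distance_temperature, dim=-1)
    weights = weights.to(chosen_logratios.device)

    # DPO-style margin for every (chosen i, rejected j) cross pair.
    margins = beta * (chosen_logratios[:, None] - rejected_logratios[None, :])
    losses = -F.logsigmoid(margins)                                  # (B, B)
    return (weights * losses).sum(dim=-1).mean()
```

The `++loss.distance_temperature` and `++loss.sentence_transformer_name_or_path` overrides in the training commands below correspond to the `distance_temperature` and `embedder_name` knobs in this sketch.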
## Installation

- Clone this repo:

  ```bash
  git clone https://github.com/yinyueqin/Relative-Preference-Optimization.git
  cd Relative-Preference-Optimization
  ```
- Install dependent packages. A suitable Anaconda environment named `rpo` can be created and activated with the commands below (an optional sanity check follows this list):

  ```bash
  conda env create -f environment.yaml
  conda activate rpo
  ```
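Optionally, verify that the new environment resolves the dependencies used later in this README; the package list below is inferred from the commands that follow, not read from `environment.yaml`:

```python
# Optional sanity check; the packages below are inferred from the commands in
# this README (PyTorch training, sentence-transformers for prompt embeddings).
import torch
from sentence_transformers import SentenceTransformer

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
SentenceTransformer("all-MiniLM-L6-v2")  # embedder referenced by the RPO training command
```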
## Training

Refer to the `scripts/train.sh` file, or use the following example commands:
```bash
# SFT Stage
python train.py loss=sft model=mistral7b datasets='[hh]' exp_name=sft_hh_mistral_7b mode=train ++cache_dir=.cache/

# RPO Stage
python train.py loss=rpo-paired model=mistral7b datasets='[hh]' exp_name=rpo-paired_mistral_7b_hh_MiniLM_0.5 mode=train ++cache_dir=.cache/ ++model.load_from=.cache/root/sft_hh_mistral_7b_2024-01-24_04-09-19_840154/LATEST/policy.pt ++loss.distance_temperature=0.5 ++loss.sentence_transformer_name_or_path=all-MiniLM-L6-v2
```
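The `++loss.distance_temperature` override controls how sharply prompt similarity concentrates the contrast weights. A toy illustration with made-up similarity values:

```python
# Toy illustration of how distance_temperature shapes the contrast weights.
# The cosine similarities below are made up.
import torch

similarities = torch.tensor([0.9, 0.6, 0.2])  # similarity of one chosen prompt to three rejected-response prompts
for temperature in (0.25, 0.5, 1.0):
    weights = torch.softmax(similarities / temperature, dim=-1)
    print(f"temperature={temperature}: {[round(w, 3) for w in weights.tolist()]}")
```

Lower temperatures put most of the weight on the most similar prompts; higher values spread it more evenly across the batch.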
## Sampling

Refer to the `scripts/sample.sh` file, or use the following example command:
```bash
python eval.py --config-path=/root/rpo-paired_llama2_7b_hh_all-MiniLM-L6-v2_0.25_2024-01-25_02-26-53_155161 ++mode=sample ++n_samples=256 ++model.eval_batch_size=32 ++samples_dir=examples/
```
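To spot-check the generated samples before running the GPT-4 comparison, something like the snippet below works; the file layout is an assumption based on the `-bk chosen -ck policy` flags that `compare.py` takes in the next section, so adjust the path and keys to whatever `eval.py` actually writes.

```python
# Peek at a generated samples file. The exact layout is not documented here;
# the prompt -> {"chosen": ..., "policy": [...]} structure is an assumption
# based on compare.py's -bk/-ck flags.
import json

path = "samples/rpo-paired_llama2_7b_hh_all-MiniLM-L6-v2_0.25_2024-01-25_02-26-53_155161.json"  # adjust to your ++samples_dir
with open(path) as f:
    samples = json.load(f)

first = next(iter(samples.items())) if isinstance(samples, dict) else samples[0]
print(first)
```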
## Evaluation

Refer to the `scripts/gpt4_dialogue.sh`, `scripts/gpt4_summarization.sh`, and `scripts/alpaca_eval.sh` files, or use the following example commands:
```bash
# Dialogue
python eval.py --config-path=/root/rpo-paired_llama2_7b_hh_all-MiniLM-L6-v2_0.25_2024-01-25_02-26-53_155161 ++mode=sample ++n_samples=256 ++model.eval_batch_size=32 ++samples_dir=examples/
python compare.py -f samples/rpo-paired_llama2_7b_hh_all-MiniLM-L6-v2_0.25_2024-01-25_02-26-53_155161.json -mc 256 -bk chosen -ck policy -r results -j gpt-4-0613

# Summarization
python eval.py --config-path=/root/rpo-paired_llama2_7b_tldr_all-MiniLM-L6-v2_0.25_2024-01-24_12-46-26_630089 ++mode=sample ++n_samples=256 ++model.eval_batch_size=32 ++samples_dir=examples/
python compare.py -f samples/rpo-paired_llama2_7b_tldr_all-MiniLM-L6-v2_0.25_2024-01-24_12-46-26_630089.json -mc 256 -bk chosen -ck policy -r results -j gpt-4-0613 -t summarization

# AlpacaEval2.0 Benchmark
python eval.py --config-path=/root/rpo-paired_llama2_7b_hh_all-MiniLM-L6-v2_0.25_2024-01-25_02-26-53_155161 ++mode=alpacaeval ++model.eval_batch_size=32 ++samples_dir=samples_alpaca/
alpaca_eval --model_outputs samples/alpaca_rpo-paired_llama2_7b_hh_all-MiniLM-L6-v2_0.25_2024-01-25_02-26-53_155161.json --annotators_config 'alpaca_eval_gpt4_turbo_fn' --name "rpo-paired"
```
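Before calling `alpaca_eval`, the model-output file can be spot-checked. AlpacaEval expects a JSON list of records carrying `instruction` and `output` fields; whether `eval.py` writes exactly this schema is an assumption here.

```python
# Spot-check the AlpacaEval model-output file (alpaca_eval expects a list of
# records with "instruction" and "output" fields; the schema written by eval.py
# is assumed to match).
import json

path = "samples/alpaca_rpo-paired_llama2_7b_hh_all-MiniLM-L6-v2_0.25_2024-01-25_02-26-53_155161.json"
with open(path) as f:
    records = json.load(f)

print(len(records), "records")
print(records[0].get("instruction", "")[:200])
print(records[0].get("output", "")[:200])
```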
## Citation

If you find this work useful for your research, please consider citing our paper:

```bibtex
@article{yin2024relative,
  title={Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts},
  author={Yin, Yueqin and Wang, Zhendong and Gu, Yi and Huang, Hai and Chen, Weizhu and Zhou, Mingyuan},
  journal={arXiv preprint arXiv:2402.10958},
  year={2024}
}
```
## Acknowledgements

This repo builds heavily on DPO and KTO. We thank the authors for their excellent work.