
Learning Diverse Attacks on Large Language Models for Robust Red-teaming and Safety Tuning

This repository contains code for Red-teaming with GFlowNet, as described in the paper.

Note: We are actively working on testing and cleaning up the code for release. If you encounter issues before then, please let us know!

Learning Diverse Attacks on Large Language Models for Robust Red-teaming and Safety Tuning
Seanie Lee, Minsu Kim, Lynn Cherif, David Dobre, Juho Lee, Sung Ju Hwang, Kenji Kawaguchi, Gauthier Gidel, Yoshua Bengio, Nikolay Malkin, Moksh Jain
Paper: https://arxiv.org/abs/2405.18540

BibTeX
```bibtex
@article{
  lee2025learning,
  title={Learning Diverse Attacks on Large Language Models for Robust Red-Teaming and Safety Tuning},
  author={Seanie Lee and Minsu Kim and Lynn Cherif and David Dobre and Juho Lee and Sung Ju Hwang and Kenji Kawaguchi and Gauthier Gidel and Yoshua Bengio and Nikolay Malkin and Moksh Jain},
  journal={International Conference on Learning Representations (ICLR)},
  year={2025}
}
```

Installation of Dependencies

```bash
conda create -n redteam python=3.10
conda activate redteam
pip install -r requirements.txt
```
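
To sanity-check the environment before training, you can verify that the core dependencies import correctly and that a GPU is visible. This is a minimal check, assuming `requirements.txt` installs PyTorch and Hugging Face Transformers:

```bash
# Quick sanity check: assumes requirements.txt installs torch and transformers.
python -c "import torch, transformers; print(torch.__version__, transformers.__version__, torch.cuda.is_available())"
```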

SFT

You can download the SFT checkpoint from the link and save it under the directory ./save, or train it yourself with the command below.

Command-line script

```bash
python main.py \
  --mode sft \
  --lr 3e-5 \
  --train_steps 100 \
  --grad_acc_steps 32 \
  --batch_size 1024 \
  --prompt_file ./prompts/alpaca.jsonl \
  --few_shot_file ./prompts/sft_dataset.json \
  --exp_name gpt2-sft-position-final
```
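
The Llama-2 fine-tuning command below passes the SFT checkpoint via an `$SFT_CKPT` variable. A minimal sketch of setting it, assuming the SFT run saves its latest checkpoint under `./save/<exp_name>/latest` (the same layout the evaluation commands below use):

```bash
# Hypothetical path: adjust to wherever the downloaded or trained SFT checkpoint lives.
export SFT_CKPT=./save/gpt2-sft-position-final/latest
```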

GFN fine-tuning with GPT2 target model

You can try three different victim models: ["vicgalle/gpt2-alpaca", "meta-llama/Llama-2-7b-chat-hf", "databricks/dolly-v2-7b"].

Command-line script

```bash
python main.py \
  --exp_name gpt2-gfn \
  --sim_tolerance 0.25 \
  --victim_model vicgalle/gpt2-alpaca \
  --lr 1e-4 \
  --max_len 20 \
  --reward_sched_horizon 500 \
  --train_steps 5000 \
  --seed 42 \
  --temp_low 0.5 \
  --temp_high 2.0 \
  --buffer_size 1000 \
  --lm_sched_end 1.0 \
  --beta 0.1
```

and the command for evaluation:

```bash
python eval.py \
  --ckpt save/gpt2-gfn/latest \
  --output_file gpt2-gfn \
  --victim_model gpt2
```

GFN fine-tuning with Dolly target model

Command-line script

```bash
python main.py \
  --exp_name dolly-gfn \
  --sim_tolerance 0.25 \
  --victim_model databricks/dolly-v2-7b \
  --lr 1e-4 \
  --max_len 20 \
  --reward_sched_horizon 500 \
  --train_steps 5000 \
  --seed 42 \
  --temp_low 0.5 \
  --temp_high 2.0 \
  --lm_sched_horizon 2000 \
  --lm_sched_end 1.0 \
  --buffer_size 1000 \
  --compare reward \
  --beta 0.1
```

and the command for evaluation:

```bash
python eval.py \
  --ckpt save/dolly-gfn/latest \
  --output_file dolly-gfn \
  --victim_model dolly
```

GFN fine-tuning with Gemma-2b-it target model

Command-line script

```bash
python main.py \
  --exp_name gemma-gfn \
  --sim_tolerance 0.3 \
  --victim_model google/gemma-2b-it \
  --lr 1e-4 \
  --reward_sched_horizon 1000 \
  --train_steps 5000 \
  --buffer_size 5000 \
  --seed 42 \
  --max_len 20 \
  --temp_low 0.7 \
  --temp_high 2.0 \
  --lm_sched_end 1.2 \
  --lm_sched_horizon 2000 \
  --compare c_reward \
  --prioritization c_reward \
  --beta 0.1 \
  --metric cosine
```

and the command for evaluation:

```bash
python eval.py \
  --ckpt save/gemma-gfn/latest \
  --output_file gemma-gfn \
  --victim_model gemma
```

GFN fine-tuning with Llama-2-7b target model

Command-line script

```bash
python main.py \
  --exp_name llama-gfn \
  --sim_tolerance 0.3 \
  --victim_model meta-llama/Llama-2-7b-chat-hf \
  --lr 1e-4 \
  --reward_sched_horizon 1000 \
  --train_steps 5000 \
  --buffer_size 5000 \
  --seed 42 \
  --max_len 20 \
  --temp_low 0.7 \
  --temp_high 2.0 \
  --lm_sched_end 1.2 \
  --lm_sched_horizon 2000 \
  --sft_ckpt $SFT_CKPT \
  --compare c_reward \
  --prioritization c_reward \
  --beta 0.1 \
  --metric cosine
```

and the command for evaluation:

```bash
python eval.py \
  --ckpt save/llama-gfn/latest \
  --output_file llama-gfn \
  --victim_model llama
```

MLE-smoothing

For the gpt2 and dolly target models, we set train_steps to 2000 and batch_size to 2048. For the other models, we use 1000 and 1024 for train_steps and batch_size, respectively. Note that $\text{target\_model} \in \{\text{gpt2}, \text{dolly}, \text{gemma}, \text{llama}\}$.
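
A minimal sketch of these per-model settings as shell variables (hypothetical variable names; substitute the values into the commands below):

```bash
# Hypothetical helper: pick the MLE-smoothing hyperparameters for a given target model.
target_model=gemma   # one of: gpt2, dolly, gemma, llama
case "$target_model" in
  gpt2|dolly)  train_steps=2000; batch_size=2048 ;;
  gemma|llama) train_steps=1000; batch_size=1024 ;;
esac
```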

Collect offline samples based on reward:

```bash
python collect_samples.py --exp_name "{target_model}"_gfn
```

Run MLE smoothing:

Command-line script

```bash
python main.py \
  --mode distillation \
  --exp_name "{target_model}"_mle \
  --lr 1e-4 \
  --seed 42 \
  --batch_size "batch_size" \
  --train_steps "train_steps" \
  --grad_acc_steps 8 \
  --model_name gpt2-sft-position-final \
  --few_shot_file offline_dataset/"{target_model}"_gfn
```

For evaluation:

```bash
python eval.py \
  --ckpt save/"{target_model}"_gfn/latest \
  --output_file "{target_model}"_mle \
  --victim_model "{target_model}" \
  --no_lora
```

Transfer experiments

First, train an attacker model with GFlowNet fine-tuning and MLE smoothing against Gemma; you can then transfer it to attack new target models from $\{\text{llama}, \text{mistral}, \text{gemma}, \text{starling}\}$.

Command-line script

```bash
python run_transfer.py \
  --victim_model gemma \
  --exp_name gemma_mle \
  --target_model llama
```
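
To sweep all transfer targets in one go, a minimal sketch looping `run_transfer.py` over the target models listed above (assuming these short names are all accepted by `--target_model`):

```bash
# Hypothetical sweep; assumes llama, mistral, gemma, starling are valid --target_model values.
for target_model in llama mistral gemma starling; do
  python run_transfer.py \
    --victim_model gemma \
    --exp_name gemma_mle \
    --target_model "$target_model"
done
```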

Safety Fine-tuning

For safety fine-tuning, first train the GFlowNet + MLE attacker to red-team Gemma-2b-it and generate attack prompts, then build the safety dataset and fine-tune the model as follows.

Command-line script

```bash
python safety_datset/create_safety_response.py \
  --input_file results/gemma/gemma_mle.json \
  --output_file safety_dataset/gemma_mle.json
```

```bash
python main.py \
  --model_name google/gemma-2b-it \
  --save_dir /network/scratch/s/seanie.lee/redteam/save \
  --mode safety \
  --lr 5e-5 \
  --weight_decay 0.0 \
  --batch_size 32 \
  --num_warmup_steps 0 \
  --epoch 2 \
  --prompt_file safety_dataset/gemma_mle.json \
  --exp_name gfn-safety-tuned
```
