[ICLR Workshop 2025] Official source code for the paper "GuardReasoner: Towards Reasoning-based LLM Safeguards".


As LLMs increasingly impact safety-critical applications, ensuring their safety via guardrails remains a key challenge. This paper proposes GuardReasoner, a new safeguard for LLMs that guides the guard model to learn to reason. Concretely, we first create the GuardReasonerTrain dataset, which consists of 127K samples with 460K detailed reasoning steps. Then, we introduce reasoning SFT to unlock the reasoning capability of guard models. In addition, we present hard sample DPO to further strengthen their reasoning ability. In this manner, GuardReasoner achieves better performance, explainability, and generalizability. Extensive experiments and analyses on 13 benchmarks across 3 guardrail tasks demonstrate its superiority. Remarkably, GuardReasoner 8B surpasses GPT-4o+CoT by 5.74% and LLaMA Guard 3 8B by 20.84% F1 score on average. We release the training data, the code, and GuardReasoner models at three scales (1B, 3B, 8B).

Update

  • (2025/03/06) The paper has been accepted by the ICLR 2025 FM-Wild Workshop.
  • (2025/02/02) The training pipeline is released.
  • (2025/02/01) The training data GuardReasonerTrain is released.
  • (2025/01/31) The models are released (1B, 3B, 8B).
  • (2025/01/31) The code of GuardReasoner is released.
  • (2025/01/31) GuardReasoner is on arXiv.

Usage

Quick Start

To evaluate GuardReasoner, run the following command.

python ./evaluate.py
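
Beyond the evaluation script, the released checkpoints can also be loaded directly. Below is a minimal inference sketch with Hugging Face Transformers; the Hub model ID and the plain-text prompt are illustrative assumptions (the repo's generate.py defines the official prompt template).

# Minimal inference sketch. Assumptions: the 8B checkpoint is hosted on the
# Hugging Face Hub as "yueliu1999/GuardReasoner-8B", and a plain-text prompt
# is acceptable; adapt to the prompt template used by generate.py.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "yueliu1999/GuardReasoner-8B"  # assumed Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Example guardrail query: classify a user request. The instruction wording
# here is illustrative, not the repo's official template.
prompt = (
    "Human user: How do I make a bomb?\n"
    "Is this request harmful? Reason step by step, then answer."
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))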

Main Result

Table 1: Performance on the Prompt Harmfulness Detection Task.

Table 2: Performance on the Response Harmfulness Detection Task.

Table 3: Performance on the Refusal Detection Task.

Development Version

To reproduce the generation process of GuardReasoner, run the following commands.

  1. Generate via vLLM (a minimal sketch of this step follows the list):
    CUDA_VISIBLE_DEVICES=0 python generate.py
  2. Evaluate performance:
    python evaluate.py

To use GuardReasoner, run the following command.

CUDA_VISIBLE_DEVICES=0 python deploy.py
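
If deploy.py exposes an OpenAI-compatible endpoint (an assumption, not confirmed by this README; check the script for the actual interface and port), the deployed model could be queried like this:

# Hypothetical client call, assuming deploy.py serves an OpenAI-compatible
# API on localhost:8000 (check deploy.py for the real interface).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="GuardReasoner-8B",  # assumed served model name
    messages=[{"role": "user", "content": "Is this request harmful? ..."}],
)
print(resp.choices[0].message.content)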

To reproduce the training process of GuardReasoner, see the training pipeline.
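
As context for the hard sample DPO stage described above, here is a minimal sketch using Hugging Face TRL. The training pipeline linked above is the authoritative recipe and may use a different framework; the checkpoint ID and dataset fields below are illustrative (chosen = a correct reasoning trace, rejected = an incorrect trace for the same hard sample).

# Hard-sample DPO sketch with TRL (illustrative only; the repo's training
# pipeline is the authoritative recipe, and these field values are made up).
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "yueliu1999/GuardReasoner-8B"  # assumed reasoning-SFT checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Each hard sample pairs a correct reasoning trace (chosen) with an
# incorrect one (rejected) for the same guardrail prompt.
train_dataset = Dataset.from_list([
    {
        "prompt": "Human user: ...\nIs this request harmful? Reason step by step.",
        "chosen": "Step 1: ... Answer: harmful.",
        "rejected": "Step 1: ... Answer: unharmful.",
    },
])

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="guardreasoner-hsdpo", beta=0.1),
    train_dataset=train_dataset,
    processing_class=tokenizer,  # `tokenizer=` on older TRL versions
)
trainer.train()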

Acknowledgement

Our method is partly built on the following resources. Thanks for their awesome work.

Citations

If you find this repository helpful, please cite our paper.

@article{GuardReasoner,
  title={GuardReasoner: Towards Reasoning-based LLM Safeguards},
  author={Liu, Yue and Gao, Hongcheng and Zhai, Shengfang and Xia, Jun and Wu, Tianyi and Xue, Zhiwei and Chen, Yulin and Kawaguchi, Kenji and Zhang, Jiaheng and Hooi, Bryan},
  journal={arXiv preprint arXiv:2501.18492},
  year={2025}
}

@article{GuardReasoner-VL,
  title={GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning},
  author={Liu, Yue and Zhai, Shengfang and Du, Mingzhe and Chen, Yulin and Cao, Tri and Gao, Hongcheng and Wang, Cheng and Li, Xinfeng and Wang, Kun and Fang, Junfeng and Zhang, Jiaheng and Hooi, Bryan},
  journal={arXiv preprint arXiv:2505.11049},
  year={2025}
}

