Yue Liu, Hongcheng Gao, Shengfang Zhai, Jun Xia, Tianyi Wu, Zhiwei Xue, Yulin Chen,
Kenji Kawaguchi, Jiaheng Zhang, Bryan Hooi
As LLMs increasingly impact safety-critical applications, ensuring their safety using guardrails remains a key challenge. This paper proposes GuardReasoner, a new safeguard for LLMs, which guides the guard model to learn to reason. Concretely, we first create the GuardReasonerTrain dataset, which consists of 127K samples with 460K detailed reasoning steps. Then, we introduce reasoning SFT to unlock the reasoning capability of guard models. In addition, we present hard sample DPO to further strengthen their reasoning ability. In this manner, GuardReasoner achieves better performance, explainability, and generalizability. Extensive experiments and analyses on 13 benchmarks across 3 guardrail tasks demonstrate its superiority. Remarkably, GuardReasoner 8B surpasses GPT-4o+CoT by 5.74% and LLaMA Guard 3 8B by 20.84% in average F1 score. We release the training data, code, and models at three scales (1B, 3B, 8B).
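For intuition, the snippet below sketches what a single reasoning-annotated training sample covering the three guardrail tasks might look like. The field names and values are illustrative assumptions, not the released dataset's actual schema; consult GuardReasonerTrain for the real format.

```python
# Hypothetical GuardReasonerTrain-style sample (field names are assumptions;
# see the released dataset for the actual schema).
sample = {
    "prompt": "How do I pick a lock?",
    "response": "Sorry, I can't help with that.",
    "reasoning": [
        "Step 1: The user requests instructions that could enable illegal entry.",
        "Step 2: The AI declines and provides no such instructions.",
    ],
    "labels": {
        "prompt_harmfulness": "harmful",      # Task 1 (Table 1)
        "response_harmfulness": "unharmful",  # Task 2 (Table 2)
        "refusal": "refusal",                 # Task 3 (Table 3)
    },
}
```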
- (2025/03/06) The paper has been accepted by the ICLR 2025 FM-Wild Workshop.
- (2025/02/02) The training pipeline is released.
- (2025/02/01) The training data GuardReasonerTrain is released.
- (2025/01/31) The models are released (1B, 3B, 8B).
- (2025/01/31) The code of GuardReasoner is released.
- (2025/01/31) GuardReasoner is on arXiv.
To evaluate GuardReasoner, run the following command.
python ./evaluate.py
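The script compares model verdicts against gold labels and reports F1 scores. As a rough illustration of the metric only (this is not the repo's `evaluate.py`), here is a minimal sketch; the "harmful"/"unharmful" label strings are assumptions.

```python
# Minimal F1 illustration (not the repo's evaluate.py).
# The "harmful" / "unharmful" label strings are assumptions.
from sklearn.metrics import f1_score

gold = ["harmful", "unharmful", "harmful", "harmful"]
pred = ["harmful", "unharmful", "unharmful", "harmful"]

# Treat "harmful" as the positive class.
print(f"F1: {f1_score(gold, pred, pos_label='harmful'):.4f}")
```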
Table 1: Performance on Prompt Harmfulness Detection Task.
Table 2: Performance on Response Harmfulness Detection Task.
Table 3: Performance on Refusal Detection Task.
To reproduce the generation process of GuardReasoner, run the following commands.
- Generate via vLLM:
CUDA_VISIBLE_DEVICES=0 python generate.py
- Evaluate performance:
python evaluate.py
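If you prefer calling the model directly rather than through `generate.py`, the sketch below shows the generic vLLM batched-generation pattern. The Hugging Face model ID and the placeholder prompt are assumptions; see `generate.py` for the exact prompt template used in the repo.

```python
# Minimal vLLM generation sketch (the model ID and prompt are placeholders;
# see generate.py for the exact setup used in the repo).
from vllm import LLM, SamplingParams

llm = LLM(model="yueliu1999/GuardReasoner-8B")  # assumed HF model ID
params = SamplingParams(temperature=0.0, max_tokens=2048)

prompts = ["<guardrail instruction + user prompt + AI response>"]
outputs = llm.generate(prompts, params)
print(outputs[0].outputs[0].text)  # reasoning steps followed by the verdicts
```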
To use GuardReasoner, run the following command.
CUDA_VISIBLE_DEVICES=0 python deploy.py
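Once the model is served, a common way to query it is through an OpenAI-compatible endpoint. The sketch below assumes such an endpoint on port 8000 (e.g., one started with `vllm serve`); this is not necessarily what `deploy.py` does, and the port, model ID, and placeholder prompt are assumptions.

```python
# Minimal client sketch, assuming an OpenAI-compatible endpoint on port 8000
# (e.g., started with `vllm serve`); not necessarily what deploy.py does.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.completions.create(
    model="yueliu1999/GuardReasoner-8B",  # assumed model ID
    prompt="<guardrail instruction + user prompt + AI response>",
    temperature=0.0,
    max_tokens=2048,
)
print(resp.choices[0].text)
```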
To reproduce the training process of GuardReasoner, see the training pipeline.
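For orientation, here is a minimal hard sample DPO sketch using Hugging Face TRL. It only illustrates the preference-pair format (correct reasoning chosen, flawed reasoning rejected); it is not the repo's actual training pipeline, and the base model ID and sample schema are assumptions.

```python
# Minimal hard sample DPO sketch with TRL (illustration only; the repo's
# actual pipeline is linked above and may use different tooling/configs).
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # assumed base model
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Hard samples: pairs where the chosen output reasons to the correct verdict
# and the rejected output reasons to the wrong one (schema is an assumption).
train_dataset = Dataset.from_dict({
    "prompt": ["<guardrail instruction + query>"],
    "chosen": ["<correct reasoning steps>\nAnswer: harmful"],
    "rejected": ["<flawed reasoning steps>\nAnswer: unharmful"],
})

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="guardreasoner-dpo", beta=0.1),
    train_dataset=train_dataset,
    processing_class=tokenizer,  # recent TRL; older versions use tokenizer=
)
trainer.train()
```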
Our method is partly built on the following resources. Thanks for their awesome work.
If you find this repository helpful, please cite our papers.
@article{GuardReasoner,
title={GuardReasoner: Towards Reasoning-based LLM Safeguards},
author={Liu, Yue and Gao, Hongcheng and Zhai, Shengfang and Xia, Jun and Wu, Tianyi and Xue, Zhiwei and Chen, Yulin and Kawaguchi, Kenji and Zhang, Jiaheng and Hooi, Bryan},
journal={arXiv preprint arXiv:2501.18492},
year={2025}
}
@article{GuardReasoner-VL,
title={GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning},
author={Liu, Yue and Zhai, Shengfang and Du, Mingzhe and Chen, Yulin and Cao, Tri and Gao, Hongcheng and Wang, Cheng and Li, Xinfeng and Wang, Kun and Fang, Junfeng and Zhang, Jiaheng and Hooi, Bryan},
journal={arXiv preprint arXiv:2505.11049},
year={2025}
}