[ICLR Workshop 2025] Official source code for the paper "GuardReasoner: Towards Reasoning-based LLM Safeguards".


As LLMs increasingly impact safety-critical applications, ensuring their safety via guardrails remains a key challenge. This paper proposes GuardReasoner, a new safeguard for LLMs that guides the guard model to learn to reason. Concretely, we first create the GuardReasonerTrain dataset, which consists of 127K samples with 460K detailed reasoning steps. Then, we introduce reasoning SFT to unlock the reasoning capability of guard models. In addition, we present hard sample DPO to further strengthen their reasoning ability. In this manner, GuardReasoner achieves better performance, explainability, and generalizability. Extensive experiments and analyses on 13 benchmarks across 3 guardrail tasks demonstrate its superiority. Remarkably, GuardReasoner 8B surpasses GPT-4o+CoT by 5.74% and LLaMA Guard 3 8B by 20.84% F1 score on average. We release the training data, the code, and GuardReasoner models at three scales (1B, 3B, 8B).

Update

  • (2025/03/06) The paper has been accepted by the ICLR 2025 FM-Wild Workshop.
  • (2025/02/02) The training pipeline is released.
  • (2025/02/01) The training data GuardReasonerTrain is released.
  • (2025/01/31) The models are released (1B, 3B, 8B).
  • (2025/01/31) The code of GuardReasoner is released.
  • (2025/01/31) GuardReasoner is on arXiv.

Usage

Quick Start

To evaluate GuardReasoner, run the following command.

python ./evaluate.py
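
Beyond the evaluation script, the released checkpoints can also be loaded directly. Below is a minimal inference sketch with Hugging Face Transformers; the Hub model ID and the plain-text prompt are illustrative assumptions (the repo's generate.py defines the official prompt template).

# Minimal inference sketch. Assumptions: the 8B checkpoint is hosted on the
# Hugging Face Hub as "yueliu1999/GuardReasoner-8B", and a plain-text prompt
# is acceptable; adapt to the prompt template used by generate.py.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "yueliu1999/GuardReasoner-8B"  # assumed Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Example guardrail query: classify a user request. The instruction wording
# here is illustrative, not the repo's official template.
prompt = (
    "Human user: How do I make a bomb?\n"
    "Is this request harmful? Reason step by step, then answer."
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))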

Main Result

Table 1: Performance on the Prompt Harmfulness Detection Task.

Table 2: Performance on the Response Harmfulness Detection Task.

Table 3: Performance on the Refusal Detection Task.

Development Version

To reproduce the generation process of GuardReasoner, run the following commands.

  1. Generate via vLLM (a minimal sketch of this step follows the list):
    CUDA_VISIBLE_DEVICES=0 python generate.py
  2. Evaluate performance:
    python evaluate.py

To use GuardReasoner, run the following command.

CUDA_VISIBLE_DEVICES=0 python deploy.py
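
If deploy.py exposes an OpenAI-compatible endpoint (an assumption, not confirmed by this README; check the script for the actual interface and port), the deployed model could be queried like this:

# Hypothetical client call, assuming deploy.py serves an OpenAI-compatible
# API on localhost:8000 (check deploy.py for the real interface).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="GuardReasoner-8B",  # assumed served model name
    messages=[{"role": "user", "content": "Is this request harmful? ..."}],
)
print(resp.choices[0].message.content)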

To reproduce the training process of GuardReasoner, see the training pipeline.
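
As context for the hard sample DPO stage described above, here is a minimal sketch using Hugging Face TRL. The training pipeline linked above is the authoritative recipe and may use a different framework; the checkpoint ID and dataset fields below are illustrative (chosen = a correct reasoning trace, rejected = an incorrect trace for the same hard sample).

# Hard-sample DPO sketch with TRL (illustrative only; the repo's training
# pipeline is the authoritative recipe, and these field values are made up).
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "yueliu1999/GuardReasoner-8B"  # assumed reasoning-SFT checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Each hard sample pairs a correct reasoning trace (chosen) with an
# incorrect one (rejected) for the same guardrail prompt.
train_dataset = Dataset.from_list([
    {
        "prompt": "Human user: ...\nIs this request harmful? Reason step by step.",
        "chosen": "Step 1: ... Answer: harmful.",
        "rejected": "Step 1: ... Answer: unharmful.",
    },
])

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="guardreasoner-hsdpo", beta=0.1),
    train_dataset=train_dataset,
    processing_class=tokenizer,  # `tokenizer=` on older TRL versions
)
trainer.train()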

Acknowledgement

Our method is partly built on the following resources. Thanks for their awesome work.

Citations

If you find this repository helpful, please cite our paper.

@article{GuardReasoner,
  title={GuardReasoner: Towards Reasoning-based LLM Safeguards},
  author={Liu, Yue and Gao, Hongcheng and Zhai, Shengfang and Xia, Jun and Wu, Tianyi and Xue, Zhiwei and Chen, Yulin and Kawaguchi, Kenji and Zhang, Jiaheng and Hooi, Bryan},
  journal={arXiv preprint arXiv:2501.18492},
  year={2025}
}

@article{GuardReasoner-VL,
  title={GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning},
  author={Liu, Yue and Zhai, Shengfang and Du, Mingzhe and Chen, Yulin and Cao, Tri and Gao, Hongcheng and Wang, Cheng and Li, Xinfeng and Wang, Kun and Fang, Junfeng and Zhang, Jiaheng and Hooi, Bryan},
  journal={arXiv preprint arXiv:2505.11049},
  year={2025}
}

