Note
Official codebase for the paper "Consistent Paths Lead to Truth: Self-Rewarding Reinforcement Learning for LLM Reasoning". The training code is based on the OpenRLHF framework, and the evaluation code is based on the project Math-Verify.
We propose a novel self-rewarding Reinforcement Learning (RL) framework to enhance Large Language Model (LLM) reasoning by leveraging the consistency of intermediate reasoning states across different response trajectories. Our key insight is that correct responses often exhibit consistent trajectory patterns in terms of model likelihood: their intermediate reasoning states tend to converge toward their own final answers (high consistency) with minimal deviation toward other candidates (low volatility). Inspired by this observation, we introduce CoVo, an intrinsic reward mechanism that integrates Consistency and Volatility via a robust vector-space aggregation strategy, complemented by a curiosity bonus to promote diverse exploration. CoVo enables LLMs to perform reinforcement learning in a self-rewarding manner, offering a scalable pathway for learning to reason without external supervision. Extensive experiments on diverse reasoning benchmarks show that CoVo achieves performance comparable to or even surpassing supervised RL.
Figure: Normalized distance curves of correct and incorrect trajectories with varying numbers of states.
We first derive the two features of a trajectory, consistency and volatility, and then define the intrinsic reward and the curiosity function on top of them; please refer to the paper for the exact formulations.
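Purely as an illustration of the idea, the sketch below computes the two features from a per-state distance matrix (one row per intermediate reasoning state, one column per candidate answer) and combines them into an intrinsic reward with a simple curiosity bonus. The distance construction, the weighted combination, and the hyperparameters alpha, beta, gamma, and visit_count are simplifying assumptions for illustration, not the formulation used by CoVo.

```python
import numpy as np

def intrinsic_reward_sketch(dist, own_idx, visit_count, alpha=1.0, beta=1.0, gamma=0.1):
    """Illustrative CoVo-style intrinsic reward (not the paper's exact formulas).

    dist:        (num_states, num_candidates) array; dist[t, j] is a normalized distance
                 (e.g., derived from model likelihood) between intermediate state t and
                 candidate answer j.
    own_idx:     column index of this trajectory's own final answer.
    visit_count: how many times this answer has been sampled so far (curiosity bonus).
    """
    # Consistency: intermediate states of a correct trajectory tend to converge
    # toward its own final answer, i.e. the distance to that answer stays small.
    consistency = 1.0 - dist[:, own_idx].mean()

    # Volatility: deviation toward other candidate answers along the trajectory,
    # approximated here by how close the states drift to the nearest competing answer.
    others = np.delete(dist, own_idx, axis=1)
    volatility = (1.0 - others.min(axis=1)).mean()

    # Combine the two features (CoVo aggregates them in vector space, which this
    # simple weighted difference only approximates) and add a curiosity bonus
    # that favors less frequently sampled answers.
    return alpha * consistency - beta * volatility + gamma / np.sqrt(visit_count + 1)

# Toy example: a 4-state trajectory scored against 3 candidate answers.
d = np.array([[0.2, 0.8, 0.9],
              [0.1, 0.7, 0.8],
              [0.1, 0.9, 0.9],
              [0.0, 0.8, 1.0]])
print(intrinsic_reward_sketch(d, own_idx=0, visit_count=3))
```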
We recommend Python >= 3.10 with the following key packages, installed via conda or docker:
- ray>=2.40.0
- torch>=2.5.1
- vllm>=0.7.2
- deepspeed>=0.15.0
- transformers>=4.48.3
- flash-attn==2.7.0.post2
For a conda environment, run the following commands to install the required packages:
cd CoVo/covo
conda create -n covo python=3.10
conda activate covo
pip install -r requirements.txt
For a Docker environment, run the following commands to build the Docker image and start the container:
cd CoVo/covo/dockerfile
docker pull nvcr.io/nvidia/pytorch:24.07-py3
docker build -t covo:v1.0 . # build docker image
cd ../..
docker run --name covo --runtime=nvidia --gpus all -it --shm-size="32g" -v $PWD:/workspace covo:v1.0 bash
First, you need to prepare the initial model weights and download the instruction dataset from this url, then put the data under the directory CoVo/covo/dataset.
After that, you can run the following command to start training:
ray start --head --node-ip-address 0.0.0.0 # start the ray cluster
cd CoVo/covo
sh examples/scripts/train_reinforce_qwen_ray_riv.sh # training script
We provide the description of key parameters in the training script:
Parameters | Description |
---|---|
--pretrain | Absolute path to the pretrained model. |
--save_path | Absolute path to save the trained model. |
--prompt_data | Absolute path to the instructions used for training. |
--eval_data | Absolute path to the evaluation dataset. |
--enable_accuracy_filter | Filter out prompts whose sampled answers are all identical (i.e., either too easy or too hard). |
--enable_curiosity | Use the curiosity reward during training. |
--intrinsic_reward | Either riv or ril, denoting vectorial and linear aggregation, respectively. |
--logging_path | Path to save training logs such as rewards, loss, and KL divergence. |
--save_output_path | Path to save sampling results and other information during training. |
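For illustration only, these parameters appear as command-line flags inside examples/scripts/train_reinforce_qwen_ray_riv.sh. The fragment below uses placeholder paths and assumes --enable_accuracy_filter and --enable_curiosity are boolean switches; check the actual script for the full launcher command.
--pretrain /abs/path/to/pretrained_model \
--save_path /abs/path/to/saved_model \
--prompt_data /abs/path/to/CoVo/covo/dataset/<train_file> \
--eval_data /abs/path/to/<eval_file> \
--enable_accuracy_filter \
--enable_curiosity \
--intrinsic_reward riv \
--logging_path /abs/path/to/logs \
--save_output_path /abs/path/to/outputs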
Please refer to the evaluation directory for detailed evaluation methods. We provide evaluations across three reasoning domains using 7 popular benchmarks:
Reasoning Domain | Benchmarks |
---|---|
Mathematics | MATH-500, AMC, GSM8K, Olympiad Bench |
Commonsense | MMLU-Pro, CommonsenseQA |
Science | GPQA |
If you find this repository useful, please star🌟 this repo and cite🔗 our paper.
@article{zhang2025consistent,
title={Consistent Paths Lead to Truth: Self-Rewarding Reinforcement Learning for LLM Reasoning},
author={Zhang, Kongcheng and Yao, Qi and Liu, Shunyu and Wang, Yingjie and Lai, Baisheng and Ye, Jieping and Song, Mingli and Tao, Dacheng},
journal={arXiv preprint arXiv:2506.08745},
year={2025}
}
We thank OpenRLHF for providing the awesome open-source RL infrastructure. We also thank the developers of Qwen, Llama, and DeepSeek-R1 for their innovation and contributions to the open-source community.
Please feel free to contact me via email ([email protected]) if you are interested in my research :)