Agentic Reward Modeling


Agentic reward modeling is a reward system that combines reward models with verifiable correctness signals from multiple aspects to provide reliable rewards. This repo contains an empirical implementation, a reward agent named RewardAgent, which combines human preference rewards with two verifiable signals, factuality and instruction following, to produce more reliable rewards. The overall architecture of RewardAgent is as follows:

RewardAgent Architecture
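
The judger_type used throughout the scripts below is weighted_sum, which conceptually combines the scalar preference score with the verifiable correctness signals. Here is a minimal illustrative sketch of such a combination in Python; the weights and function signature are assumptions for exposition, not the repo's actual implementation:

# Illustrative weighted-sum judger sketch. The weights are hypothetical
# placeholders; see this repo's judger implementation for the real logic.

def weighted_sum_reward(preference_score: float,
                        factuality_score: float,
                        instruction_following_score: float,
                        weights: tuple = (0.5, 0.25, 0.25)) -> float:
    """Combine a human-preference reward with two verifiable signals.

    All scores are assumed to be normalized to a common scale.
    """
    w_pref, w_fact, w_if = weights
    return (w_pref * preference_score
            + w_fact * factuality_score
            + w_if * instruction_following_score)

# A response with a strong preference score but weak instruction
# following is pulled down by the verifiable signal:
print(weighted_sum_reward(0.9, 0.8, 0.2))  # 0.7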

RewardAgent achieves strong results on reward model benchmarks, best-of-n search, and DPO training. The figure below presents the best-of-n search results, with Llama3-8B-Instruct as the policy model.

Best-of-N Search

For more details, please refer to our paper.


0. Setup

Before running any of the scripts, ensure you have the necessary environment set up. You can install the required dependencies using the requirements.txt file:

pip install -r requirements.txt

If you are using non-API-based LLMs, you need to deploy them locally first; refer to the vllm_serve.sh script for deployment. Here is an example of how to deploy a local LLM:

Deploy Llama3-8B-Instruct

CUDA_VISIBLE_DEVICES=2 vllm serve <path> \
    --served-model-name llama-3-8b \
    --port 8001 \
    --tensor-parallel-size 1 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.9

Deploy Qwen2.5-Coder 7B

CUDA_VISIBLE_DEVICES=3 vllm serve <path> \
    --served-model-name qwen25-coder-7b \
    --port 8002 \
    --tensor-parallel-size 1 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.9

Replace <path> with the actual paths to your models.
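
Once a server is up, you can sanity-check it through vLLM's OpenAI-compatible API. A minimal check against the Llama3-8B-Instruct deployment above (the host, port, and placeholder API key are local-setup assumptions):

# Sanity check for the locally deployed model via vLLM's
# OpenAI-compatible endpoint; adjust base_url to your deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="llama-3-8b",  # must match --served-model-name above
    messages=[{"role": "user", "content": "Say hello."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)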

1. Reward Modeling Benchmarking

The run_rm_eval_batch.sh script runs the reward modeling benchmarks. The benchmark data is located in the data/ directory, and the results will be stored in the eval_results/ directory. IFBench is newly constructed and is also included there. To run this script, execute (run_rm_eval_batch.sh):

#!/bin/bash
datasets=(RM-Bench/total_dataset.chat_normal_converted.json RM-Bench/total_dataset.chat_hard_converted.json JudgeBench/judgebench-knowledge.json IFBench/data.converted.json)
output_dir_prefixes=(rmbench-chat-normal rmbench-chat-hard judgebench-knowledge ifbench)

model=ArmoRM-Llama3-8B-v0.1
# planner=gpt-4o-mini-2024-07-18
planner=llama-3-8b

judger_type=weighted_sum

# if using an OpenAI model, set these environment variables
export OPENAI_BASE_URL="xxx"
export OPENAI_API_KEY="xxx"


# loop over datasets and output_dir_prefixes
for i in "${!datasets[@]}"; do
    dataset="${datasets[$i]}"
    output_prefix="${output_dir_prefixes[$i]}"

    echo "dataset: $dataset"
    echo "output: $output_prefix"

    CUDA_VISIBLE_DEVICES=6 python scripts/run_agent_rm.py \
        --pref_sets \
        --trust_remote_code \
        --model ${model} \
        --planner ${planner} \
        --judger_type ${judger_type} \
        --coder qwen25-coder-7b \
        --dataset data/${dataset} \
        --output_dir eval_results/${output_prefix}/reward_agent_${model}_${planner}_${judger_type} \
        --knowledge_source local \
        --num_threads 32

done
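
After the loop finishes, a typical next step is computing pairwise accuracy over the outputs. The sketch below is hypothetical: the file path and the scores_chosen/scores_rejected field names are assumptions, so check the actual files under eval_results/ for the real schema.

# Hypothetical post-processing sketch: pairwise accuracy from one
# output file. The path and field names are assumptions, not the
# repo's actual schema.
import json

correct = total = 0
with open("eval_results/ifbench/results.jsonl") as f:  # hypothetical path
    for line in f:
        record = json.loads(line)
        correct += record["scores_chosen"] > record["scores_rejected"]
        total += 1
print(f"accuracy: {correct / total:.3f}")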

2. Best-of-N Search with RewardAgent

Note: Before running run_bon.sh, you need to generate n responses in the best_of_n directory. The script best_of_n/run_generation.sh is provided for this purpose.

To generate n responses, navigate to the best_of_n directory and run (run_generation.sh):

export OPENAI_BASE_URL="xxxx"
export OPENAI_API_KEY="xxx"
python generate.py \
    --input_file data/IFEval/ifeval_input_data.jsonl \
    --save_dir ifeval/gpt-4o-2024-11-20 \
    --model_name_or_path gpt-4o-2024-11-20 \
    --api_model \
    --temperature 1.0 \
    --n 32

This script will generate the necessary responses for the best-of-n search.

The run_bon.sh script is used to run the best-of-n search with the RewardAgent. This script iterates over different values of n (2, 4, 8, 16, 32) and runs the search for each value. To run this script, execute (run_bon.sh):

for n in 2 4 8 16 32
do
    CUDA_VISIBLE_DEVICES=5 python scripts/run_bon_agent_rm.py \
        --pref_sets \
        --trust_remote_code \
        --model ArmoRM-Llama3-8B-v0.1 \
        --planner llama-3-8b \
        --coder qwen25-coder-7b \
        --judger_type weighted_sum \
        --n $n \
        --dataset reward-agent/best_of_n/ifeval/Llama-3-8B-Instruct/32_responses.jsonl \
        --output_dir eval_results/best_of_n/ifeval/llama3_8b/reward_agent \
        --knowledge_source local \
        --num_threads 64
done

You can change the file path accordingly.
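
Underneath the script, best-of-n search reduces to scoring each candidate response and keeping the argmax. A minimal sketch, where score_fn stands in for the reward agent's scoring call (an assumption, not the repo's API):

# Minimal best-of-n selection sketch; score_fn is a placeholder for
# the reward agent's prompt/response scoring function.
from typing import Callable, List

def best_of_n(prompt: str, responses: List[str],
              score_fn: Callable[[str, str], float]) -> str:
    """Return the highest-scoring of the n candidate responses."""
    return max(responses, key=lambda r: score_fn(prompt, r))

# Example with a dummy scorer that prefers shorter responses:
print(best_of_n("2+2?", ["4", "four", "it is 4"], lambda p, r: -len(r)))  # "4"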

3. Preference Pairs Annotation with RewardAgent

Note: Before running run_annotation.sh, you may first need to generate n on-policy responses in the best_of_n directory. The run_annotation.sh script annotates preference pairs with the RewardAgent: it processes a given dataset and produces annotated results. To run this script, execute (run_annotation.sh):

file="8_responses"

CUDA_VISIBLE_DEVICES=5 python scripts/run_annotation.py \
    --pref_sets \
    --trust_remote_code \
    --model ArmoRM-Llama3-8B-v0.1 \
    --planner llama-3-8b \
    --coder qwen25-coder-7b \
    --judger_type weighted_sum \
    --dataset reward-agent/best_of_n/UltraFeedback/zephyr-7b-sft-full/${file}.json \
    --output_dir reward-agent/best_of_n/UltraFeedback/zephyr-7b-sft-full/reann \
    --output_file ${file}.jsonl \
    --knowledge_source local \
    --n 64

You can change the file path accordingly.
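
For downstream DPO training, the annotated scores are typically converted into preference pairs, for example by pairing the highest- and lowest-scoring responses per prompt. A sketch under that common convention (the repo's exact pairing rule may differ):

# Sketch of building a DPO preference pair from scored responses.
# Pairing best vs. worst is a common convention and an assumption
# here, not necessarily this repo's exact choice.
from typing import Dict, List, Tuple

def make_preference_pair(prompt: str,
                         scored: List[Tuple[str, float]]) -> Dict[str, str]:
    """scored: list of (response, reward) pairs for one prompt."""
    ranked = sorted(scored, key=lambda x: x[1], reverse=True)
    return {"prompt": prompt,
            "chosen": ranked[0][0],
            "rejected": ranked[-1][0]}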

4. Deploying Verifiable Signals as a Service

You can also deploy the verifiable signals as a service for plug-and-play integration with existing reward models, so that the combined rewards can be used seamlessly in RL training.

To deploy the verifiable signals as a service, use the reward_agent/server.py script, which sets up a web service exposing the verifiable signals via HTTP endpoints. Once the service is running, you can integrate it with your RL training pipeline.
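
Integration then amounts to posting a prompt and candidate response and folding the returned signal scores into your reward. The route and payload below are hypothetical; consult reward_agent/server.py for the actual interface:

# Hypothetical client for the verifiable-signal service. The "/score"
# route and the payload/response fields are assumptions; check
# reward_agent/server.py for the real endpoints.
import requests

payload = {
    "instruction": "Write a haiku about autumn in exactly three lines.",
    "response": "Leaves drift on cold wind\nA red maple lets one go\nThe pond keeps its shape",
}
resp = requests.post("http://localhost:8000/score", json=payload, timeout=30)
print(resp.json())  # e.g. {"factuality": ..., "instruction_following": ...}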

5. Acknowledgements

Our repository references the RewardBench repository. We appreciate the valuable insights and foundational work provided by the RewardBench team.

6. Citation

If you find our repository useful, please cite:

@article{peng2025agentic,
  title={Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems},
  author={Peng, Hao and Qi, Yunjia and Wang, Xiaozhi and Yao, Zijun and Xu, Bin and Hou, Lei and Li, Juanzi},
  journal={arXiv preprint arXiv:2502.19328},
  year={2025}
}
