With respect to improving the reasoning accuracy of large language models (LLMs), the representative reinforcement learning (RL) method GRPO often fails when reward variance is insignificant, while verification methods based on process reward models (PRMs) suffer from difficult training data acquisition and limited verification effectiveness. To tackle these problems, we propose ReST-RL, a general RL framework that implements a two-stage reinforcement learning pipeline for LLMs:
- Stage 1 — Self-Training (Policy Improvement via ReST-GRPO): Sample on target datasets, process the generated completions into compatible synthesized data, and train the policy with an optimized group-relative policy optimization routine.
- Stage 2 — Value Model Training and Assisted Decoding with VM-MCTS: Collect reward signals with MCTS-based sampling, process them into value targets, train a value model, and use it to assist decoding via VM-MCTS.
Finally, the trained policy and value model are evaluated on held-out tasks.
- Unified interfaces for sampling, processing, training, and evaluation
- Multiple datasets: BigCodeBench, DS1000, APPS and any other compatible dataset
- vLLM support for efficient generation and tensor-parallel inference
- ReST-GRPO for self-training, enabling high-signal reward data collection
- Transformer-based reward/value models trained with DeepSpeed, with general support for Hugging Face Transformers
- experiment/sample.py: Common sampling for Stage 1
- experiment/process_gen_data.py: Process generated data for GRPO/DPO/SFT or MCTS rewards
- models/train_grpo.py: GRPO training entrypoint for the policy
- models/run_grpo.sh: Example command for GRPO training
- experiment/sample_mcts.py: MCTS sampling for Stage 2
- rms/train.py: Value (reward) model training
- rms/train.sh: Example command for value model training
- evaluation/eval.py: Unified evaluation for the LLM policy, optionally assisted by the value model
- Python packages are listed in requirements.txt.
- Recommended: a CUDA-enabled environment for local vLLM and model training.
# From repository root
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
If you plan to use local model inference instead of an API, ensure GPUs are available. When --use_api is omitted, the scripts assume local inference and configure vLLM across the available CUDA devices.
Prepare datasets under data/ following the file names expected by the data readers:
- BigCodeBench: data/BigCodeBench/data.json
- DS1000: data/DS1000/data.jsonl
- APPS: data/APPS/data_with_test.jsonl for training and data/APPS/test_500.jsonl for evaluation

Note: The original data files have been compressed into datafiles.zip in the repository root. To use the datasets, please extract the zip file.

You can also override paths with --directory and --file in the sampling/evaluation scripts, or add your own source dataset for training.
Generates multiple completions per problem with on-the-fly verification.
# Example: sample on BigCodeBench with a local model (vLLM)
python -m experiment.sample \
--domain BigCodeBench \
--backend Qwen/Qwen2.5-Coder-7B-Instruct \
--n 5 \
--temperature 0.7 \
--max_tokens 1024
# Switch dataset
# --domain DS1000 or --domain APPS
# Optional: --directory /abs/path/to/data --file data.jsonl
# Optional: --idx_list 0 1 2 3 (use subset)
# Use API instead of local model: add --use_api and set --backend accordingly
Outputs are written to:
generate/Common/{BigCodeBench|DS1000|APPS}/{backend}/temp_{temperature}_tokens_{max_tokens}_completions.jsonl
Converts sampling outputs into GRPO-ready training files. Modes:
- --mode grpo
- Optional sub-sampling via --n_sample and exponential decay via --alpha (see the illustrative filtering sketch after the example below)
# Aggregate across datasets and build GRPO data
python -m experiment.process_gen_data \
--domain code \
--method Common \
--mode grpo \
--backend Qwen/Qwen2.5-Coder-7B-Instruct \
--temperature 0.7 \
--max_tokens 1024 \
--std_accept_threshold_grpo 0.05 \
--completion_accept_threshold_grpo 0.9 \
--n_sample 0.5 \
--alpha 0.95 \
--do_aggregate
# GRPO data will be saved under
# generate/Common/{Domain}/{backend}/temp_*_grpo_data_*.jsonl
# and when aggregated under
# generate/Common/All/{backend}/temp_*_grpo_data_*.jsonl
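For intuition about the acceptance thresholds and the --n_sample/--alpha sub-sampling, the filtering can be pictured roughly as follows. This is an illustrative sketch with assumed data structures and assumed threshold semantics, not the code in experiment/process_gen_data.py:

import random
from statistics import mean, pstdev

def build_grpo_data(samples, std_accept_threshold=0.05,
                    completion_accept_threshold=0.9, n_sample=0.5, alpha=0.95):
    """Illustrative only; `samples` is assumed to be a list of
    (prompt, [(completion_text, reward), ...]) pairs."""
    kept = []
    for prompt, completions in samples:
        rewards = [r for _, r in completions]
        if pstdev(rewards) < std_accept_threshold:
            continue  # near-zero reward variance: GRPO advantages would vanish
        if mean(rewards) > completion_accept_threshold:
            continue  # assumed: prompt already solved reliably, little signal left
        kept.append((prompt, completions))
    if not kept:
        return []
    # Assumed exponential-decay sub-sampling: sort groups by reward variance and
    # draw roughly n_sample * len(kept) of them with weights alpha ** rank.
    kept.sort(key=lambda item: -pstdev([r for _, r in item[1]]))
    weights = [alpha ** i for i in range(len(kept))]
    target = max(1, int(n_sample * len(kept)))
    chosen = set()
    while len(chosen) < target:
        (i,) = random.choices(range(len(kept)), weights=weights, k=1)
        chosen.add(i)
    return [kept[i] for i in sorted(chosen)]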
Use Accelerate and optional DeepSpeed, following models/run_grpo.sh.
# Example (mirrors models/run_grpo.sh)
accelerate launch models/train_grpo.py \
"Qwen/Qwen2.5-Coder-7B-Instruct" \
"models/ckpts/grpo/Qwen2.5-Coder-7B-Instruct/1" \
"generate/Common/All/Qwen--Qwen2.5-Coder-7B-Instruct/temp_0.7_tokens_1024_grpo_data_0.05_0.9_0.5_0.95.jsonl" \
--max_prompt_length 1024 \
--max_completion_length 1024 \
--num_generations 8 \
--log_completions \
--deepspeed_config config/zero2_config.json \
--lr 1e-7 \
--epochs 1 \
--save_steps 0.5 \
--batch_size_per_device 2 \
--gradient_accumulation_steps 1 \
--symbol_reward 1e-3 \
--trailing_penalty 1e-6 \
--report_to wandb
Checkpoints will be saved to save_dir, e.g. models/ckpts/grpo/Qwen2.5-Coder-7B-Instruct/1.
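For intuition on --num_generations and on the low-reward-variance failure mode mentioned in the overview: GRPO normalizes each completion's reward against its sampling group. A minimal sketch of the group-relative advantage:

from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages for one group of --num_generations completions.
    If all rewards in the group are (nearly) identical, the advantages collapse
    to ~0, which is the degenerate case ReST-GRPO's data filtering avoids."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

print(group_relative_advantages([0.0, 0.0, 0.5, 1.0]))  # informative group
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))  # degenerate group -> all ~0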
Searches the code space with MCTS and collects both terminal and intermediate rewards.
python -m experiment.sample_mcts \
--domain APPS \
--backend Qwen/Qwen2.5-Coder-7B-Instruct \
--iteration_limit 100 \
--num_sample 5 \
--num_decision 5 \
--exploration_constant 0.2 \
--eps 0.1 \
--temperature 0.7 \
--max_tokens 1024
# Outputs under
# generate/MCTS/{Domain}/{backend}/time_*_iter_*_sample_*_..._tokens_*_thought_{yes|no}.jsonl
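For intuition on --exploration_constant: MCTS samplers typically choose which partial solution to extend with a UCT-style rule that balances estimated value against visit counts. A minimal sketch with assumed node statistics (--eps and the repository's exact selection rule are not modeled here):

import math

def uct_select(children, parent_visits, exploration_constant=0.2):
    """children: list of (value_sum, visits) statistics for each candidate node.
    Returns the index of the child maximizing mean value + exploration bonus."""
    def score(stats):
        value_sum, visits = stats
        if visits == 0:
            return float("inf")  # always expand unvisited children first
        exploit = value_sum / visits
        explore = exploration_constant * math.sqrt(math.log(parent_visits) / visits)
        return exploit + explore
    return max(range(len(children)), key=lambda i: score(children[i]))

# Example: two visited children and one unvisited one
print(uct_select([(2.0, 4), (1.5, 2), (0.0, 0)], parent_visits=6))  # -> 2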
Transforms MCTS traces into reward supervision files.
python -m experiment.process_gen_data \
--domain code \
--method MCTS \
--backend Qwen/Qwen2.5-Coder-7B-Instruct \
--temperature 0.7 \
--max_tokens 1024 \
--do_aggregate
# Reward files written per-domain and (optionally) aggregated under
# generate/MCTS/All/{backend}/*_rewards.jsonl
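Conceptually, each partial-solution prefix in the MCTS traces is paired with a value target derived from the rewards observed beneath it in the search tree. A rough sketch; the record schema (prompt, steps, reward) is hypothetical and will differ from the actual files:

import json
from collections import defaultdict
from statistics import mean

def prefix_value_targets(trace_path):
    """Illustrative only: aggregate downstream rewards per (prompt, prefix) pair
    into a single value target (here, their mean)."""
    returns = defaultdict(list)
    with open(trace_path) as f:
        for line in f:
            rec = json.loads(line)                 # hypothetical schema
            for step in rec["steps"]:              # partial programs along the search path
                returns[(rec["prompt"], step["prefix"])].append(rec["reward"])
    return [{"prompt": p, "prefix": pre, "value": mean(rs)}
            for (p, pre), rs in returns.items()]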
The default class is transformers_scalar with the deepspeed training paradigm.
# Example (mirrors rms/train.sh)
deepspeed rms/train.py \
Qwen/Qwen2.5-Coder-7B-Instruct \
rms/ckpts/Qwen2.5-Coder-7B-Instruct/1 \
generate/MCTS/All/Qwen--Qwen2.5-Coder-7B-Instruct/time_None_iter_100_sample_5_decision_3_exp_0.2_eps_0.1_temp_0.7_tokens_1024_thought_no_domains_BigCodeBench_DS1000_APPS_rewards.jsonl \
transformers_scalar \
--max_length 2048 \
--deepspeed_config rms/config/zero3_config.json \
--lr 1e-7 \
--epochs 2 \
--save_steps 0.5 \
--batch_size_per_device 1
Adjust --rm_class to standard_scalar or transformers_prob as needed, but note the implementations supported in rms/train.py (you may also implement your own reward models). We recommend simply using transformers_scalar.
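For reference, a transformers_scalar-style value model is conceptually a transformer backbone with a scalar regression head scoring a prompt plus partial solution. A minimal sketch assuming a Hugging Face backbone; the actual class in rms/train.py may differ:

import torch
import torch.nn as nn
from transformers import AutoModel

class ScalarValueModel(nn.Module):
    """Minimal sketch: backbone + linear head predicting a value in [0, 1]."""
    def __init__(self, model_name: str):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(model_name)
        self.value_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        last_idx = attention_mask.sum(dim=1) - 1               # last non-padding token
        pooled = hidden[torch.arange(hidden.size(0)), last_idx]
        return torch.sigmoid(self.value_head(pooled)).squeeze(-1)

# Training would regress these scores toward the MCTS-derived value targets (e.g. MSE loss).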
Evaluate the trained policy (optionally with a value/reward model) on APPS.
# Common evaluation without reward model
after_policy_backend="ckpts/grpo/Qwen2.5-Coder-7B-Instruct/1" # or a HF model id
python -m evaluation.eval \
--method Common \
--domain APPS \
--backend ${after_policy_backend} \
--file test_500.jsonl \
--n 1 \
--num_sample 5 \
--temperature 0.7 \
--max_tokens 1024
# Common evaluation with other reward models
python -m evaluation.eval \
--method Common \
--domain APPS \
--backend ${after_policy_backend} \
--rm_backend Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 \
--rm_class skywork \
--rm_type orm \
--max_length 2048 \
--file test_500.jsonl \
--n 1 \
--num_sample 5 \
--temperature 0.7 \
--max_tokens 1024
# VM-MCTS assisted decoding (requires value model)
python -m evaluation.eval \
--method MCTS \
--domain APPS \
--backend ${after_policy_backend} \
--rm_backend ckpts/rm/Qwen2.5-Coder-7B-Instruct/1 \
--rm_class transformers_scalar \
--rm_type prm \
--max_length 2048 \
--iteration_limit 15 \
--num_decision 5 \
--exploration_constant 0.1 \
--eps 0.1 \
--n 1 \
--num_sample 5 \
--temperature 0.7 \
--max_tokens 1024
Evaluation writes to output/apps_results/{Common|MCTS}/{backend}/:
- *_completions.jsonl
- *_verified.jsonl
- *_results.jsonl (final metrics)
- GPU allocation: Scripts automatically detect CUDA. For common evaluation with an RM, or for MCTS, at least 2 GPUs are required (policy and RM on separate devices). See the logic in evaluation/eval.py for vLLM tensor-parallel sizing, and the illustrative sketch after this list.
- API vs. local: Add --use_api to call remote backends; otherwise vLLM is used locally with --vllm_tensor_parallel_size and --vllm_gpu_memory_utilization.
- Formatting and stops: If your tokenizer lacks a chat template, the code logs a warning. Stop strings may be dataset-dependent and are auto-filled by prompts/stops.get_stop_strings.
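As a rough illustration of the two-GPU requirement (not the actual logic in evaluation/eval.py): the policy and the reward/value model can be pinned to disjoint devices, with the vLLM tensor-parallel size derived from the policy's share:

import torch

# Illustrative only: split visible GPUs between the vLLM policy and the RM.
n_gpus = torch.cuda.device_count()
assert n_gpus >= 2, "Common eval with an RM (or MCTS decoding) needs at least 2 GPUs"

policy_gpus = list(range(n_gpus - 1))   # e.g. GPUs 0..n-2 serve the policy via vLLM
rm_gpu = n_gpus - 1                     # last GPU is reserved for the reward/value model

# The tensor-parallel size must also satisfy the model's attention-head
# divisibility constraints, so the real script may pick a smaller value.
vllm_tensor_parallel_size = len(policy_gpus)
print(f"policy GPUs: {policy_gpus}, RM GPU: {rm_gpu}, tp={vllm_tensor_parallel_size}")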
To facilitate related research, we have released the ReST-RL-reinforced Qwen3-8B model and its corresponding value model on Hugging Face.
If you use this repository in your research, please cite this project.
@misc{zhoubian2025restrlachievingaccuratecode,
title={ReST-RL: Achieving Accurate Code Reasoning of LLMs with Optimized Self-Training and Decoding},
author={Sining Zhoubian and Dan Zhang and Jie Tang},
year={2025},
eprint={2508.19576},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2508.19576},
}
- Zhipu AI
- BigCodeBench, DS1000, APPS datasets
- Evalplus for benchmark evaluation
- Hugging Face Transformers and Accelerate
- DeepSpeed and vLLM