
AwesomeCode_on_LLMReasoningRL

awesome code on LLM reasoning reinforcement learning from the beautiful world 🤯 We are not here to judge the performance of all kinds of methods; we are here to appreciate the beauty in diversity.


ReFT: Reasoning with Reinforced Fine-Tuning (2401.08967)



Tulu 3: Pushing Frontiers in Open Language Model Post-Training (2411.15124)

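Tulu 3's RLVR (Reinforcement Learning with Verifiable Rewards) replaces a learned reward model with a programmatic check against ground truth. A minimal sketch of such a verifiable reward, assuming a simple final-answer extraction heuristic (the function name and regex are illustrative, not Tulu 3's actual code):

```python
import re

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the extracted final answer matches ground truth, else 0.0."""
    # Look for a number after "the answer is" -- a common, simplistic extraction heuristic.
    match = re.search(r"answer is\s*:?\s*(-?\d+(?:\.\d+)?)", completion, re.IGNORECASE)
    if match is None:
        return 0.0
    return 1.0 if match.group(1) == ground_truth else 0.0
```

Because the reward is a deterministic rule rather than a model, it cannot be reward-hacked in the usual sense; the trade-off is that only tasks with checkable answers qualify.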


PRIME (Process Reinforcement through IMplicit REwards), an open-source solution for online RL with process rewards. This work stems from the implicit process reward modeling (PRM) objective. Built upon veRL.

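The implicit-PRM idea underlying PRIME is that token-level process rewards can be read off as β-scaled log-likelihood ratios between a model trained on outcome labels and a frozen reference model, with no step-level annotations. A toy sketch of that computation (names and shapes are assumptions for illustration, not PRIME's code):

```python
def implicit_process_rewards(logprobs, ref_logprobs, beta=0.05):
    """Token-level implicit process rewards: r_t = beta * (log pi(y_t) - log pi_ref(y_t)).

    logprobs / ref_logprobs: per-token log-probabilities of the sampled tokens
    under the implicit PRM (trained on outcome labels) and the frozen reference.
    """
    return [beta * (lp - ref) for lp, ref in zip(logprobs, ref_logprobs)]
```

Tokens the PRM now prefers over the reference get positive reward, dispreferred tokens get negative reward, which yields dense per-step signal from outcome-only supervision.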


TinyZero is a reproduction of DeepSeek R1 Zero on the countdown and multiplication tasks. Built upon veRL.

(Mini-R1: Philipp reproduced the R1 "aha moment" on countdown as well. Built upon trl.)
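The countdown task used by both reproductions rewards a completion only if its proposed equation uses exactly the given numbers and evaluates to the target. A hedged sketch of such a rule-based reward, assuming the equation has already been extracted from the completion (the helper names are illustrative):

```python
import ast
import operator

# Allowed binary operators for the countdown arithmetic.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def safe_eval(node):
    """Evaluate a restricted arithmetic AST (numbers and + - * / only)."""
    if isinstance(node, ast.Expression):
        return safe_eval(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        return OPS[type(node.op)](safe_eval(node.left), safe_eval(node.right))
    raise ValueError("disallowed expression")

def countdown_reward(equation, numbers, target):
    """1.0 iff the equation uses exactly the given numbers and hits the target."""
    try:
        tree = ast.parse(equation, mode="eval")
        used = sorted(n.value for n in ast.walk(tree) if isinstance(n, ast.Constant))
        if used != sorted(numbers):
            return 0.0
        return 1.0 if abs(safe_eval(tree) - target) < 1e-6 else 0.0
    except Exception:
        return 0.0
```

Parsing to an AST instead of calling `eval` keeps the checker safe against arbitrary code in model outputs, and the exact-multiset check on `numbers` blocks the obvious hack of inventing convenient operands.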


A fully open reproduction of DeepSeek-R1.🤗

open-r1


simpleRL-reason reproduces the training of DeepSeek-R1-Zero and DeepSeek-R1 for complex mathematical reasoning, starting from Qwen-2.5-Math-7B (base model) and using only 8K (query, final answer) examples from the original MATH dataset. Built upon OpenRLHF.


Applies RL to DeepSeek-R1-Distill-Qwen-1.5B with 30k examples (from MATH, NuminaMath-CoT, and AIME 1983-2023). Built upon OpenRLHF.


RAGEN is a reproduction of the DeepSeek-R1(-Zero) methods for training agentic models. They run RAGEN on Qwen-2.5-{0.5B, 3B}-{Instruct, None} and DeepSeek-R1-Distill-Qwen-1.5B on the Gym-Sokoban task.📦 Built upon veRL.



Reproduces DeepSeek R1 Zero on a 2K tiny logic-puzzle dataset. Built upon veRL.


Rule-based RL on a large-scale coding dataset with an average of 16 test cases per prompt, synthesized by GPT-4o-mini.
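For coding prompts, the rule-based reward typically comes from executing the generated solution against the prompt's test cases. A minimal sketch of an all-or-nothing variant (signature and sandboxing are simplified assumptions; real pipelines run candidates in isolated subprocesses with timeouts):

```python
def code_reward(candidate_fn, test_cases):
    """Binary rule-based reward: 1.0 only if every (inputs, expected) pair passes.

    candidate_fn: the model's generated solution, compiled to a callable.
    test_cases: list of (args_tuple, expected_output) pairs, e.g. 16 per prompt.
    """
    for args, expected in test_cases:
        try:
            if candidate_fn(*args) != expected:
                return 0.0
        except Exception:  # crashes and timeouts also mean failure
            return 0.0
    return 1.0
```

An all-or-nothing reward discourages solutions that overfit to a subset of visible tests; a fraction-passed variant trades that robustness for denser signal.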


open-r1-multimodal


verifier

general

data (any ratable task could be applied)

message data from long-CoT models (R1/QwQ...)

others
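A general rule-based verifier for ratable tasks usually reduces to extracting the final answer (e.g. from `\boxed{...}` in math outputs) and comparing it to the reference. A minimal sketch, with deliberately simplistic regex and string normalization (real verifiers add symbolic equivalence checks):

```python
import re

def extract_boxed(text):
    """Return the content of the last \\boxed{...} in the text, or None."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def verify(completion, reference):
    """Rule-based check: last boxed answer equals the reference after whitespace stripping."""
    answer = extract_boxed(completion)
    return answer is not None and answer.replace(" ", "") == reference.replace(" ", "")
```

String equality is the weakest link here: `1/2` and `0.5` will not match, which is why production verifiers normalize with a CAS (e.g. SymPy) before comparing.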
