
Add TRL GRPO Reasoning with Advanced Reward notebook #319


Open · wants to merge 3 commits into main

Conversation

@behroozazarkhalili commented:

Summary

This notebook demonstrates GRPO (Group Relative Policy Optimization) fine-tuning for mathematical reasoning using the TRL library with advanced reward mechanisms.

Key Features

  • Multi-reward training with 4 different reward functions (a minimal sketch follows this list)
  • Advanced progress tracking with Hugging Face-style interactive tables
  • Mathematical reasoning on the GSM8K dataset
  • Memory-efficient training with 4-bit quantization and LoRA
  • Structured output generation with reasoning sections
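To make the multi-reward idea concrete, here is a minimal sketch of what such a setup can look like. It is a hypothetical illustration, not the notebook's exact code: it assumes the standard (plain-string) dataset format, and relies on the fact that TRL's `GRPOTrainer` accepts a list of reward functions and forwards extra dataset columns (such as GSM8K's `answer`) to them as keyword arguments.

```python
import re

# Hypothetical sketch of two of the four reward functions (not the
# notebook's exact code). Each function receives the generated
# completions plus any dataset columns as keyword arguments, and
# returns one score per completion.

def format_reward(completions, **kwargs):
    # Reward completions that follow the <reasoning>/<answer> structure.
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    return [1.0 if re.search(pattern, c, re.DOTALL) else 0.0 for c in completions]

def correctness_reward(completions, answer, **kwargs):
    # `answer` is the dataset column, passed through automatically by TRL.
    def extract(c):
        m = re.search(r"<answer>(.*?)</answer>", c, re.DOTALL)
        return m.group(1).strip() if m else ""
    return [2.0 if extract(c) == a.strip() else 0.0 for c, a in zip(completions, answer)]

# Passed to the trainer as reward_funcs=[format_reward, correctness_reward, ...]
```

The remaining reward functions in the notebook would follow the same signature.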

Requirements Checklist

  • Notebook filename is lowercase
  • Added to _toctree.yml
  • Added to index.md
  • Author attribution included
  • Non-informative outputs removed
  • No empty code cells
  • No custom images requiring upload

@merveenoyan @stevhliu

Contributed by: Behrooz Azarkhalili

This notebook demonstrates how to use TRL (Transformers Reinforcement Learning)
with GRPO (Group Relative Policy Optimization) for reasoning tasks with
advanced reward mechanisms.

- Added notebook with proper lowercase filename
- Updated _toctree.yml and index.md
- Added proper author attribution
- Cleaned non-informative outputs

Contributed by: Behrooz Azarkhalili

- Remove torch and accelerate from installation (dependencies of TRL)
- Remove pad token check (handled automatically)
- Restore num_generations to default value (8)
- Remove remove_unused_columns parameter (false by default)
- Remove processing_class parameter (loaded automatically)
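Taken together, these simplifications point at a setup close to TRL's own GRPO quickstart. The sketch below shows that trimmed shape; the dataset, model id, and reward function are placeholders, not the notebook's values.

```python
# pip install trl  (torch and accelerate are pulled in as dependencies)
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")  # placeholder dataset

def reward_len(completions, **kwargs):
    # Toy reward from the TRL quickstart: prefer ~20-character completions.
    return [-abs(20 - len(completion)) for completion in completions]

# num_generations is left at its default (8), and remove_unused_columns
# already defaults to False in GRPOConfig, so neither is set explicitly.
training_args = GRPOConfig(output_dir="grpo-model")

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",  # placeholder model id
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
    # No processing_class passed: the tokenizer is loaded automatically,
    # and the pad token is handled without an explicit check.
)
trainer.train()
```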
@sergiopaniego (Member) left a comment:

Thanks for the addition! 😄
We already have a pretty similar example "Post training an LLM for reasoning with GRPO in TRL".
The idea of the repo is to have end-to-end recipes with extended explanations, so I'd suggest:

  • Extending the explanations throughout the recipe.
  • Linking the previous example and making a clear distinction between the two at the beginning; otherwise, it could confuse a reader looking for a GRPO example.

The recipes can be opened in Colab and possibly run there, so it would also be nice to keep that in mind. For example, os.environ["CUDA_VISIBLE_DEVICES"] = "1" won't work in Colab, since there is only one GPU.
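A Colab-safe version of that device setup could look like the sketch below (an illustration of the suggestion, not the notebook's final code):

```python
import torch

# Replace the hardcoded os.environ["CUDA_VISIBLE_DEVICES"] = "1" with a
# simple auto-detect: Colab exposes a single GPU, visible as device 0.
if torch.cuda.is_available():
    device = torch.device("cuda:0")
    print(f"Using GPU: {torch.cuda.get_device_name(device)}")
else:
    device = torch.device("cpu")
    print("No GPU detected; falling back to CPU.")
```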

@@ -7,6 +7,7 @@ applications and solving various machine learning tasks using open-source tools

Check out the recently added notebooks:

+- [TRL GRPO Reasoning with Advanced Reward](trl_grpo_reasoning_advanced_reward)
@sergiopaniego (Member) commented:

You can remove the last entry, since we aim to keep only the last 5 here.

@behroozazarkhalili (Author) replied on Jul 29, 2025:


Hi @sergiopaniego, I just added the notes you mentioned. I hope the extension and the differences between the two versions make sense now! 😊

…O recipe

- Add direct link to existing HuggingFace GRPO cookbook example
- Fix CUDA device setting for Colab compatibility (auto-detect instead of hardcoded)
- Add comprehensive explanations throughout all recipe sections
- Enhance with detailed comparison table showing differences from basic example
- Improve GPU setup with memory information and fallback instructions
- Add detailed LoRA configuration explanations and parameter analysis (a configuration sketch follows this list)
- Expand dataset preparation with GSM8K background and format details
- Detail multi-reward system design for mathematical reasoning approach
- Optimize training configuration with Colab-specific memory settings
- Enhance testing and evaluation with detailed response analysis
- Make notebook fully end-to-end recipe focused for cookbook standards
- Address all reviewer feedback comprehensively for cookbook contribution
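As a rough illustration of the LoRA and memory-saving settings referenced above, a 4-bit + LoRA configuration typically looks like this (hypothetical hyperparameter values, using the standard bitsandbytes and peft APIs rather than the notebook's exact configuration):

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization to cut the base model's memory footprint.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapters so only a small set of low-rank matrices is trained.
lora_config = LoraConfig(
    r=16,            # rank of the update matrices (hypothetical value)
    lora_alpha=32,   # scaling factor (hypothetical value)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```

In this pattern, bnb_config would be passed to AutoModelForCausalLM.from_pretrained(..., quantization_config=bnb_config) and lora_config to GRPOTrainer via its peft_config argument.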