
Process-supervised RM Trainer #2127

Draft · gaetanlop wants to merge 47 commits into main
Conversation

@gaetanlop (Contributor) commented Sep 26, 2024

What does this PR do?

Adding support for process-supervised reward training to TRL, as requested in #2110.

List of papers using PRMs: [1], [2], [3], [4]...

Fixes #2110

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines.
  • Did you write any new necessary tests?

Who can review?

@lewtun @kashif

@gaetanlop gaetanlop marked this pull request as draft September 26, 2024 03:15
@lewtun (Member) commented Sep 26, 2024

This is awesome @gaetanlop ! Would you like some early feedback on the PR or would you prefer I wait a bit until it's more polished?

@gaetanlop (Contributor, Author) commented:
Hey @lewtun, thank you for the message. Currently, the only files that are more or less ready are prm_trainer.py and prm_config.py. The rest are just placeholders that I haven’t had the opportunity to work on yet.

Implementing a PRM seems to be pretty straightforward: it is essentially a token classification task where only the prediction for the last token of each step gets assigned a label, and all other tokens are ignored during loss calculation (see the sketch below the column list).

If the dataset isn’t pre-tokenized, I assume it should contain the following columns:

  • prompt: Either a string or a list of past messages
  • steps: A list of strings
  • labels: A list of integers giving the label associated with each step
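
As an illustration of the labeling scheme above, here is a minimal sketch (this is not the PR's actual code; the tokenizer choice and the `"\n"` separator are assumptions):

```python
# Minimal sketch of the labeling scheme described above (illustrative only,
# not this PR's implementation; the tokenizer and "\n" separator are assumptions).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")

example = {
    "prompt": "The sky is",
    "steps": [", let me think...", "blue."],
    "labels": [0, 1],  # one integer label per step
}

separator_ids = tokenizer.encode("\n", add_special_tokens=False)
input_ids = tokenizer.encode(example["prompt"], add_special_tokens=False)
token_labels = [-100] * len(input_ids)  # prompt tokens are ignored by the loss

for step, step_label in zip(example["steps"], example["labels"]):
    step_ids = tokenizer.encode(step, add_special_tokens=False) + separator_ids
    input_ids += step_ids
    # only the last token of each step carries a label; -100 is ignored
    # by PyTorch's cross-entropy loss
    token_labels += [-100] * (len(step_ids) - 1) + [step_label]
```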

Are you aware of an HF dataset to train PRMs for the example file? Also, how can I add a new subset to the trl-internal-testing/zen dataset to support stepwise reward models for the unit test of the prm_trainer?

Thanks again for your time!

@gaetanlop gaetanlop marked this pull request as ready for review September 28, 2024 18:34
@gaetanlop (Contributor, Author) commented Sep 28, 2024

PR ready for review. I have changed the naming convention I used before (prm) to the naming suggested in #2110 (stepwise).

Tests: I created a dummy_dataset, but we should add a subset to trl-internal-testing/zen as done in other scripts.
Example: The example currently uses a placeholder for the dataset name since, to the best of my knowledge, TRL hasn't released a dataset for stepwise reasoning on the Hub. We should add this too.

@lewtun (Member) left a comment

Thank you for the very clean PR @gaetanlop - this looks great! I've left some minor suggestions regarding the structure, but aside from that, and having a smallish dataset in the right format so we can sanity-check that the accuracy goes up, the loss goes down, etc., I think this is quite close to being ready.

docs/source/_toctree.yml (outdated, resolved)
docs/source/stepwise_reward_trainer.mdx (resolved)
docs/source/dataset_formats.mdx (outdated, resolved)
Full training:
python examples/scripts/stepwise_reward_modeling.py \
--model_name_or_path Qwen/Qwen2-0.5B-Instruct \
--dataset_name trl-lib/PLACEHOLDER \
(Member) commented:

What do you think about picking a subset from PRM800K to test that everything works?

You could create a subset in the expected format and then we can merge it with trl-lib/zen :)

@gaetanlop (Contributor, Author) commented Oct 1, 2024

I made two pull requests to trl-lib/zen (https://huggingface.co/datasets/trl-lib/zen/discussions/3) to add the subsets to trl-lib.


trl/trainer/stepwise_reward_config.py (outdated, resolved)
trl/trainer/stepwise_reward_config.py (outdated, resolved)
trl/trainer/stepwise_reward_trainer.py (outdated, resolved)
trl/trainer/stepwise_reward_trainer.py (outdated, resolved)
@gaetanlop gaetanlop changed the title from "[DRAFT] Process-supervised RM Trainer" to "Process-supervised RM Trainer" on Oct 1, 2024
@gaetanlop (Contributor, Author) commented Oct 1, 2024

Thanks for looking at this @lewtun. It seems trl-internal-testing/zen is the dataset you use for testing. I have opened a PR to trl-lib/zen; should I also open one against trl-internal-testing/zen to add 19 samples of PRM800K for testing, or are you handling that on your side (they look like the same dataset)?

@yiyepiaoling0715 commented Nov 16, 2024

Hi, good job! When will this be merged?

@qgallouedec (Member) commented:
@gaetanlop #2148 is merged, let's move on to this one now. Are you still interested in contributing?

Comment on lines +253 to +260
### Stepwise preference

A stepwise preference dataset is similar to an unpaired preference dataset but instead of having a single `"completion"` and `"label"`, it includes a `"completion"` column that splits the completion into a list of steps and a `"labels"` column indicating whether each step is correct or not.

```python
steps_preference_example = {"prompt": "The sky is", "completion": [", let me think...", "blue."], "labels": [False, True]}
```


Suggested change
### Stepwise preference
A stepwise preference dataset is similar to an unpaired preference dataset but instead of having a single `"completion"` and `"label"`, it includes a `"completion"` column that splits the completion into a list of steps and a `"labels"` column indicating whether each step is correct or not.
```python
steps_preference_example = {"prompt": "The sky is", "completion": [", let me think...", "blue."], "labels": [False, True]}
```

Remove in favour of "Stepwise supervision"


Comment on lines +155 to +156
if type(args) is not StepwiseRewardConfig:
raise ValueError(f"args should be an instance of `StepwiseRewardConfig` but got {type(args)}")

Suggested change
if type(args) is not StepwiseRewardConfig:
raise ValueError(f"args should be an instance of `StepwiseRewardConfig` but got {type(args)}")

Comment on lines +278 to +283
@article{uesato2022solving,
title={Solving math word problems with process-and outcome-based feedback},
author={Uesato, Jonathan and Kushman, Nate and Kumar, Ramana and Song, Francis and Siegel, Noah and Wang, Lisa and Creswell, Antonia and Irving, Geoffrey and Higgins, Irina},
journal={arXiv preprint arXiv:2211.14275},
year={2022}
}"""

Suggested change
@article{uesato2022solving,
title={Solving math word problems with process-and outcome-based feedback},
author={Uesato, Jonathan and Kushman, Nate and Kumar, Ramana and Song, Francis and Siegel, Noah and Wang, Lisa and Creswell, Antonia and Irving, Geoffrey and Higgins, Irina},
journal={arXiv preprint arXiv:2211.14275},
year={2022}
}"""
@article{uesato2022solving,
title = {Solving Math Word Problems With Process- and Outcome-Based Feedback},
author = {Uesato, Jonathan and Kushman, Nate and Kumar, Ramana and Song, Francis and Siegel, Noah and Wang, Lisa and Creswell, Antonia and Irving, Geoffrey and Higgins, Irina},
year = 2022,
journal = {arXiv preprint arXiv:2211.14275}
}"""

@lewtun (Member) left a comment

Thanks for iterating @gaetanlop, and apologies for the slow review on this one 🙈! Overall this is looking really good, and with some minor changes I think it's close to being ready.


> Recent work has shown that asking language models to generate reasoning steps improves performance on many reasoning tasks. When moving beyond prompting, this raises the question of how we should supervise such models: outcome-based approaches which supervise the final result, or process-based approaches which supervise the reasoning process itself? Differences between these approaches might naturally be expected not just in final-answer errors but also in reasoning errors, which can be difficult to detect and are problematic in many real-world domains such as education. We run the first comprehensive comparison between process- and outcome-based approaches trained on a natural language task, GSM8K. We find that pure outcome-based supervision produces similar final-answer error rates with less label supervision. However, for correct reasoning steps we find it necessary to use process-based supervision or supervision from learned reward models that emulate process-based feedback. In total, we improve the previous best results from 16.8% → 12.7% final-answer error and 14.0% → 3.4% reasoning error among final-answer-correct solutions.

This post-training method was contributed by [Gaetan Lopez](https://github.com/gaetanlop), [Lewis Tunstall](https://huggingface.co/lewtun) and [Quentin Gallouédec](https://huggingface.co/qgallouedec)
(Member) commented:

Feel free to remove me since you did all the work on the implementation side :)


## Overview

Process-supervised Reward Models (PRMs) were proposed in [Solving math word problems with process- and outcome-based feedback](https://arxiv.org/pdf/2211.14275) by Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving and Irina Higgins.
(Member) commented:

nit since we don't need the acronym:

Suggested change
Process-supervised Reward Models (PRMs) were proposed in [Solving math word problems with process- and outcome-based feedback](https://arxiv.org/pdf/2211.14275) by Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving and Irina Higgins.
Stepwise or process reward models were proposed in [Solving math word problems with process- and outcome-based feedback](https://arxiv.org/pdf/2211.14275) by Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving and Irina Higgins.


The [`StepwiseRewardTrainer`] is a wrapper around the [`Trainer`] class. It needs two parameters to be set via the [`StepwiseRewardConfig`] namely:
* `max_length`: controls the maximum length of the sequences where a sequence is composed of the prompt and the concatenation of each completion steps.
* `step_separator`: indicate the separator used to separate each step of the reasoning process. By default, it is set to `"n"`.
(Member) commented:

shouldn't this be on new lines?

Suggested change
* `step_separator`: indicate the separator used to separate each step of the reasoning process. By default, it is set to `"n"`.
* `step_separator`: indicates the separator used to separate each step of the reasoning process. By default, it is set to `"\n"`.
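
For illustration, a minimal sketch of setting these two parameters (assuming the `StepwiseRewardConfig` name proposed in this PR; whether it carries the standard `TrainingArguments` fields is an assumption):

```python
from trl import StepwiseRewardConfig  # config name as proposed in this PR

# a sketch, assuming StepwiseRewardConfig extends transformers.TrainingArguments
config = StepwiseRewardConfig(
    output_dir="stepwise-rm",
    max_length=2048,      # prompt + concatenated steps are truncated to this length
    step_separator="\n",  # appended after each reasoning step
)
```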

"prompt": [
"Hi, how are you?",
],
"completion": [

Suggested change
"completion": [
"completions": [


model = AutoModelForTokenClassification.from_pretrained("Qwen/Qwen2-0.5B-Instruct", num_labels=2)

train_dataset = Dataset.from_dict(
(Member) commented:

WDYT about using a math example like the one here? 76dbb1a#diff-9401f539a830b066fdca010e21b44ba7b439404436e3ed18c5dbea9dff582bf5R83-R88

I personally find this a bit easier to follow


## Expected dataset format

The dataset should be formatted as a [Name to find](dataset_formats#[Name to find]) which implies that the dataset should contain the following columns: `prompt`, `completion` and `labels` where `completion` contains a list of reasoning steps and `labels` a list of booleans indicating the correctness of each step.

Suggested change
The dataset should be formatted as a [Name to find](dataset_formats#[Name to find]) which implies that the dataset should contain the following columns: `prompt`, `completion` and `labels` where `completion` contains a list of reasoning steps and `labels` a list of booleans indicating the correctness of each step.
The dataset should be formatted as a [Stepwise Supervision](dataset_formats#stepwise-supervision) dataset, which implies that it should contain the following columns: `prompt`, `completions` and `labels`, where `completions` contains a list of reasoning steps and `labels` a list of booleans or floats indicating the correctness of each step.
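
For reference, a minimal example row in this format (adapting the example from the dataset docs above to the suggested `completions` column name):

```python
stepwise_example = {
    "prompt": "The sky is",
    "completions": [", let me think...", "blue."],
    "labels": [False, True],  # one correctness flag per step
}
```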

Full training:
python examples/scripts/stepwise_reward_modeling.py \
--model_name_or_path Qwen/Qwen2-0.5B-Instruct \
--dataset_name trl-lib/openai-prm800k-15k \

Suggested change
--dataset_name trl-lib/openai-prm800k-15k \
--dataset_name trl-lib/prm800k \

LoRA:
python examples/scripts/stepwise_reward_modeling.py \
--model_name_or_path Qwen/Qwen2-0.5B-Instruct \
--dataset_name trl-lib/openai-prm800k-15k \

Suggested change
--dataset_name trl-lib/openai-prm800k-15k \
--dataset_name trl-lib/prm800k \

model_config.model_name_or_path, trust_remote_code=model_config.trust_remote_code, use_fast=True
)
model = AutoModelForTokenClassification.from_pretrained(
model_config.model_name_or_path, num_labels=3, trust_remote_code=model_config.trust_remote_code, **model_kwargs
(Member) commented:

With the new format, shouldn't this be just two labels?

Suggested change
model_config.model_name_or_path, num_labels=3, trust_remote_code=model_config.trust_remote_code, **model_kwargs
model_config.model_name_or_path, num_labels=2, trust_remote_code=model_config.trust_remote_code, **model_kwargs

--max_length 2048

LoRA:
python examples/scripts/stepwise_reward_modeling.py \
(Member) commented:

If you have some compute, can you share some WandB logs from running these scripts? Otherwise I can run them myself :)
