[CGPO] CGPO Trainer (single task single objective) #2190

Draft · wants to merge 34 commits into base: main
Conversation

@gaetanlop (Contributor) commented Oct 6, 2024

What does this PR do?

This PR introduces the CGPOTrainer to the trl library, as described in the CGPO paper.
The current implementation follows section 4.1 of the paper (CGPO in Single Task with Single Objective) and adds the three policy optimization methods described there: codpo, crraft, and crpg.
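All three optimizers share the paper's core idea: sample completions for a prompt, run them past the constraint judges, and only optimize on completions that satisfy every constraint. A minimal sketch of that filtering step, with purely illustrative names (`filter_by_judges` and the toy judges below are not the actual trl API):

```python
# Hypothetical sketch of the constraint-filtering step shared by the
# CGPO policy optimizers (codpo, crraft, crpg). The function and judge
# names are illustrative, not part of the trl implementation.

def filter_by_judges(prompt, completions, judges):
    """Keep only completions that satisfy every constraint judge."""
    satisfying = []
    for completion in completions:
        # A judge returns True when the completion meets its constraint.
        if all(judge(prompt, completion) for judge in judges):
            satisfying.append(completion)
    return satisfying

# Toy rule-based judges for illustration:
judges = [
    lambda p, c: len(c) > 0,         # non-empty completion
    lambda p, c: "sorry" not in c,   # e.g. an anti-refusal rule
]

kept = filter_by_judges("2+2=?", ["4", "", "sorry, no"], judges)
print(kept)  # ['4']
```

The three methods then differ in how they use the surviving completions (e.g. reward-ranked finetuning vs. a regularized policy gradient), which is what the trainer's policy-optimization modes select between.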

This PR was initially discussed in #2156.
This PR depends on #2159 which introduces the MOJs.

I plan to address the "Multi-Task with Multi-Objectives" part (section 4.2) in a separate PR once this one has been successfully merged and tested.

Note: This PR relies on the DataCollatorForChatML which has a small bug. Refer to #2169.

Before submitting

  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines.
  • Did you write any new necessary tests?

To do

  • Log metrics
  • Mini batch in crpg policy optimization
  • Approval of 🤝 Mixture of judges #2159
  • Add documentation
  • Add a more comprehensive test suite
  • Handle case in CODPO where no completions satisfy all constraints.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@gaetanlop gaetanlop marked this pull request as draft October 6, 2024 19:29
@gaetanlop gaetanlop mentioned this pull request Oct 6, 2024
@kashif kashif added the ✨ enhancement New feature or request label Oct 6, 2024
@gaetanlop gaetanlop marked this pull request as ready for review October 11, 2024 01:27
@gaetanlop gaetanlop marked this pull request as draft October 11, 2024 01:57
@gaetanlop (Contributor, Author) commented Oct 11, 2024

Some judges in the CGPOTrainer need gold answers or metadata to make decisions. The gold answer can be compared with the policy's output, or it can carry metadata for rule-based judges (examples of such metadata are in Table 4 of the paper).

I’m using the DataCollatorForChatML in the CGPOTrainer right now. Should we create a new dataset format with a prompt, completion, and gold answer in a separate PR, or should we modify DataCollatorForChatML to return the non-tokenized gold answer along with the current parameters?
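To make the two options concrete, here is a sketch of what a dataset row with a gold answer could look like, and how a rule-based judge might consume it. The field name `gold_answer` and the `exact_match_judge` helper are suggestions for discussion, not an existing trl convention:

```python
# Hypothetical example of a dataset row carrying a gold answer alongside
# the prompt/completion messages. "gold_answer" is a proposed field name,
# not an agreed trl format.
example = {
    "prompt": [{"role": "user", "content": "What is 2 + 2?"}],
    "completion": [{"role": "assistant", "content": "4"}],
    "gold_answer": "4",  # kept untokenized so judges can use it directly
}

# A rule-based judge could then compare the policy output against it:
def exact_match_judge(policy_output: str, gold_answer: str) -> bool:
    return policy_output.strip() == gold_answer.strip()

print(exact_match_judge("4 ", example["gold_answer"]))  # True
```

Under the second option, DataCollatorForChatML would pass `gold_answer` through unchanged rather than tokenizing it with the rest of the batch.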

What do you prefer? @qgallouedec @kashif
