[CGPO] CGPO Trainer (single task single objective) #2190

Draft · wants to merge 34 commits into base: main
Conversation

@gaetanlop (Contributor) commented Oct 6, 2024

What does this PR do?

This PR introduces the CGPOTrainer to the trl library, as described in the CGPO paper.
The current implementation follows section 4.1 of the paper (CGPO in Single Task with Single Objective) and adds the three policy optimization methods described there: codpo, crraft, and crpg.
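All three optimizers share the paper's core idea: sample completions for a prompt, run them past the constraint judges, and only optimize on completions that satisfy every constraint. A minimal sketch of that filtering step, with purely illustrative names (`filter_by_judges` and the toy judges below are not the actual trl API):

```python
# Hypothetical sketch of the constraint-filtering step shared by the
# CGPO policy optimizers (codpo, crraft, crpg). The function and judge
# names are illustrative, not part of the trl implementation.

def filter_by_judges(prompt, completions, judges):
    """Keep only completions that satisfy every constraint judge."""
    satisfying = []
    for completion in completions:
        # A judge returns True when the completion meets its constraint.
        if all(judge(prompt, completion) for judge in judges):
            satisfying.append(completion)
    return satisfying

# Toy rule-based judges for illustration:
judges = [
    lambda p, c: len(c) > 0,         # non-empty completion
    lambda p, c: "sorry" not in c,   # e.g. an anti-refusal rule
]

kept = filter_by_judges("2+2=?", ["4", "", "sorry, no"], judges)
print(kept)  # ['4']
```

The three methods then differ in how they use the surviving completions (e.g. reward-ranked finetuning vs. a regularized policy gradient), which is what the trainer's policy-optimization modes select between.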

This PR was initially discussed in #2156.
This PR depends on #2159 which introduces the MOJs.

I plan to address the "Multi-Task with Multi-Objectives" part (section 4.2) in a separate PR once this one has been successfully merged and tested.

Note: This PR relies on the DataCollatorForChatML which has a small bug. Refer to #2169.

Before submitting

  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines.
  • Did you write any new necessary tests?

To do

  • Log metrics
  • Mini batch in crpg policy optimization
  • Approval of 🤝 Mixture of judges #2159
  • Add documentation
  • Add a more comprehensive test suite
  • Handle case in CODPO where no completions satisfy all constraints.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@gaetanlop gaetanlop marked this pull request as draft October 6, 2024 19:29
@gaetanlop gaetanlop mentioned this pull request Oct 6, 2024
@kashif kashif added the ✨ enhancement New feature or request label Oct 6, 2024
@gaetanlop gaetanlop marked this pull request as ready for review October 11, 2024 01:27
@gaetanlop gaetanlop marked this pull request as draft October 11, 2024 01:57
@gaetanlop (Contributor, Author) commented Oct 11, 2024

Some judges in the CGPOTrainer need gold answers or metadata to make decisions. The gold answer can be compared with the policy's output, or it can carry metadata for rule-based judges (examples of such metadata are in Table 4 of the paper).

I’m using the DataCollatorForChatML in the CGPOTrainer right now. Should we create a new dataset format with a prompt, completion, and gold answer in a separate PR, or should we modify DataCollatorForChatML to return the non-tokenized gold answer along with the current parameters?
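To make the two options concrete, here is a sketch of what a dataset row with a gold answer could look like, and how a rule-based judge might consume it. The field name `gold_answer` and the `exact_match_judge` helper are suggestions for discussion, not an existing trl convention:

```python
# Hypothetical example of a dataset row carrying a gold answer alongside
# the prompt/completion messages. "gold_answer" is a proposed field name,
# not an agreed trl format.
example = {
    "prompt": [{"role": "user", "content": "What is 2 + 2?"}],
    "completion": [{"role": "assistant", "content": "4"}],
    "gold_answer": "4",  # kept untokenized so judges can use it directly
}

# A rule-based judge could then compare the policy output against it:
def exact_match_judge(policy_output: str, gold_answer: str) -> bool:
    return policy_output.strip() == gold_answer.strip()

print(exact_match_judge("4 ", example["gold_answer"]))  # True
```

Under the second option, DataCollatorForChatML would pass `gold_answer` through unchanged rather than tokenizing it with the rest of the batch.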

What do you prefer? @qgallouedec @kashif
