Collect paired preference feedback #396

Open
Tracked by #392
RobotSail opened this issue Dec 7, 2024 · 0 comments
Labels: enhancement (New feature or request)
Milestone: release-1.2

Comments

@RobotSail (Member) commented Dec 7, 2024

Overview

When fine-tuning language models, practitioners often employ preference-tuning techniques to align chatbots with human preferences. A prominent method for this is Direct Preference Optimization (DPO), which optimizes the language model directly on user preference data without the need for an intermediate reward model. This approach can simplify the training pipeline and improve efficiency.
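
For context, the DPO objective from the paper cited under References trains the policy directly on such preference pairs; roughly:

$$
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
$$

where $y_w$ and $y_l$ are the preferred and rejected responses to prompt $x$, $\pi_{\text{ref}}$ is the reference model, and $\beta$ is a scaling hyperparameter. In other words, the training data is exactly the (prompt, chosen, rejected) triples this issue proposes to collect.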

This issue relies on the following tickets:

Objective

Enable the InstructLab UI to collect user preference data by generating and presenting pairs of responses. This data will be stored in persistent storage to support training and refining InstructLab models through DPO.

Requirements

To achieve this objective, the following capabilities need to be added to the UI and backend systems:

  1. Generating and Presenting Paired Responses
  2. Saving Preference Data in the Backend Database

Prompt the user for preference selection

In the InstructLab Chat, we'd like to add the capability to request two responses for a single message and stream both to the client simultaneously. Once both models have finished streaming, the user can select which response they prefer. For example, consider the following screenshot from a popular AI tool:

[Screenshot: two responses shown side by side with a prompt asking which one the user prefers, from a popular AI tool]
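
A rough sketch of how the chat view could request and stream both responses concurrently is below; the endpoint path, request shape, and handler names are placeholders for discussion, not the actual InstructLab UI API:

```typescript
interface StreamResult {
  label: 'A' | 'B';
  text: string;
}

// Stream one completion, forwarding chunks to the UI as they arrive.
// '/api/chat' is an assumed endpoint; the real UI would call its existing chat backend.
async function streamCompletion(
  label: 'A' | 'B',
  prompt: string,
  onChunk: (label: 'A' | 'B', chunk: string) => void,
): Promise<StreamResult> {
  const res = await fetch('/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt, stream: true }),
  });
  if (!res.body) throw new Error('response has no body to stream');

  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let text = '';
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    const chunk = decoder.decode(value, { stream: true });
    text += chunk;
    onChunk(label, chunk); // append to the corresponding response pane (A or B)
  }
  return { label, text };
}

// Kick off both generations at once; the preference prompt is shown only after
// both streams have completed.
async function generatePair(
  prompt: string,
  onChunk: (label: 'A' | 'B', chunk: string) => void,
): Promise<StreamResult[]> {
  return Promise.all([
    streamCompletion('A', prompt, onChunk),
    streamCompletion('B', prompt, onChunk),
  ]);
}
```

Once both promises resolve, the UI can reveal the A/B preference selector.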

This feedback prompt could be triggered in one of two ways:

  1. We randomly prompt users for this feedback X% of the time (maybe 5% or 10%, depending on DAU).
  2. Give the user a "re-roll" button to regenerate a response they don't like, after which they would be asked whether the second response was better or worse than the original.

Saving these data pairs in the backend database

To store this data for effective DPO training, we need to record the following information for every instance of this event (one possible record shape is sketched after the list):

  • both paired responses (A, B)
  • user's preference (A or B)
  • the user's original message for which the responses were generated
  • conversation ID
  • model ID
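
Below is one possible shape for such a record, sketched in TypeScript; the field names and the timestamp are illustrative assumptions, not a settled schema:

```typescript
// Hypothetical record for a single preference event.
interface PreferenceRecord {
  conversationId: string; // conversation in which the pair was generated
  modelId: string;        // model that produced both responses
  userMessage: string;    // the user's original message the responses answer
  responseA: string;      // first paired response
  responseB: string;      // second paired response
  preferred: 'A' | 'B';   // the user's selection
  createdAt: string;      // ISO-8601 timestamp (not in the list above; added as an assumption)
}
```

Each stored record maps directly to a (prompt, chosen, rejected) triple for DPO training.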

References

To learn more about DPO, check out the paper: Direct Preference Optimization: Your Language Model is Secretly a Reward Model

@vishnoianil vishnoianil added the enhancement New feature or request label Dec 17, 2024
@vishnoianil vishnoianil added this to UI Dec 17, 2024
@vishnoianil vishnoianil moved this to Backlog in UI Dec 17, 2024
@vishnoianil vishnoianil added this to the release-1.2 milestone Dec 17, 2024