Overview
In the realm of fine-tuning language models, practitioners often employ preference tuning techniques to align chatbots with human preferences. A prominent method for this is Direct Preference Optimization (DPO), which optimizes the language model directly on user preference data without the need for an intermediate reward model. This approach can simplify the training pipeline and improve efficiency.
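For context, the DPO objective from the paper cited in the References below trains the policy directly on triples of a prompt, a preferred response, and a dispreferred response:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\, \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```

Here x is the user's message, y_w and y_l are the preferred and rejected responses, π_ref is a frozen reference model, and β controls how far the policy may drift from the reference. Collecting data in exactly this (prompt, preferred, rejected) shape is what this issue proposes.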
This issue relies on the following tickets:
Objective
Enable the InstructLab UI to collect user preference data by generating and presenting pairs of responses. This data will be stored persistently so it can be used to train and refine the InstructLab models through DPO.
Requirements
To achieve this objective, the following capabilities need to be added to the UI and backend systems:
Generating and Presenting Paired Responses
Saving Preference Data in the Backend Database
Prompt the user for preference selection
In the InstructLab Chat, we'd like to add the capability to request two responses and stream both to the client simultaneously (a rough code sketch follows the list below). Once both models have finished streaming, the user would be able to select which response they prefer. For example, consider the following screenshot from a popular AI tool:
This feedback prompt could be triggered in one of two ways:
Randomly prompt users for this feedback on X% of chats (perhaps 5% or 10%, depending on DAU)
Give the user a "re-roll" button to regenerate a response they don't like; afterwards, the user would be prompted to indicate whether the second response was better or worse than the original.
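As a rough illustration of the dual-response flow, the sketch below requests two completions and streams both concurrently, resolving once both finish so the preference prompt can be shown. The /api/chat endpoint, its slot query parameter, and the request body shape are assumptions for illustration only, not the actual InstructLab UI API.

```typescript
// Minimal sketch of requesting two completions and streaming both at once.
// Endpoint path, query parameters, and request shape are assumed for illustration.

interface StreamResult {
  label: "A" | "B";
  text: string;
}

async function streamCompletion(
  endpoint: string,
  prompt: string,
  label: "A" | "B",
  onToken: (label: "A" | "B", token: string) => void
): Promise<StreamResult> {
  const response = await fetch(endpoint, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt, stream: true }),
  });
  if (!response.body) throw new Error("No response body to stream");

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let text = "";

  // Read chunks as they arrive and forward each one to the UI callback.
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    const token = decoder.decode(value, { stream: true });
    text += token;
    onToken(label, token);
  }
  return { label, text };
}

// Fire both requests together and wait until both streams complete;
// at that point the UI can show the "Which response do you prefer?" prompt.
async function generatePair(prompt: string) {
  const onToken = (label: "A" | "B", token: string) =>
    console.log(`[${label}]`, token); // in the UI this would append to the chat bubble

  const [a, b] = await Promise.all([
    streamCompletion("/api/chat?slot=a", prompt, "A", onToken),
    streamCompletion("/api/chat?slot=b", prompt, "B", onToken),
  ]);
  return { responseA: a.text, responseB: b.text };
}
```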
Saving these data pairs in the backend database
In order to store this data for effective DPO, we need to record the following information for every instance of this event:
both paired responses (A, B)
user's preference (A or B)
the user's original message for which the responses were generated
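A possible shape for that record is sketched below. The field names and the /api/preferences route are illustrative assumptions, not an existing InstructLab schema.

```typescript
// Sketch of the record we might persist for each preference event.
// Field names and the /api/preferences route are assumptions for illustration.

interface PreferencePair {
  id: string;           // unique id for this feedback event
  userMessage: string;  // the user's original message
  responseA: string;    // first generated response
  responseB: string;    // second generated response
  preferred: "A" | "B"; // which response the user selected
  createdAt: string;    // ISO timestamp, useful for later dataset cuts
}

// Example of posting the record to a hypothetical backend route.
async function savePreference(pair: PreferencePair): Promise<void> {
  await fetch("/api/preferences", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(pair),
  });
}
```

Keeping the original message alongside both responses means each stored row can be converted directly into a (prompt, chosen, rejected) training triple for DPO.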
References
To learn more about DPO, check out the paper: Direct Preference Optimization: Your Language Model is Secretly a Reward Model