Overview
In the realm of fine-tuning language models, practitioners often employ preference tuning techniques to align chatbots with human preferences. A prominent method for this is Direct Preference Optimization (DPO), which optimizes the language model directly on user preference data without the need for an intermediate reward model. This approach can simplify the training pipeline and improve efficiency.
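For context, the DPO objective from the paper cited in the References below trains the policy directly on triples of a prompt, a preferred response, and a dispreferred response:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\, \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```

Here x is the user's message, y_w and y_l are the preferred and rejected responses, π_ref is a frozen reference model, and β controls how far the policy may drift from the reference. Collecting data in exactly this (prompt, preferred, rejected) shape is what this issue proposes.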
This issue relies on the following tickets:
Objective
Enable the InstructLab UI to collect user preference data by generating and presenting pairs of responses. This data will be stored persistently so it can be used to train and refine the InstructLab models through DPO.
Requirements
To achieve this objective, the following capabilities need to be added to the UI and backend systems:
Generating and Presenting Paired Responses
Saving Preference Data in the Backend Database
Prompt the user for preference selection
In the InstructLab Chat, we'd like to add the capability to request two responses and stream both to the client simultaneously (a rough code sketch follows the list below). Once both models have finished streaming, the user would be able to select which response they prefer. For example, consider the following screenshot from a popular AI tool:
This feedback prompt could be triggered in one of two ways:
Randomly prompt users for this feedback on X% of chats (perhaps 5% or 10%, depending on DAU)
Give the user a "re-roll" button to regenerate a response they don't like; afterwards, the user would be prompted to indicate whether the second response was better or worse than the original.
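As a rough illustration of the dual-response flow, the sketch below requests two completions and streams both concurrently, resolving once both finish so the preference prompt can be shown. The /api/chat endpoint, its slot query parameter, and the request body shape are assumptions for illustration only, not the actual InstructLab UI API.

```typescript
// Minimal sketch of requesting two completions and streaming both at once.
// Endpoint path, query parameters, and request shape are assumed for illustration.

interface StreamResult {
  label: "A" | "B";
  text: string;
}

async function streamCompletion(
  endpoint: string,
  prompt: string,
  label: "A" | "B",
  onToken: (label: "A" | "B", token: string) => void
): Promise<StreamResult> {
  const response = await fetch(endpoint, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt, stream: true }),
  });
  if (!response.body) throw new Error("No response body to stream");

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let text = "";

  // Read chunks as they arrive and forward each one to the UI callback.
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    const token = decoder.decode(value, { stream: true });
    text += token;
    onToken(label, token);
  }
  return { label, text };
}

// Fire both requests together and wait until both streams complete;
// at that point the UI can show the "Which response do you prefer?" prompt.
async function generatePair(prompt: string) {
  const onToken = (label: "A" | "B", token: string) =>
    console.log(`[${label}]`, token); // in the UI this would append to the chat bubble

  const [a, b] = await Promise.all([
    streamCompletion("/api/chat?slot=a", prompt, "A", onToken),
    streamCompletion("/api/chat?slot=b", prompt, "B", onToken),
  ]);
  return { responseA: a.text, responseB: b.text };
}
```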
Saving these data pairs in the backend database
In order to store this data for effective DPO, we need to record the following information for every instance of this event:
both paired responses (A, B)
user's preference (A or B)
the user's original message for which the responses were generated
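A possible shape for that record is sketched below. The field names and the /api/preferences route are illustrative assumptions, not an existing InstructLab schema.

```typescript
// Sketch of the record we might persist for each preference event.
// Field names and the /api/preferences route are assumptions for illustration.

interface PreferencePair {
  id: string;           // unique id for this feedback event
  userMessage: string;  // the user's original message
  responseA: string;    // first generated response
  responseB: string;    // second generated response
  preferred: "A" | "B"; // which response the user selected
  createdAt: string;    // ISO timestamp, useful for later dataset cuts
}

// Example of posting the record to a hypothetical backend route.
async function savePreference(pair: PreferencePair): Promise<void> {
  await fetch("/api/preferences", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(pair),
  });
}
```

Keeping the original message alongside both responses means each stored row can be converted directly into a (prompt, chosen, rejected) training triple for DPO.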
References
To learn more about DPO, check out the paper: Direct Preference Optimization: Your Language Model is Secretly a Reward Model