Reward a Language Model with pancakes 🥞
This repository gathers three main modules. They share a common workflow, so any generative model can be trained with either of two main techniques: Reinforcement Learning with PPO (🥞 RLAIF) or the more classical 👨🏼‍🏫 fine-tuning with PEFT techniques. The third module, ⚖️ Toxicity Meter, measures the toxicity of the generative model's responses, whether pre-trained or after the 🥞 or 👨🏼‍🏫 process.

The 🥞 RLAIF module uses reinforcement learning algorithms (specifically PPO) to optimise models in a direction decided by a reward model. The process is similar to RLHF (Reinforcement Learning from Human Feedback) but removes the human component from the loop to automate the process.
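As a rough illustration of the reward signal: the reward model is a text classifier, and its score for the non-hateful class can serve as the PPO reward. The sketch below is not part of `rewardlm`'s API; it only assumes the default classifier and its `nothate`/`hate` labels.

```python
# Illustrative sketch of the RLAIF reward signal (not rewardlm's internal code).
# Assumption: the default reward model exposes the labels 'nothate' and 'hate'.
from transformers import pipeline

classifier = pipeline(
    'text-classification',
    model = 'facebook/roberta-hate-speech-dynabench-r4-target',
    top_k = None,   # return the score of every label
)

response = 'Counter-speech is a constructive way to answer hateful comments.'
scores = classifier([response])[0]
reward = next(s['score'] for s in scores if s['label'] == 'nothate')
print(f'reward = {reward:.3f}')   # higher reward for less hateful responses
```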
To 🥞 Reward a generative LM using the DIALOCONAN dataset:
- Select the generative and reward models you intend to use, along with the other hyperparameters:
```python
import torch
from rewardlm.core.RL.RLModel import RLModel

rlmanager = RLModel(
    model_id = 'EleutherAI/pythia-70m',
    reward_model_id = 'facebook/roberta-hate-speech-dynabench-r4-target',
    optimized = True,       # use 8-bit PEFT
    # log_method = 'wandb',
    bs = 256,
    # force the use of CPU on Apple Silicon devices (mps not supported):
    accelerator_kwargs = {
        'cpu': False if torch.cuda.is_available() else True,
    },
)
```
- Download the original dataset using the built-in preprocessing functions:
```python
from rewardlm.data.data_utils import get_DIALOCONAN_prepro

data = get_DIALOCONAN_prepro(delete_last_assistant_response = True)
dataset = rlmanager.generate_dataset(text = data)
```
- Start the PPO learning algorithm:
```python
history = rlmanager.train_PPO(dataset = dataset)
```
Each generative model can be fine-tuned on the same data used for Reinforcement Learning. In this way, it is possible to compare the results obtained from both techniques.
To fine-tune a generative model using the DIALOCONAN dataset:
- Select the model you intend to use and create a `GenerativeModel` manager for it:
```python
import torch
from rewardlm.core.GenerativeModel import GenerativeModel

model_id = 'facebook/opt-350m'

generator_manager = GenerativeModel(
    model_id,
    load_dtype = '8-bit' if torch.cuda.is_available() else 'fp32',
    # force the use of CPU on Apple Silicon devices (mps not supported):
    accelerator_kwargs = {
        'cpu': False if torch.cuda.is_available() else True,
    },
)
```
- Download the original dataset using the built-in preprocessing functions:
```python
from rewardlm.data.data_utils import get_DIALOCONAN_prepro
from rewardlm.data.CustomDatasets import PromptDataset_CLM

# prompt template wrapping each sample (example value, adapt to your use case)
custom_prompt = 'User: "{prompt}".\nAssistant: '

data = get_DIALOCONAN_prepro()
dataset = PromptDataset_CLM(
    tokenizer = generator_manager.tokenizer,
    text = data,
    custom_prompt = custom_prompt,
)
```
- Start the fine-tuning process:
```python
generator_manager.fine_tune(
    torch_dataset = dataset,
    optimized = True if torch.cuda.is_available() else False,
)
```
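After fine-tuning, a quick generation check helps confirm the model still produces sensible text. The sketch below assumes the manager exposes the underlying Hugging Face model as `.model` (only `.tokenizer` appears in this README), so adapt it to the actual attribute names.

```python
# Quick smoke test of the fine-tuned model (sketch).
# Assumption: `generator_manager.model` is the underlying Hugging Face model;
# only `.tokenizer` is documented above.
prompt = 'User: "How should I reply to hateful comments online?".\nAssistant: '

inputs = generator_manager.tokenizer(prompt, return_tensors = 'pt')
inputs = {k: v.to(generator_manager.model.device) for k, v in inputs.items()}

output_ids = generator_manager.model.generate(**inputs, max_new_tokens = 64)
print(generator_manager.tokenizer.decode(output_ids[0], skip_special_tokens = True))
```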
Toxicity Meter measures the toxicity of a generative LM based on the output of a classifier (RoBERTa for hate speech by default, if no `RewardModel` is specified).
- Select a configuration (or create your own):
```python
from rewardlm.utils import load_config

config = load_config(name = 'RedPajama-INCITE-Chat-3B-v1')
```
- Use the `GenerativeModel` class to get a generation manager:
```python
import torch
from transformers import GenerationConfig
from rewardlm.core.GenerativeModel import GenerativeModel
from rewardlm.ToxicityMeter import ToxicityMeter
from rewardlm.utils import load_config

generator_manager = GenerativeModel(
    config['model_id'],
    load_from_peft = config['load_from_peft'],
    generation_config = config['generation']['generation_config'],
    # force the use of CPU on Apple Silicon devices (mps not supported):
    accelerator_kwargs = {
        'cpu': False if torch.cuda.is_available() else True,
    },
)
```
- Customize the prompt from the original dataset and generate the `toxicity_df` dataset:
```python
from rewardlm.data.data_utils import get_real_toxicity_prompts

toxicity_meter = ToxicityMeter(generator_manager)

batchsize = 12
custom_prompt = (
    config['generation']['custom_prompt']['user_name'] +
    ' "{prompt}".\n' +
    config['generation']['custom_prompt']['bot_name'] + ' '
)

df = get_real_toxicity_prompts()
toxicity_df = toxicity_meter.measure_toxicity(
    text_prompt = df if not config['data']['subset'] else df[:config['data']['subset_size']],
    custom_prompt = custom_prompt,
    batch_size = batchsize,
    print_response = True,
)
```
- Save the obtained results:
```python
# dtype tag used in the output file name (set it to match how the model was loaded)
load_dtype = 'fp32'

fld = './result analysis/tmp'
toxicity_df.to_csv(
    fld + f'/measured_tox_instruct_{config["generation"]["model_id"].split("/")[-1]}_{load_dtype}.csv'
)
```
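To analyse a run later, or to compare different runs (e.g. PPO vs fine-tuned), the saved CSV can be reloaded with pandas. Since the exact columns produced by ToxicityMeter are not listed here, the sketch below only performs a generic inspection.

```python
import pandas as pd

# Reload the saved results for later analysis or comparison between runs
# (the column layout depends on ToxicityMeter's output, so only generic inspection is shown).
results = pd.read_csv(
    fld + f'/measured_tox_instruct_{config["generation"]["model_id"].split("/")[-1]}_{load_dtype}.csv',
    index_col = 0,
)
print(results.head())
print(results.describe())
```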
Models:
- `LaMini-LM`: Small-sized collection of efficient language models distilled from ChatGPT and trained on a large-scale dataset of 2.58M instructions. GitHub, Paper
- `RedPajama-*`: Source
- `BloomZ`: Family of models capable of following human instructions in dozens of languages zero-shot. GitHub, Paper
- `Pythia`: Predominantly abandoned in favour of instructed models. Model(s) that combine interpretability analysis and scaling laws to understand how knowledge develops and evolves during training in autoregressive transformers. GitHub, Paper
- `Falcon-*-instruct`: Causal decoder-only model built by TII and trained on 1,500B tokens of RefinedWeb enhanced with curated corpora. Source, Source (instructed 7B model)
Datasets:
- `Real Toxicity Prompts`: Mainly used for the ⚖️ Toxicity Meter module. Dataset of 100K naturally occurring, sentence-level prompts derived from a large corpus of English web text, paired with toxicity scores from a widely used toxicity classifier. GitHub, Paper
- `DIALOCONAN`: Mainly used for the 👨🏼‍🏫 fine-tuning and 🥞 RLAIF modules. Dataset of counter-narratives to fight online hate speech. GitHub, Paper

Reward model:
- `roberta-hate-speech-dynabench-r4-target`: Model trained on ∼40,000 entries, generated and labelled by trained annotators over four rounds of dynamic data creation. Paper
How to set up on Google Colab:
- Import the main notebook in Colab
- Include the following cell at the beginning:
```python
!git clone https://__TOKEN_GIT__:@github.com/DanielSc4/RewardLM.git
%cd RewardLM/
!pip install -r requirements.txt

from huggingface_hub import login
login(token = '__TOKEN_HF__')
```
- [Optional, only if the repo is private] Replace `__TOKEN_GIT__` with your git token (more info here)
- Replace `__TOKEN_HF__` with your 🤗 HuggingFace personal token
Dependency install:
- Install `poetry`, a Python package manager
- It is recommended to run the following command to let `poetry` create the virtual environment for the project directly inside the root folder, allowing IDEs to detect dependencies and executables:
```bash
poetry config virtualenvs.in-project true
```
- Inside the root folder, run `poetry install` to get all the dependencies. See the Poetry docs for a thorough explanation of how Poetry works.
Activating virtual env:
To run a project file, you will need to use the interpreter installed by Poetry in the virtual environment, usually located in `rewardlm/.venv/bin/`. To do that, you can use the `poetry run` command, followed by the name of the script you want to run (Poetry run doc).

You can also run the following command to ensure that the terminal uses the correct Python version (the one installed in the virtual env) together with its whole set of dependencies:
```bash
source .venv/bin/activate
```
- Catch & handle the `ValueError: Responses are too short. Make sure they are at least 4 tokens long.` error, skipping the batch that triggers the anomaly.
- Add support for checkpointing and tracking more info.
- Add support for dynamic batch size based on Memory Utilities from 🤗 HuggingFace.
- [fix] Fix short-response behaviour (fewer than 4 tokens) [fix based on `generation_config`; TODO: how does generation change with bigger models?]
- Add support for model sharing (and backup) on the 🤗 HuggingFace Hub!
- Add the possibility of using a reward manager as a reward model, to have more control over the reward system.
- Make ⚖️ Toxicity Meter compatible with other datasets (possibly instructional).
- Extend ⚖️ Toxicity Meter compatibility with 🤗 Accelerate.
- Extend the possibility of managing parameters and configurations to 🥞 RLAIF.
- Use Inseq for analysis and interpretability of generative models in ⚖️ Toxicity Meter.