Skip to content

A lightweight library for generating synthetic instruction tuning datasets for your data without GPT.

License

Notifications You must be signed in to change notification settings

simonraj79/bonito

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

17 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Bonito

Bonito is an open-source model for conditional task generation: the task of converting unannotated text into task-specific training datasets for instruction tuning. This repo is a lightweight library for Bonito to easily create synthetic datasets built on top of the Hugging Face transformers and vllm libraries.

Bonito

Installation

Create an environment and install the package using the following commands:

conda create -n bonito python=3.9
conda activate bonito
pip install -e .

Basic Usage

To generate synthetic instruction tuning dataset using Bonito, you can use the following code:

from bonito import Bonito, SamplingParams
from datasets import load_dataset

# Initialize the Bonito model
bonito = Bonito("BatsResearch/bonito-v1")

# load dataset with unannotated text
unannotated_text = load_dataset(
    "BatsResearch/bonito-experiment",
    "unannotated_contract_nli"
)["train"].select(range(10))

# Generate synthetic instruction tuning dataset
sampling_params = SamplingParams(max_tokens=256, top_p=0.95, temperature=0.5, n=1)
synthetic_dataset = bonito.generate_tasks(
    unannotated_text,
    context_col="input",
    task_type="nli",
    sampling_params=sampling_params
)

Supported Task Types [full name (short form)]: extractive question answering (exqa), multiple-choice question answering (mcqa), question generation (qg), question answering without choices (qa), yes-no question answering (ynqa), coreference resolution (coref), paraphrase generation (paraphrase), paraphrase identification (paraphrase_id), sentence completion (sent_comp), sentiment (sentiment), summarization (summarization), text generation (text_gen), topic classification (topic_class), word sense disambiguation (wsd), textual entailment (te), natural language inference (nli)

You can use either the full name or the short form to specify the task_type.

Citation

If you use Bonito in your research, please cite the following paper:

@article{bonito:arxiv24,
  Author = {Nihal V. Nayak and Yiyang Nan and Avi Trost and Stephen H. Bach},
  Title = {Learning to Generate Instruction Tuning Datasets for Zero-Shot Task Adaptation},
  Volume = {arXiv:2402.18334 [cs.CL]},
  Year = {2024}}

About

A lightweight library for generating synthetic instruction tuning datasets for your data without GPT.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 100.0%