This repository contains the implementation of CLLM from the paper "Curated LLM: Synergy of LLMs and Data Curation for tabular augmentation in low-data regimes".
🔎 CLLM is a framework that uses LLMs to generate synthetic tabular data, coupled with a principled data-centric curation mechanism to ensure high-quality data!
CLLM supports using LLMs via Azure OpenAI, Together, or vLLM 🥳
For more details, please read our paper: Curated LLM: Synergy of LLMs and Data Curation for tabular augmentation in low-data regimes.
- Clone the repository
- (a) Create a new virtual environment with Python 3.10, e.g.:
virtualenv -p python3.10 cllm_env
- (b) Or create a new conda environment with Python 3.10, e.g.:
conda create -n cllm_env python=3.10
- With the virtualenv or conda environment activated, run the following commands from the repository directory:
- Install the minimum requirements to run CLLM:
pip install -r requirements.txt
- Link the environment to a Jupyter kernel:
python -m ipykernel install --user --name=cllm_env
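Before opening the notebooks, it can help to confirm that the active interpreter matches the setup described above. The following is a minimal sketch, not part of the CLLM codebase: the function name `describe_environment` is hypothetical, and it assumes a recent virtualenv or venv (which set `sys.base_prefix`) or a conda environment (which sets `CONDA_DEFAULT_ENV`).

```python
import os
import sys


def describe_environment(required=(3, 10)):
    """Report whether the running interpreter matches the README's setup.

    Hypothetical helper for illustration; not shipped with CLLM.
    """
    # Compare only major.minor, since the README asks for Python 3.10.
    version_ok = sys.version_info[:2] == tuple(required)
    # In a venv/virtualenv, sys.prefix points at the environment,
    # while sys.base_prefix points at the base installation.
    in_venv = sys.prefix != getattr(sys, "base_prefix", sys.prefix)
    # conda exports the active environment's name in this variable.
    conda_env = os.environ.get("CONDA_DEFAULT_ENV")
    return {
        "python": "{}.{}".format(*sys.version_info[:2]),
        "version_ok": version_ok,
        "in_venv": in_venv,
        "conda_env": conda_env,
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If `version_ok` is false, or both `in_venv` is false and `conda_env` is empty, the environment is probably not activated in the current shell or kernel.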
To get started with CLLM, try the tutorial.ipynb notebook in the root folder. You can generate synthetic data using an LLM served via OpenAI, Together, or vLLM.
To generate data for multiple datasets, see run_llm_generator.ipynb.
To run the insights experiments, run any of the Jupyter notebooks (.ipynb) in the notebooks folder.
If you use this code, please cite the associated paper:
@inproceedings{cllm2024,
title={Curated {LLM}: Synergy of {LLM}s and Data Curation for tabular augmentation in low-data regimes},
author={Nabeel Seedat and Nicolas Huynh and Boris van Breugel and Mihaela van der Schaar},
booktitle={Forty-first International Conference on Machine Learning},
year={2024},
}