End-to-end training of Retrieval-Augmented LMs (REALM, RAG)


VOD



Research paper: Variational Open-Domain Question Answering, ICML 2023

Latest News 🔥

  • October 16, 2023: We released version 0.2.0 of VOD - simpler & better code ✨
  • September 21, 2023: We integrated the BeIR datasets 🍻
  • July 19, 2023: We integrated Qdrant as a search backend 📦
  • June 15, 2023: We adopted Lightning's Fabric ⚡️
  • April 24, 2023: We will be presenting Variational Open-Domain Question Answering at ICML 2023 in Hawaii 🌊

What is VOD? 🎯

VOD aims to build, train, and evaluate next-generation retrieval-augmented language models (REALMs). The project started with our research paper Variational Open-Domain Question Answering, in which we introduce the VOD objective: a new variational objective for end-to-end training of REALMs.
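At a glance (a hedged sketch in our own notation, not the paper's exact estimator): a REALM treats the retrieved document as a latent variable, and the VOD objective is a self-normalized importance-sampling estimate of the Rényi variational bound over that latent, in the spirit of Li & Turner (2016):

```latex
% Hedged sketch, in our own notation (see the paper for the exact estimator).
% The retrieved document d is treated as a latent variable of the task model:
\[
  p_\theta(y \mid x) \;=\; \sum_{d} p_\theta(y \mid x, d)\, p_\theta(d \mid x),
\]
% and the R\'enyi variational bound, evaluated with documents drawn from an
% auxiliary proposal r(d \mid x) (cached retriever and/or approximate
% posterior), lower-bounds the task marginal log-likelihood for \alpha \in [0, 1):
\[
  \mathcal{L}_\alpha
  \;=\; \frac{1}{1-\alpha}
  \log \mathbb{E}_{d \sim r(\cdot \mid x)}
  \left[ \left( \frac{p_\theta(y, d \mid x)}{r(d \mid x)} \right)^{1-\alpha} \right]
  \;\le\; \log p_\theta(y \mid x).
\]
% VOD estimates this expectation with self-normalized importance sampling over
% a handful of sampled documents, which keeps training tractable even when
% p_\theta(d \mid x) ranges over a very large corpus.
```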

The original paper only explored the application of the VOD objective to multiple-choice ODQA; this repo aims to explore generative tasks (generative QA, language modelling, and chat). We are building tools to make the training of large generative search models possible and developer-friendly. The main modules are listed below (a hedged sketch of what they compute follows the list):

  • vod_dataloaders: efficient torch.utils.data.DataLoader with dynamic retrieval from multiple search engines
  • vod_search: a common interface to handle sparse and dense search engines (elasticsearch, faiss, Qdrant)
  • vod_models: a collection of REALMs using large retrievers (T5s, e5, etc.) and OS generative models (RWKV, LLAMA 2, etc.). We also include vod_gradients, a module to compute gradients of RAGs end-to-end.
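To make this concrete, here is a minimal, hypothetical PyTorch sketch of the quantity such models optimize in the naive case; every name is illustrative and none of it is the actual vod API:

```python
# Hypothetical sketch (names are ours, not the vod API): the naive
# end-to-end objective log sum_k p(d_k | x) p(y | x, d_k) over K retrieved
# sections, differentiable w.r.t. both the retriever and the reader.
import torch

def realm_log_likelihood(
    query_emb: torch.Tensor,         # (dim,) query embedding from the retriever
    doc_embs: torch.Tensor,          # (K, dim) embeddings of K retrieved sections
    reader_log_probs: torch.Tensor,  # (K,) reader scores log p(y | x, d_k)
) -> torch.Tensor:
    retriever_log_probs = torch.log_softmax(doc_embs @ query_emb, dim=-1)
    return torch.logsumexp(retriever_log_probs + reader_log_probs, dim=-1)

# Toy usage: gradients reach the retriever inputs as well as the reader
# scores, which is the "end-to-end" part (the VOD estimator in
# vod_gradients replaces this naive marginalization).
query = torch.randn(16, requires_grad=True)
docs = torch.randn(8, 16, requires_grad=True)
reader_scores = torch.log_softmax(torch.randn(8, requires_grad=True), dim=-1)
realm_log_likelihood(query, docs, reader_scores).backward()
```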

Roadmap Summer 2023 ☀️

Progress tracked in #1

The repo is currently in research preview. This means we already have a few components in place, but we still have work to do before wider adoption of VOD and before training next-gen REALMs. Our objectives for this summer are:

  • Search API: add Filtering Capabilities
  • Datasets: support more datasets for common IR & Gen AI
  • UX: plug-and-play, extendable
  • Modelling: implement REALM for Generative Tasks
  • Gradients: VOD for Generative Tasks

Join us 🤝

If you also see great potential in combining LLMs with search components, join the team! We welcome developers interested in building scalable and easy-to-use tools as well as NLP researchers.

Project Structure 🏗️

| Module | Usage | Status |
|---|---|---|
| vod_configs | Structured pydantic configurations | ✅ |
| vod_dataloaders | Dataloaders for retrieval-augmented tasks | ✅ |
| vod_datasets | Universal dataset interface (rosetta) and custom dataloaders (e.g., BeIR) | ✅ |
| vod_exps | Research experiments, configurable with hydra | ✅ |
| vod_models | A collection of REALMs + gradients (retrieval, VOD, etc.) | ⚠️ |
| vod_ops | ML operations using lightning.Fabric (training, benchmarking, indexing, etc.) | ✅ |
| vod_search | Hybrid and sharded search clients (elasticsearch, faiss and qdrant) | ✅ |
| vod_tools | A collection of utilities (pretty printing, argument parsing, etc.) | ✅ |
| vod_types | A collection of data structures and python typing modules | ✅ |

Note: The code for VOD gradients and sampling methods currently lives at VodLM/vod-gradients. That project is still under development and will be integrated into this repo in the next month.

Installation 📦

We only support development mode for now. You need to install and run Elasticsearch on your system. Then install Poetry and the project using:

curl -sSL https://install.python-poetry.org | python3 -
poetry install
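Before going further, you can check that Elasticsearch is reachable; a minimal sketch using the official elasticsearch Python client, assuming the default local endpoint:

```python
# Minimal connectivity check, assuming a default local Elasticsearch
# at http://localhost:9200 (adjust the URL to your setup).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
if es.ping():
    print("Elasticsearch is up:", es.info()["version"]["number"])
else:
    print("Elasticsearch is not reachable - is the service running?")
```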

Note: See the Tips and Tricks section to build the latest faiss with CUDA support.

Examples

# How to load datasets with the universal `rosetta` interface
poetry run python -m examples.datasets.rosetta

# How to start and use a `faiss` search engine
poetry run python -m examples.search.faiss

# How to start and use a `qdrant` search engine
poetry run python -m examples.search.qdrant

# How to compute embeddings for a large dataset using `lightning.Fabric`
poetry run python -m examples.features.predict

# How to build a Realm dataloader backed by a Hybrid search engine
poetry run python -m examples.features.dataloader
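To see what the faiss example above boils down to, here is a self-contained sketch of dense search with a flat inner-product index (plain faiss, not VOD's search wrapper):

```python
# Self-contained dense-search sketch using plain faiss (not VOD's wrapper):
# index random vectors and retrieve the 5 nearest neighbours per query.
import faiss
import numpy as np

dim = 128
corpus = np.random.rand(10_000, dim).astype("float32")
queries = np.random.rand(3, dim).astype("float32")

index = faiss.IndexFlatIP(dim)          # exact inner-product search
index.add(corpus)                       # add the corpus vectors
scores, ids = index.search(queries, 5)  # (3, 5) scores and section ids
print(ids)
```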

Using the trainer CLI 🚀

VOD allows training large retrieval models while dynamically retrieving sections from a large knowledge base. The CLI is accessible with:

poetry run train

# Debugging -- Run the training script with a small model and small dataset
poetry run train model/encoder=debug datasets=scifact
🔧 Arguments & config files

The train endpoint uses hydra to parse arguments and configure the run. See vod_exps/hydra/main.yaml for the default configuration. You can override any of the default values by passing them as arguments to the train endpoint. For example, to train a T5-base encoder on MSMarco using FSDP:

poetry run train model/encoder=t5-base batch_size.per_device=4 datasets=msmarco fabric/strategy=fsdp
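Under the hood, hydra resolves these overrides against the config tree; a hedged sketch using hydra's compose API (the config path and keys below are assumptions based on this README, see vod_exps/hydra/main.yaml for the real tree):

```python
# Hedged sketch of what the CLI overrides do, via hydra's compose API.
# The config path and key names are assumptions based on the README.
from hydra import compose, initialize

with initialize(version_base=None, config_path="vod_exps/hydra"):
    cfg = compose(
        config_name="main",
        overrides=["model/encoder=t5-base", "batch_size.per_device=4"],
    )
print(cfg.batch_size.per_device)  # -> 4, the override replaced the default
```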

Technical details

πŸ™ Multiple datasets & Sharded search

VOD is built for multi-dataset training. You can use multiple training/validation/test datasets, each pointing to a different corpus. Each corpus can be augmented with a specific search backend. For instance, this config allows using Qdrant as a backend for the squad sections and QuALITY contexts while using faiss to index Wikipedia.

VOD implements a hybrid, sharded search engine. This means that for each indexed corpus, VOD fits multiple search engines (e.g., Elasticsearch + Qdrant). At query time, data points are dispatched to the matching shard (corpus) based on the dataset.link attribute.
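As an illustration of this dispatch logic (hypothetical names throughout; the real implementation lives in vod_search), a sketch:

```python
# Hypothetical sketch of shard dispatch (names are ours, not vod_search's):
# each corpus shard owns its own engines, and data points are routed to a
# shard by their dataset `link` attribute.
from dataclasses import dataclass, field

@dataclass
class Shard:
    name: str
    engines: list = field(default_factory=list)  # e.g., elasticsearch + qdrant clients

    def search(self, query: str, top_k: int) -> list[tuple[str, float]]:
        # Placeholder: a real shard would fan out to each engine and
        # merge the hybrid scores into a single ranked list.
        return [(f"{self.name}-doc-{i}", 1.0 / (i + 1)) for i in range(top_k)]

shards = {"squad": Shard("squad"), "wikipedia": Shard("wikipedia")}

def dispatch(batch: list[dict], top_k: int = 10) -> list[list[tuple[str, float]]]:
    # Route each data point to the shard named by its `link` attribute.
    return [shards[point["link"]].search(point["query"], top_k) for point in batch]

batch = [{"query": "what is SQuAD?", "link": "squad"},
         {"query": "capital of Denmark", "link": "wikipedia"}]
print(dispatch(batch, top_k=2))
```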

Figure: sharded search.

Tips and Tricks 🦊

🐍 Set up a Mamba environment and build faiss-gpu from source
# install mamba
curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh"
bash Mambaforge-$(uname)-$(uname -m).sh
# setup base env - try to run it, or follow the script step by step
bash setup-scripts/setup-mamba-env.sh
# build faiss - try to run it, or follow the script step by step
bash setup-scripts/build-faiss-gpu.sh
# **Optional**: Install faiss-gpu in your poetry env:
export PYPATH=`poetry run which python`
(cd libs/faiss/build/faiss/python && $PYPATH setup.py install)
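After the build, you can sanity-check that the CUDA build is picked up; a small sketch, assuming faiss-gpu is installed in the active environment:

```python
# Sanity check for a CUDA-enabled faiss build (assumes faiss-gpu is
# installed in the active environment).
import faiss
import numpy as np

print("GPUs visible to faiss:", faiss.get_num_gpus())

# Move a tiny index to GPU 0 and run one query to confirm the build works.
index = faiss.index_cpu_to_gpu(faiss.StandardGpuResources(), 0, faiss.IndexFlatL2(64))
index.add(np.random.rand(100, 64).astype("float32"))
print(index.search(np.random.rand(1, 64).astype("float32"), 3))
```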

Citation 📚

Variational Open-Domain Question Answering

This repo is a clean rewrite of the original code at FindZebra/fz-openqa, aimed at handling larger datasets, larger models, and generative tasks.

@InProceedings{pmlr-v202-lievin23a,
  title =     {Variational Open-Domain Question Answering},
  author =    {Li\'{e}vin, Valentin and Motzfeldt, Andreas Geert and Jensen, Ida Riis and Winther, Ole},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages =     {20950--20977},
  year =      {2023},
  editor =    {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume =    {202},
  series =    {Proceedings of Machine Learning Research},
  month =     {23--29 Jul},
  publisher = {PMLR},
  pdf =       {https://proceedings.mlr.press/v202/lievin23a/lievin23a.pdf},
  url =       {https://proceedings.mlr.press/v202/lievin23a.html},
  abstract =  {Retrieval-augmented models have proven to be effective in natural language processing tasks, yet there remains a lack of research on their optimization using variational inference. We introduce the Variational Open-Domain (VOD) framework for end-to-end training and evaluation of retrieval-augmented models, focusing on open-domain question answering and language modelling. The VOD objective, a self-normalized estimate of the R\'enyi variational bound, approximates the task marginal likelihood and is evaluated under samples drawn from an auxiliary sampling distribution (cached retriever and/or approximate posterior). It remains tractable, even for retriever distributions defined on large corpora. We demonstrate VOD's versatility by training reader-retriever BERT-sized models on multiple-choice medical exam questions. On the MedMCQA dataset, we outperform the domain-tuned Med-PaLM by +5.3% despite using 2.500$\times$ fewer parameters. Our retrieval-augmented BioLinkBERT model scored 62.9% on the MedMCQA and 55.0% on the MedQA-USMLE. Last, we show the effectiveness of our learned retriever component in the context of medical semantic search.}
}

Partners 🏫

The project is currently supported by the Technical University of Denmark (DTU) and Raffle.ai.
