End-to-end training of Retrieval-Augmented LMs (REALM, RAG)


VOD



Research paper: Variational Open-Domain Question Answering, ICML 2023

Latest News 🔥

  • October 16, 2023: We released version 0.2.0 of VOD - simpler & better code ✨
  • September 21, 2023: We integrated the BeIR datasets 🍻
  • July 19, 2023: We integrated Qdrant as a search backend 📦
  • June 15, 2023: We adopted Lightning's Fabric ⚡️
  • April 24, 2023: We will be presenting Variational Open-Domain Question Answering at ICML 2023 in Hawaii 🌊

What is VOD? 🎯

VOD aims to build, train, and evaluate next-generation retrieval-augmented language models (REALMs). The project started with our research paper Variational Open-Domain Question Answering, in which we introduce the VOD objective: a new variational objective for end-to-end training of REALMs.
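At a glance (a hedged sketch in our own notation, not the paper's exact estimator): a REALM treats the retrieved document as a latent variable, and the VOD objective is a self-normalized importance-sampling estimate of the Rényi variational bound over that latent, in the spirit of Li & Turner (2016):

```latex
% Hedged sketch, in our own notation (see the paper for the exact estimator).
% The retrieved document d is treated as a latent variable of the task model:
\[
  p_\theta(y \mid x) \;=\; \sum_{d} p_\theta(y \mid x, d)\, p_\theta(d \mid x),
\]
% and the R\'enyi variational bound, evaluated with documents drawn from an
% auxiliary proposal r(d \mid x) (cached retriever and/or approximate
% posterior), lower-bounds the task marginal log-likelihood for \alpha \in [0, 1):
\[
  \mathcal{L}_\alpha
  \;=\; \frac{1}{1-\alpha}
  \log \mathbb{E}_{d \sim r(\cdot \mid x)}
  \left[ \left( \frac{p_\theta(y, d \mid x)}{r(d \mid x)} \right)^{1-\alpha} \right]
  \;\le\; \log p_\theta(y \mid x).
\]
% VOD estimates this expectation with self-normalized importance sampling over
% a handful of sampled documents, which keeps training tractable even when
% p_\theta(d \mid x) ranges over a very large corpus.
```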

The original paper only explored the application of the VOD objective to multiple-choice ODQA; this repo aims to explore generative tasks (generative QA, language modelling, and chat). We are building tools to make the training of large generative search models possible and developer-friendly. The main modules are listed below (a hedged sketch of what they compute follows the list):

  • vod_dataloaders: efficient torch.utils.data.DataLoader with dynamic retrieval from multiple search engines
  • vod_search: a common interface to handle sparse and dense search engines (elasticsearch, faiss, Qdrant)
  • vod_models: a collection of REALMs using large retrievers (T5s, e5, etc.) and OS generative models (RWKV, LLAMA 2, etc.). We also include vod_gradients, a module to compute gradients of RAGs end-to-end.
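To make this concrete, here is a minimal, hypothetical PyTorch sketch of the quantity such models optimize in the naive case; every name is illustrative and none of it is the actual vod API:

```python
# Hypothetical sketch (names are ours, not the vod API): the naive
# end-to-end objective log sum_k p(d_k | x) p(y | x, d_k) over K retrieved
# sections, differentiable w.r.t. both the retriever and the reader.
import torch

def realm_log_likelihood(
    query_emb: torch.Tensor,         # (dim,) query embedding from the retriever
    doc_embs: torch.Tensor,          # (K, dim) embeddings of K retrieved sections
    reader_log_probs: torch.Tensor,  # (K,) reader scores log p(y | x, d_k)
) -> torch.Tensor:
    retriever_log_probs = torch.log_softmax(doc_embs @ query_emb, dim=-1)
    return torch.logsumexp(retriever_log_probs + reader_log_probs, dim=-1)

# Toy usage: gradients reach the retriever inputs as well as the reader
# scores, which is the "end-to-end" part (the VOD estimator in
# vod_gradients replaces this naive marginalization).
query = torch.randn(16, requires_grad=True)
docs = torch.randn(8, 16, requires_grad=True)
reader_scores = torch.log_softmax(torch.randn(8, requires_grad=True), dim=-1)
realm_log_likelihood(query, docs, reader_scores).backward()
```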

Roadmap Summer 2023 ☀️

Progress tracked in #1

The repo is currently in research preview. This means we already have a few components in place, but we still have work to do before wider adoption of VOD and before training next-gen REALMs. Our objectives for this summer are:

  • Search API: add Filtering Capabilities
  • Datasets: support more datasets for common IR & Gen AI
  • UX: plug-and-play, extendable
  • Modelling: implement REALM for Generative Tasks
  • Gradients: VOD for Generative Tasks

Join us 🤝

If you also see great potential in combining LLMs with search components, join the team! We welcome developers interested in building scalable and easy-to-use tools as well as NLP researchers.

Project Structure 🏗️

| Module | Usage | Status |
|---|---|---|
| vod_configs | Structured pydantic configurations | ✅ |
| vod_dataloaders | Dataloaders for retrieval-augmented tasks | ✅ |
| vod_datasets | Universal dataset interface (rosetta) and custom dataloaders (e.g., BeIR) | ✅ |
| vod_exps | Research experiments, configurable with hydra | ✅ |
| vod_models | A collection of REALMs + gradients (retrieval, VOD, etc.) | ⚠️ |
| vod_ops | ML operations using lightning.Fabric (training, benchmarking, indexing, etc.) | ✅ |
| vod_search | Hybrid and sharded search clients (elasticsearch, faiss and qdrant) | ✅ |
| vod_tools | A collection of utilities (pretty printing, argument parsing, etc.) | ✅ |
| vod_types | A collection of data structures and python typing modules | ✅ |

Note: The code for VOD gradients and sampling methods currently lives at VodLM/vod-gradients. That project is still under development and will be integrated into this repo in the next month.

Installation 📦

We only support development mode for now. You need to install and run Elasticsearch on your system. Then install Poetry and the project using:

curl -sSL https://install.python-poetry.org | python3 -
poetry install
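Before going further, you can check that Elasticsearch is reachable; a minimal sketch using the official elasticsearch Python client, assuming the default local endpoint:

```python
# Minimal connectivity check, assuming a default local Elasticsearch
# at http://localhost:9200 (adjust the URL to your setup).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
if es.ping():
    print("Elasticsearch is up:", es.info()["version"]["number"])
else:
    print("Elasticsearch is not reachable - is the service running?")
```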

Note: See the Tips and Tricks section to build the latest faiss with CUDA support.

Examples

# How to load datasets with the universal `rosetta` interface
poetry run python -m examples.datasets.rosetta

# How to start and use a `faiss` search engine
poetry run python -m examples.search.faiss

# How to start and use a `qdrant` search engine
poetry run python -m examples.search.qdrant

# How to compute embeddings for a large dataset using `lightning.Fabric`
poetry run python -m examples.features.predict

# How to build a Realm dataloader backed by a Hybrid search engine
poetry run python -m examples.features.dataloader
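To see what the faiss example above boils down to, here is a self-contained sketch of dense search with a flat inner-product index (plain faiss, not VOD's search wrapper):

```python
# Self-contained dense-search sketch using plain faiss (not VOD's wrapper):
# index random vectors and retrieve the 5 nearest neighbours per query.
import faiss
import numpy as np

dim = 128
corpus = np.random.rand(10_000, dim).astype("float32")
queries = np.random.rand(3, dim).astype("float32")

index = faiss.IndexFlatIP(dim)          # exact inner-product search
index.add(corpus)                       # add the corpus vectors
scores, ids = index.search(queries, 5)  # (3, 5) scores and section ids
print(ids)
```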

Using the trainer CLI 🚀

VOD allows training large retrieval models while dynamically retrieving sections from a large knowledge base. The CLI is accessible with:

poetry run train

# Debugging -- Run the training script with a small model and small dataset
poetry run train model/encoder=debug datasets=scifact
🔧 Arguments & config files

The train endpoint uses hydra to parse arguments and configure the run. See vod_exps/hydra/main.yaml for the default configuration. You can override any of the default values by passing them as arguments to the train endpoint. For example, to train a T5-base encoder on MSMarco using FSDP:

poetry run train model/encoder=t5-base batch_size.per_device=4 datasets=msmarco fabric/strategy=fsdp
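Under the hood, hydra resolves these overrides against the config tree; a hedged sketch using hydra's compose API (the config path and keys below are assumptions based on this README, see vod_exps/hydra/main.yaml for the real tree):

```python
# Hedged sketch of what the CLI overrides do, via hydra's compose API.
# The config path and key names are assumptions based on the README.
from hydra import compose, initialize

with initialize(version_base=None, config_path="vod_exps/hydra"):
    cfg = compose(
        config_name="main",
        overrides=["model/encoder=t5-base", "batch_size.per_device=4"],
    )
print(cfg.batch_size.per_device)  # -> 4, the override replaced the default
```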

Technical details

πŸ™ Multiple datasets & Sharded search

VOD is built for multi-dataset training. You can use multiple training/validation/test datasets, each pointing to a different corpus. Each corpus can be augmented with a specific search backend. For instance, this config allows using Qdrant as a backend for the squad sections and QuALITY contexts while using faiss to index Wikipedia.

VOD implements a hybrid, sharded search engine. This means that for each indexed corpus, VOD fits multiple search engines (e.g., Elasticsearch + Qdrant). At query time, data points are dispatched to the matching shard (corpus) based on the dataset.link attribute.
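As an illustration of this dispatch logic (hypothetical names throughout; the real implementation lives in vod_search), a sketch:

```python
# Hypothetical sketch of shard dispatch (names are ours, not vod_search's):
# each corpus shard owns its own engines, and data points are routed to a
# shard by their dataset `link` attribute.
from dataclasses import dataclass, field

@dataclass
class Shard:
    name: str
    engines: list = field(default_factory=list)  # e.g., elasticsearch + qdrant clients

    def search(self, query: str, top_k: int) -> list[tuple[str, float]]:
        # Placeholder: a real shard would fan out to each engine and
        # merge the hybrid scores into a single ranked list.
        return [(f"{self.name}-doc-{i}", 1.0 / (i + 1)) for i in range(top_k)]

shards = {"squad": Shard("squad"), "wikipedia": Shard("wikipedia")}

def dispatch(batch: list[dict], top_k: int = 10) -> list[list[tuple[str, float]]]:
    # Route each data point to the shard named by its `link` attribute.
    return [shards[point["link"]].search(point["query"], top_k) for point in batch]

batch = [{"query": "what is SQuAD?", "link": "squad"},
         {"query": "capital of Denmark", "link": "wikipedia"}]
print(dispatch(batch, top_k=2))
```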

Figure: sharded search.

Tips and Tricks 🦊

🐍 Set up a Mamba environment and build faiss-gpu from source
# install mamba
curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh"
bash Mambaforge-$(uname)-$(uname -m).sh
# setup base env - try to run it, or follow the script step by step
bash setup-scripts/setup-mamba-env.sh
# build faiss - try to run it, or follow the script step by step
bash setup-scripts/build-faiss-gpu.sh
# **Optional**: Install faiss-gpu in your poetry env:
export PYPATH=`poetry run which python`
(cd libs/faiss/build/faiss/python && $PYPATH setup.py install)
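After the build, you can sanity-check that the CUDA build is picked up; a small sketch, assuming faiss-gpu is installed in the active environment:

```python
# Sanity check for a CUDA-enabled faiss build (assumes faiss-gpu is
# installed in the active environment).
import faiss
import numpy as np

print("GPUs visible to faiss:", faiss.get_num_gpus())

# Move a tiny index to GPU 0 and run one query to confirm the build works.
index = faiss.index_cpu_to_gpu(faiss.StandardGpuResources(), 0, faiss.IndexFlatL2(64))
index.add(np.random.rand(100, 64).astype("float32"))
print(index.search(np.random.rand(1, 64).astype("float32"), 3))
```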

Citation 📚

Variational Open-Domain Question Answering

This repo is a clean rewrite of the original code at FindZebra/fz-openqa, aimed at handling larger datasets, larger models, and generative tasks.

@InProceedings{pmlr-v202-lievin23a,
  title =     {Variational Open-Domain Question Answering},
  author =    {Li\'{e}vin, Valentin and Motzfeldt, Andreas Geert and Jensen, Ida Riis and Winther, Ole},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages =     {20950--20977},
  year =      {2023},
  editor =    {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume =    {202},
  series =    {Proceedings of Machine Learning Research},
  month =     {23--29 Jul},
  publisher = {PMLR},
  pdf =       {https://proceedings.mlr.press/v202/lievin23a/lievin23a.pdf},
  url =       {https://proceedings.mlr.press/v202/lievin23a.html},
  abstract =  {Retrieval-augmented models have proven to be effective in natural language processing tasks, yet there remains a lack of research on their optimization using variational inference. We introduce the Variational Open-Domain (VOD) framework for end-to-end training and evaluation of retrieval-augmented models, focusing on open-domain question answering and language modelling. The VOD objective, a self-normalized estimate of the R\'enyi variational bound, approximates the task marginal likelihood and is evaluated under samples drawn from an auxiliary sampling distribution (cached retriever and/or approximate posterior). It remains tractable, even for retriever distributions defined on large corpora. We demonstrate VOD's versatility by training reader-retriever BERT-sized models on multiple-choice medical exam questions. On the MedMCQA dataset, we outperform the domain-tuned Med-PaLM by +5.3% despite using 2.500$\times$ fewer parameters. Our retrieval-augmented BioLinkBERT model scored 62.9% on the MedMCQA and 55.0% on the MedQA-USMLE. Last, we show the effectiveness of our learned retriever component in the context of medical semantic search.}
}

Partners 🏫

The project is currently supported by the Technical University of Denmark (DTU) and Raffle.ai.
