RAG Bootcamp

This is a collection of reference implementations for Vector Institute's RAG (Retrieval-Augmented Generation) Bootcamp, scheduled to take place from Nov 2024 to Jan 2025. It demonstrates common techniques used in RAG workflows (data ingestion, chunking, embeddings, vector databases, sparse/dense retrieval, reranking) using the popular Python libraries LangChain and LlamaIndex.

Reference Implementations

This repository includes several reference implementations showing different approaches and methodologies related to Retrieval-Augmented Generation.

  • Web Search: Popular LLMs like OpenAI's GPT-4o and Meta's Llama-3 are very good at processing natural language, but their knowledge is limited to the data they were trained on. As of November 2024, neither model can correctly answer the question "Who won the 2024 World Series of Baseball?" from its training data alone.
  • Document Search: Use a collection of unstructured documents to answer domain-specific questions, like: "How many AI scholarships did Vector Institute award in 2022?"
  • SQL Search: Answer natural language questions with information from structured relational data. This demo uses a financial dataset from a Portuguese banking institution, available on Kaggle.
  • Cloud Search: Retrieve information from data stored in a cloud service (in this example, AWS S3 storage).
  • PubMed QA: A full pipeline on the PubMed dataset demonstrating ingestion, embeddings, vector index/storage, retrieval, and reranking, with a focus on evaluation metrics.
  • RAG Evaluation: RAG evaluation techniques based on the Ragas framework. Focuses on evaluation "test sets" and how to use these to determine how well a RAG pipeline is actually working.
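The retrieval step shared by these implementations can be illustrated with a small, dependency-free sketch. This is not the repository's code: the function names are hypothetical, and a simple word-count vector stands in for the learned embedding model and vector database that the notebooks use through LangChain or LlamaIndex.

```python
import math
import re
from collections import Counter


def split_sentences(text: str) -> list[str]:
    """Toy chunking: treat each sentence as one chunk."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def embed(text: str) -> Counter:
    """Toy 'embedding' (an assumption): lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


# Made-up example text, used only to exercise the functions above.
corpus = "Cats sleep most of the day. Bread is baked from flour and water. The moon orbits the earth."
chunks = split_sentences(corpus)
top = retrieve("What orbits the earth?", chunks, k=1)
print(top)
```

In a full RAG pipeline, the retrieved chunks would then be placed in the LLM prompt as context for answering the question.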

Requirements

  • Python 3.10+
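If it is more convenient to check this requirement from inside Python (for example, at the top of a notebook), a minimal sketch could look like the following. The helper name `meets_requirement` is hypothetical and not part of this repository.

```python
import sys

# Minimum version taken from the requirement above.
MIN_VERSION = (3, 10)


def meets_requirement(version=None, minimum=MIN_VERSION) -> bool:
    """Return True if the (major, minor) version is at least `minimum`."""
    if version is None:
        version = sys.version_info
    return tuple(version[:2]) >= tuple(minimum)


if not meets_requirement():
    print(f"Python {MIN_VERSION[0]}.{MIN_VERSION[1]}+ is required, found {sys.version.split()[0]}")
```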

Git Repository

Start by cloning this git repository to a local folder:

git clone https://github.com/VectorInstitute/rag_bootcamp

[Optional] Build the virtual Python environments

These instructions only apply if you are not running this code on the Vector Institute cluster. If you are working on the Vector cluster, these environments are already pre-built and ready to use in the /ssd003/projects/aieng/public/rag_bootcamp/envs folder.

The notebooks in this repository depend on several different Python environments. The following table lists the environment for each notebook:

| Notebooks | Environment |
|---|---|
| Web Search, Document Search, SQL Search, Cloud Search | rag_dataloaders |
| RAG Evaluation | rag_evaluation |
| PubMed QA | rag_pubmed_qa |

Build these environments using the following instructions:

python3 --version # Make sure this shows Python 3.10+!

# Install the dataloaders environment
python3 -m venv ./rag_dataloaders
source rag_dataloaders/bin/activate
python3 -m pip install -r ./envs/rag_dataloaders/requirements.txt
deactivate

# Install the evaluation environment
python3 -m venv ./rag_evaluation
source rag_evaluation/bin/activate
python3 -m pip install -r ./envs/rag_evaluation/requirements.txt
deactivate

# Install the pubmed_qa environment
python3 -m venv ./rag_pubmed_qa
source rag_pubmed_qa/bin/activate
python3 -m pip install -r ./envs/rag_pubmed_qa/requirements.txt
deactivate

Add the Jupyter notebook kernels

These kernels are required for the notebooks in this repository. You can make them available to Jupyter with the following instructions:

# The following path is for use on the Vector cluster. If you are using a different environment, update this accordingly.
export RAG_BOOTCAMP_ENV="/ssd003/projects/aieng/public/rag_bootcamp/envs"

source $RAG_BOOTCAMP_ENV/rag_dataloaders/bin/activate
ipython kernel install --user --name=rag_dataloaders
deactivate

source $RAG_BOOTCAMP_ENV/rag_evaluation/bin/activate
ipython kernel install --user --name=rag_evaluation
deactivate

source $RAG_BOOTCAMP_ENV/rag_pubmed_qa/bin/activate
ipython kernel install --user --name=rag_pubmed_qa
deactivate
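To confirm that all three kernels were registered, one option is to compare the kernel names Jupyter reports against the names used above. The sketch below is an assumption-laden helper, not part of this repository: `missing_kernels` and `installed_kernels` are hypothetical names, and `installed_kernels` relies on the `jupyter_client` package that is installed alongside Jupyter.

```python
# Kernel names taken from the install commands above.
REQUIRED_KERNELS = ("rag_dataloaders", "rag_evaluation", "rag_pubmed_qa")


def missing_kernels(installed, required=REQUIRED_KERNELS) -> list[str]:
    """Return the required kernel names that are absent from `installed`."""
    installed_names = {name.lower() for name in installed}
    return [name for name in required if name.lower() not in installed_names]


def installed_kernels() -> dict:
    """Ask Jupyter for its registered kernels (requires jupyter_client)."""
    from jupyter_client.kernelspec import KernelSpecManager

    # Maps each kernel name to the directory holding its kernel spec.
    return KernelSpecManager().find_kernel_specs()
```

Calling `missing_kernels(installed_kernels())` from any environment where Jupyter is installed should return an empty list once all three kernels are available.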

Lastly, start a Jupyter notebook

# The following path is for use on the Vector cluster. If you are using a different environment, update this accordingly.
export RAG_BOOTCAMP_ENV="/ssd003/projects/aieng/public/rag_bootcamp/envs"

source $RAG_BOOTCAMP_ENV/<env_to_be_used>/bin/activate
jupyter notebook --ip $(hostname --fqdn)