- Course: CSCI-GA.2565
- Institution: New York University
- Term: Spring 2024
The Llama Langchain RAG project is a question-answering application built for fans of the sitcom Friends. Combining Retrieval-Augmented Generation (RAG) with a Large Language Model (LLM), the project uses LLaMA 2, fine-tuned with the LoRA technique on Replicate, to provide detailed, contextually accurate answers to complex queries about the show's content, plot, and characters. The app is deployed with Streamlit, keeps a session chat history, and offers a choice of multiple LLaMA2 API endpoints on Replicate.
Try our app: friends-rag.streamlit.app/
Sample queries you can use: evaluation.txt
Note on Model Initialization: The first prediction request from fine-tuned models like "Finetuned LLaMA2" and "Finetuned LLaMA2 with RAG" will take longer (expect 3 to 5 minutes) after a period of inactivity due to a "cold boot," where the model needs to be fetched and loaded. Subsequent requests will respond much quicker. More details on cold boots can be found here.
Note: This is the production version of the application and is optimized for deployment. Running it locally may require modifications to suit the development environment.
- Relevant API key(s) (optional; e.g. for the embedding model)
- Python 3.11 or higher
- Git Large File Storage (LFS) for handling large datasets and model files
- Install dependencies.
  - [Optional but recommended] Create a virtual Python environment with `python -m venv .venv` and activate it with `source .venv/bin/activate`.
  - Install dependencies with `pip install -r requirements.txt`.
- Create the Chroma DB:
  `python populate_database.py`
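For orientation, here is a minimal sketch of what a script like `populate_database.py` typically does with LangChain and Chroma; the input file, chunk sizes, and the `all-MiniLM-L6-v2` embedding model are illustrative assumptions, not necessarily the project's actual choices:

```python
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

# Load a raw text file from the data/ folder (e.g. trivia.txt).
documents = TextLoader("data/trivia.txt").load()

# Split into overlapping chunks so retrieval returns focused passages.
chunks = RecursiveCharacterTextSplitter(
    chunk_size=800, chunk_overlap=80
).split_documents(documents)

# Embed the chunks and persist the vector index to the chroma/ folder.
Chroma.from_documents(
    chunks,
    HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2"),
    persist_directory="chroma",
)
```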
- Setup before being able to do inference (see the sketch after this list):
  - Case 1: If you choose to run the base Llama 2 model locally, you'll need to have Ollama installed and run `ollama serve` in a separate terminal.
  - Case 2: If you choose to do inference locally against our models on Replicate, you'll need to have `REPLICATE_API_TOKEN` set as an environment variable.
  - Case 3: You can simply try our deployed project on Streamlit: friends-rag.streamlit.app.
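As a rough illustration of Cases 1 and 2, the snippet below shows both inference paths; the model names and parameters are placeholders, not the project's exact endpoints:

```python
import replicate
from langchain_community.llms import Ollama

# Case 1: base Llama 2 served locally by Ollama (assumes `ollama serve` is running).
local_llm = Ollama(model="llama2")
print(local_llm.invoke("Who is Chandler's roommate?"))

# Case 2: a Replicate-hosted endpoint (reads REPLICATE_API_TOKEN from the environment).
output = replicate.run(
    "meta/llama-2-7b-chat",  # placeholder; point this at your fine-tuned endpoint
    input={"prompt": "Who is Chandler's roommate?", "max_new_tokens": 256},
)
print("".join(output))
```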
- Test run a query against the Chroma DB; the command below returns an answer based on RAG and the selected model:
  `python query_data.py "Which role does Adam Goldberg play?"`
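Under the hood, `query_data.py` presumably follows the standard LangChain retrieval pattern sketched here; the embedding model, `k`, and prompt wording are assumptions for illustration:

```python
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

# Open the persisted index created by populate_database.py.
db = Chroma(
    persist_directory="chroma",
    embedding_function=HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2"),
)

question = "Which role does Adam Goldberg play?"
# Retrieve the top-k chunks most similar to the question.
docs = db.similarity_search(question, k=3)
context = "\n\n".join(doc.page_content for doc in docs)

# The retrieved context is then packed into the prompt sent to the model.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```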
- Start the app locally:
  `streamlit run app.py`
If a file exceeds GitHub's recommended maximum file size of 50 MB, you may need to use Git Large File Storage.
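For example, a generic Git LFS recipe for tracking the large jsonl datasets (not project-specific configuration):

```bash
git lfs install
git lfs track "*.jsonl"
git add .gitattributes
```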
- Finetuning usually involves a domain-related dataset. In this project, we curated our own question-answer pairs dataset for finetuning and RAG.
- Selected domain-related files (txt and jsonl), such as trivia.txt and s1_s2.jsonl, are stored in the `data` folder. Using LangChain, a vector database for RAG was created in the `chroma` folder based on this data. More content can be added as needed.
- The front end and deployment are implemented with Streamlit.
- Option to select between different Llama2 chat API endpoints (base LLaMA2, finetuned LLaMA2, base with RAG, finetuned with RAG), as sketched below.
- Each model (base LLaMA2, finetuned LLaMA2, base with RAG, finetuned with RAG) runs on Replicate.
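The endpoint selection likely boils down to a Streamlit sidebar widget along these lines; the mapping and endpoint identifiers are hypothetical, and the real ones in app.py will differ:

```python
import streamlit as st

# Hypothetical mapping from display names to Replicate endpoints.
ENDPOINTS = {
    "Base LLaMA2": "meta/llama-2-7b-chat",            # placeholder
    "Finetuned LLaMA2": "your-username/friends-llama2",  # placeholder
}

choice = st.sidebar.selectbox("Choose a LLaMA2 endpoint", list(ENDPOINTS))
# Store the choice in session state so it persists across reruns,
# alongside the session chat history.
st.session_state["model"] = ENDPOINTS[choice]
```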
The frontend was refactored from a16z's implementation of their LLaMA2 chatbot.