guochenmeinian/Llama-Langchain-RAG

Llama Langchain RAG Project

  • Course: CSCI-GA.2565
  • Institution: New York University
  • Term: Spring 2024

Overview

The Llama Langchain RAG project is a just-for-fun application built for fans of the sitcom Friends. It combines Retrieval-Augmented Generation (RAG) with a Large Language Model (LLM): LLaMA 2, fine-tuned with the LoRA technique on Replicate, provides detailed, contextually accurate answers to complex queries about the show's content, plot, and characters. The app is deployed with Streamlit, keeps session chat history, and lets you choose among multiple LLaMA 2 API endpoints on Replicate.
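To make the RAG idea concrete, here is a minimal, dependency-free sketch of the retrieve-then-prompt flow. It is not the project's implementation: the project uses Langchain with a Chroma vector database, while this sketch swaps in plain bag-of-words cosine similarity. The sample documents and the names `retrieve` and `build_prompt` are assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical stand-in corpus; the real project indexes files from its data folder.
DOCUMENTS = [
    "Monica Geller works as a chef and is known for being very tidy.",
    "Ross Geller is a paleontologist who works at a museum.",
    "Joey Tribbiani is an actor who loves sandwiches.",
]


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, documents, k=1):
    """Return the k documents most similar to the query."""
    query_vec = Counter(tokenize(query))
    scored = [
        (cosine_similarity(query_vec, Counter(tokenize(doc))), doc)
        for doc in documents
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


def build_prompt(query, context_chunks):
    """Place the retrieved chunks ahead of the question for the LLM."""
    context = "\n".join(context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


question = "Who is a paleontologist at a museum?"
prompt = build_prompt(question, retrieve(question, DOCUMENTS))
```

A real embedding model captures meaning rather than exact word overlap, which is why the project relies on a vector database instead of a scheme like this one.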

Try our app: friends-rag.streamlit.app/

Sample queries you can use: evaluation.txt

Note on Model Initialization: After a period of inactivity, the first prediction request to a fine-tuned model such as "Finetuned LLaMA2" or "Finetuned LLaMA2 with RAG" takes longer (expect 3 to 5 minutes) because of a "cold boot," during which the model must be fetched and loaded. Subsequent requests respond much faster. More details on cold boots can be found here.

Note: This is the production version of the application and is optimized for deployment. Running it locally may require modifications to suit the development environment.

Getting Started

Prerequisites

  • Relevant API key(s) (optional; e.g. for the embedding model)
  • Python 3.11 or higher
  • Git Large File Storage (LFS) for handling large datasets and model files
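The prerequisites above can be checked before installing anything. The following is a small sketch, not part of the repository: `check_prerequisites` is a hypothetical helper, and it assumes Git LFS is installed as a `git-lfs` executable on your PATH.

```python
import shutil
import sys


def check_prerequisites(version_info=sys.version_info, which=shutil.which):
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    # The README asks for Python 3.11 or higher.
    if tuple(version_info[:2]) < (3, 11):
        problems.append("Python 3.11 or higher is required")
    # Assumption: Git LFS is available as the `git-lfs` executable.
    if which("git-lfs") is None:
        problems.append("git-lfs was not found on PATH")
    return problems


for problem in check_prerequisites():
    print("missing:", problem)
```

The two parameters exist only so the check can be exercised without changing the real environment.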

Installation

  1. Install dependencies.

    • [Optional but recommended]
      • Create a Python virtual environment with
           python -m venv .venv
        
      • Activate it with
           source .venv/bin/activate
        
    • Install dependencies with
         pip install -r requirements.txt
      
  2. Create the Chroma DB:
       python populate_database.py

  3. Set up an inference backend:

    • Case 1: To run the base Llama 2 model locally, install Ollama and run ollama serve in a separate terminal.

    • Case 2: To run inference locally against our models on Replicate, set REPLICATE_API_TOKEN as an environment variable.

    • Case 3: To skip local setup, simply try our deployed project on Streamlit: friends-rag.streamlit.app.

  4. Test a query against the Chroma DB. The command below returns an answer based on RAG and the selected model:
       python query_data.py "Which role does Adam Goldberg plays?"

  5. Start the app locally:
       streamlit run app.py
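For the Replicate case, a missing token tends to surface only when the first request fails. A minimal sketch of an early check follows. `get_replicate_token` is a hypothetical helper and not a function from this repository; only the variable name REPLICATE_API_TOKEN comes from the steps above.

```python
import os


def get_replicate_token(env=os.environ):
    """Return the Replicate token, or raise with a hint when it is missing."""
    token = env.get("REPLICATE_API_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "REPLICATE_API_TOKEN is not set; export it before running inference."
        )
    return token
```

Calling a helper like this at startup gives a clear error message up front rather than a failed API call later.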

If a file exceeds GitHub's recommended maximum file size of 50.00 MB, you may need to use Git Large File Storage.
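To see which files would need Git LFS, a short script can list everything above that size. This is a sketch under assumptions: `find_large_files` is a hypothetical helper, and the limit is taken as 50 * 1024 * 1024 bytes.

```python
import os

# Assumption: the 50 MB figure is interpreted as 50 * 1024 * 1024 bytes.
GITHUB_RECOMMENDED_LIMIT = 50 * 1024 * 1024


def find_large_files(root, limit=GITHUB_RECOMMENDED_LIMIT):
    """Return sorted (path, size_in_bytes) pairs for files larger than limit."""
    large = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into Git's own object store.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # e.g. a broken symlink
            if size > limit:
                large.append((path, size))
    return sorted(large)
```

Running `find_large_files(".")` from the repository root returns the candidates to track with Git LFS.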

Configuration & Features:

  1. Finetuning usually requires a domain-specific dataset. For this project, we curated our own dataset of question-answer pairs for finetuning and RAG.
  2. Selected domain-related files (txt and jsonl), such as trivia.txt and s1_s2.jsonl, are stored in the data folder. Using Langchain, we built a vector database from this data in the chroma folder for RAG. More content can be added as needed.
  3. The front end and deployment are implemented with Streamlit.
  4. The app lets you select between different LLaMA2 chat API endpoints (base LLaMA2, finetuned LLaMA2, base with RAG, finetuned with RAG).
  5. Each of these models runs on Replicate.
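The four endpoint choices can be seen as two independent switches: whether the model is finetuned and whether retrieval is used. The sketch below illustrates that idea only. The class, the base-model labels, and the prompt layout are hypothetical and do not reflect the app's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndpointOption:
    label: str       # text shown in the selector (wording partly assumed)
    finetuned: bool  # True when the finetuned model is used
    use_rag: bool    # True when retrieved context is added to the prompt


OPTIONS = (
    EndpointOption("Base LLaMA2", finetuned=False, use_rag=False),
    EndpointOption("Finetuned LLaMA2", finetuned=True, use_rag=False),
    EndpointOption("Base LLaMA2 with RAG", finetuned=False, use_rag=True),
    EndpointOption("Finetuned LLaMA2 with RAG", finetuned=True, use_rag=True),
)


def find_option(label):
    """Look up an option by its selector label."""
    for option in OPTIONS:
        if option.label == label:
            return option
    raise KeyError(label)


def make_prompt(option, question, retrieved_context=""):
    """Prepend retrieved context only for the RAG-enabled options."""
    if option.use_rag and retrieved_context:
        return f"Context:\n{retrieved_context}\n\nQuestion: {question}"
    return question
```

Keeping the two switches separate makes it easy to compare the same question across all four configurations.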

The frontend was refactored from a16z's implementation of their LLaMA2 chatbot.
