Commit

Update README.md
Added reference to video, that you can use a private trainer model.
jimdowling authored May 23, 2024
1 parent 81cadf2 commit 1e3d8f4
Showing 1 changed file with 15 additions and 6 deletions.
21 changes: 15 additions & 6 deletions advanced_tutorials/llm_pdfs/README.md
@@ -1,25 +1,32 @@
# ⚙️ Index Private PDFs for RAG and create Fine-Tuning Datasets from them
# ⚙️ RAG and Fine-Tuning in Hopsworks - build a private PDF search system
* [Helper video describing how to implement this LLM PDF system](https://www.youtube.com/watch?v=8YDANJ4Gbis)

This project will take a google drive folder of PDF files that you provide and read them, index them in vector embeddings in Hopsworks for retrieval augmented generation (RAG) and create an instruction dataset for fine-tuning using a teacher model (GPT).
# ⚙️ Index Private PDFs for RAG, create and serve fine-tuned models from them, and include UI for querying

This project is an AI system built on Hopsworks that
* creates vector embeddings for PDF files in a Google Drive folder (local or network directories also work) and indexes them for retrieval-augmented generation (RAG) in the Hopsworks Feature Store with Vector Indexing
* creates an instruction dataset for fine-tuning using a teacher model (GPT by default, but you can configure it to use a powerful private model such as Llama-3-70b)
* trains a fine-tuned open-source foundation model (Mistral 7b by default, but it can be swapped for other models such as Llama-3-8b) and hosts it in the model registry
* provides a UI, written in Streamlit/Python, for querying your PDFs; each answer cites the page, paragraph, and URL of the source PDF.

![Hopsworks Architecture for Private PDFs Indexed for LLMs](../..//images/llm-pdfs-architecture.gif)

## 📖 Feature Pipeline
The Feature Pipeline does the following:

* Download any new PDFs from the Google Drive folder.
* Extract chunks of text from the PDFs and store them in a Feature Group in Hopsworks.
* Use GPT to generate an instruction set for the fine-tuning a foundation LLM and store as a feature group in Hopsworks.
* Extract chunks of text from the PDFs and store them in a Vector-Index enabled Feature Group in Hopsworks.
* Use GPT (or Llama-3-70b) to generate an instruction set for the fine-tuning of a foundation LLM and store the instruction dataset as a feature group in Hopsworks.
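
The chunking step can be sketched in plain Python. This is only an illustration, assuming the page text has already been extracted from the PDF: the `Chunk` fields, the `chunk_pages` helper, and the `max_chars` limit are hypothetical names chosen here, not the tutorial's actual schema or code. Embedding the chunks and writing them to a Hopsworks feature group are left out.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    # Hypothetical fields; the tutorial's real feature group schema may differ.
    pdf_name: str
    page: int
    paragraph: int
    text: str


def chunk_pages(pdf_name: str, pages: list[str], max_chars: int = 500) -> list[Chunk]:
    """Split already-extracted page text into paragraph chunks of at most max_chars."""
    chunks = []
    for page_no, page_text in enumerate(pages, start=1):
        # Treat blank lines as paragraph boundaries and skip empty paragraphs.
        paragraphs = [p.strip() for p in page_text.split("\n\n") if p.strip()]
        for para_no, para in enumerate(paragraphs, start=1):
            # Cut paragraphs longer than max_chars into fixed-size pieces.
            for start in range(0, len(para), max_chars):
                piece = para[start:start + max_chars]
                chunks.append(Chunk(pdf_name, page_no, para_no, piece))
    return chunks


# Two pages: the first has two short paragraphs, the second one long paragraph.
example_chunks = chunk_pages("a.pdf", ["First para.\n\nSecond para.", "x" * 1200])
```

Keeping the page and paragraph numbers on every chunk is what would later let the UI cite where an answer came from.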

## 🏃🏻‍♂️Training Pipeline
This step is optional; run it only if you also want to create a fine-tuned model.
The Training Pipeline does the following:

* Uses the instruction dataset and LoRA to fine-tune the open-source LLM (Mistral-7B-Instruct-v0.2 by default) .
* Uses the instruction dataset and LoRA to fine-tune the open-source LLM (Mistral-7B-Instruct-v0.2 by default).
* Saves the fine-tuned model to Hopsworks Model Registry.
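
To see why LoRA keeps fine-tuning cheap, it helps to count trainable parameters. LoRA freezes a weight matrix `W` of shape `d_out x d_in` and trains only two low-rank factors, `B` (`d_out x r`) and `A` (`r x d_in`). The sketch below is arithmetic only; the 4096 x 4096 shape and the rank of 8 are illustrative assumptions, not values taken from this tutorial's training configuration.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Number of weights in a dense d_out x d_in matrix."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable weights in the LoRA factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


# Illustrative shape and rank (assumptions, not the tutorial's settings).
d_out, d_in, rank = 4096, 4096, 8

dense = full_params(d_out, d_in)
adapter = lora_params(d_out, d_in, rank)
fraction = adapter / dense

print(f"dense: {dense:,}  LoRA: {adapter:,}  fraction: {fraction:.4%}")
```

With these numbers the adapter trains 65,536 weights for a matrix that holds 16,777,216, which is under half a percent.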

## 🚀 Inference Pipeline
* A chatbot written in Streamlit that answers questions about the PDFs you uploaded using RAG and an embedded LLM.
* A chatbot written in Streamlit that answers questions about the PDFs you uploaded using RAG and your embedded LLM (either an off-the-shelf model, like Mistral-7B-Instruct-v0.2, or your fine-tuned LLM).
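
The retrieval half of RAG can be sketched without any libraries. The example below is a toy under stated assumptions: the two-dimensional vectors stand in for real embeddings, the brute-force cosine search stands in for the vector index lookup in Hopsworks, and `cosine`, `top_k`, `build_prompt`, and the metadata keys are hypothetical names, not part of the tutorial's code.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, index, k=2):
    """Return the metadata of the k entries most similar to query_vec."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [meta for _, meta in ranked[:k]]


def build_prompt(question, contexts):
    """Assemble a prompt whose context lines carry their source citation."""
    lines = ["Answer using only the context below and cite the source.", ""]
    for c in contexts:
        lines.append(f"[{c['pdf']} p.{c['page']} para.{c['paragraph']}] {c['text']}")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)


# Toy index of (embedding, metadata) pairs; real embeddings have hundreds of dimensions.
index = [
    ([1.0, 0.0], {"pdf": "a.pdf", "page": 1, "paragraph": 1, "text": "Chunk about topic A."}),
    ([0.0, 1.0], {"pdf": "b.pdf", "page": 3, "paragraph": 2, "text": "Chunk about topic B."}),
    ([0.9, 0.1], {"pdf": "a.pdf", "page": 2, "paragraph": 4, "text": "More on topic A."}),
]

hits = top_k([1.0, 0.05], index, k=2)
prompt = build_prompt("What does the document say about topic A?", hits)
```

The resulting prompt would then be passed to the LLM; because each context line is prefixed with its PDF, page, and paragraph, the model has what it needs to cite its sources.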

## 🕵🏻‍♂️ Google Drive Credentials Creation

@@ -34,3 +41,5 @@ Next, integrate these files into your project:
2. Place both `credentials.json` and `client_secret.json` files inside this credentials directory.

Now, you are ready to download your PDFs from the Google Drive!

