This project demonstrates how to implement Retrieval-Augmented Generation (RAG) using Gradio, LangChain, Docling, Milvus, and HuggingFace. The system supports multiple file types, including PDF, Images, HTML, and PPTX, for document conversion, chunking, and question answering.
- **File Formats Supported** (a converter configuration sketch follows this list):
  - PDF: Text extracted using the `PyPdfiumDocumentBackend`.
  - Images: OCR can be implemented if actual text extraction is desired.
  - HTML: Direct parsing of HTML content.
  - PPTX: PowerPoint files processed to extract slide content.
  - Other formats under testing: `.txt`, `.md`, `.asciidoc`.
  - Not supported: `.docx` (Word), `.xlsx` (Excel).
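The snippet below is a minimal sketch of how Docling's converter could be restricted to these input formats and pointed at the PDF backend named above; it assumes Docling's `DocumentConverter`, `PdfFormatOption`, and `InputFormat` APIs and is illustrative rather than a copy of this project's actual setup.

```python
from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Route PDFs through the pypdfium backend; OCR stays off unless scanned pages need it.
pdf_options = PdfPipelineOptions()
pdf_options.do_ocr = False

# Only accept the formats listed above.
converter = DocumentConverter(
    allowed_formats=[
        InputFormat.PDF,
        InputFormat.IMAGE,
        InputFormat.HTML,
        InputFormat.PPTX,
    ],
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pdf_options,
            backend=PyPdfiumDocumentBackend,
        )
    },
)
```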
- **Chunking:**
  - Utilizes LangChain's `RecursiveCharacterTextSplitter` to split documents into manageable chunks (a configuration sketch follows this list).
  - Configurable `chunk_size` and `chunk_overlap` for optimal performance.
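A minimal splitter sketch, assuming the `langchain-text-splitters` package from the install command below; the 1000/200 values mirror the example workflow later in this README and can be tuned.

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # maximum characters per chunk
    chunk_overlap=200,  # characters shared between neighbouring chunks
)

# markdown_text stands in for the Markdown exported by Docling for one document.
markdown_text = "# Example\nSome exported document text..."
chunks = splitter.create_documents([markdown_text])
```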
- **RAG Workflow:**
  - Extracts text from uploaded files using Docling.
  - Splits documents into chunks for embedding and retrieval.
  - Uses Milvus as a vector store for embedding-based retrieval.
  - Employs HuggingFace LLMs (e.g., Mistral-7B-Instruct) to answer user queries based on the retrieved context.
- **Gradio Interface** (a minimal layout sketch follows this list):
  - Upload & Split: Allows users to upload files and split them into chunks.
  - RAG Q&A: Enables users to ask questions based on the uploaded content.
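The two tabs can be laid out with `gr.Blocks` roughly as below; the handler names `upload_and_split` and `answer_question` are placeholders standing in for the script's own callbacks, not its actual function names.

```python
import gradio as gr

def upload_and_split(files):
    # Placeholder: convert the uploaded files with Docling and split them into chunks.
    return f"Split {len(files or [])} file(s) into chunks."

def answer_question(question):
    # Placeholder: run the RAG chain over the indexed chunks.
    return f"(answer to: {question})"

with gr.Blocks() as demo:
    with gr.Tab("Upload & Split"):
        files = gr.File(file_count="multiple", label="Documents")
        split_btn = gr.Button("Split Documents")
        status = gr.Textbox(label="Status")
        split_btn.click(upload_and_split, inputs=files, outputs=status)
    with gr.Tab("RAG Q&A"):
        question = gr.Textbox(label="Question")
        ask_btn = gr.Button("Ask")
        answer = gr.Textbox(label="Answer")
        ask_btn.click(answer_question, inputs=question, outputs=answer)

demo.launch()
```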
Install the following dependencies to run the project:
`pip install docling docling-core python-dotenv langchain-text-splitters langchain-huggingface langchain-milvus gradio`
- **Document Conversion** (a conversion sketch follows this list):
  - The `DocumentConverter` from Docling automatically detects file formats and converts them into a unified text representation.
  - Extracted text is exported in Markdown format for consistency.
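A minimal conversion sketch, assuming Docling's `DocumentConverter.convert()` and `export_to_markdown()`; the file name is illustrative.

```python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()            # input format is detected automatically
result = converter.convert("example.pdf")  # path of an uploaded file (illustrative)
markdown_text = result.document.export_to_markdown()
```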
- **Embedding and Retrieval** (a minimal indexing sketch follows this list):
  - Documents are embedded using the `sentence-transformers/all-MiniLM-L6-v2` model.
  - Embeddings are stored in a temporary Milvus vector database for fast retrieval.
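A minimal indexing sketch using `langchain-huggingface` embeddings and the `langchain-milvus` vector store; the local Milvus Lite URI and the `k` value are assumptions, and `chunks` stands in for the documents produced by the splitter.

```python
from langchain_core.documents import Document
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_milvus import Milvus

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

# Normally produced by the text splitter; a toy chunk keeps the sketch runnable.
chunks = [Document(page_content="Example chunk of converted document text.")]

vectorstore = Milvus.from_documents(
    documents=chunks,
    embedding=embeddings,
    connection_args={"uri": "./milvus_demo.db"},  # Milvus Lite file, illustrative
    drop_old=True,
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
```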
- **RAG Chain** (a chain-wiring sketch follows this list):
  - Retrieved context is formatted into a prompt using LangChain's `PromptTemplate`.
  - HuggingFace's inference endpoint is used to query LLMs such as `Mistral-7B-Instruct`.
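A chain-wiring sketch in LangChain's expression language, assuming a `retriever` like the one above and a HuggingFace Inference API token in the environment; the prompt wording and `repo_id` are illustrative.

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_huggingface import HuggingFaceEndpoint

prompt = PromptTemplate.from_template(
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)

llm = HuggingFaceEndpoint(
    repo_id="mistralai/Mistral-7B-Instruct-v0.2",  # illustrative model id
    max_new_tokens=256,
    temperature=0.1,
)

def format_docs(docs):
    # Join retrieved chunks into a single context string for the prompt.
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

answer = rag_chain.invoke("What is the content of the first page?")
```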
Run the following command to launch the Gradio app:
`python gradio_docling_RAG_langchain.py`
- Navigate to the Upload & Split tab.
- Upload supported file types (PDF, Images, HTML, PPTX).
- Click "Split Documents" to chunk the uploaded files.
- Go to the RAG Q&A tab.
- Input a question related to the uploaded documents.
- Click "Ask" to get an answer based on the document content.
| File Type | Handling Mechanism | Notes |
|---|---|---|
| PDF | `PyPdfiumDocumentBackend` | Extracts text from PDF files. |
| Images | OCR-based (if implemented) | Extracts text from images (requires OCR). |
| HTML | Direct parsing | Extracts text from HTML files. |
| PPTX | Slide content parsing | Extracts text from PowerPoint slides. |

- `.docx` (Word documents): Currently not supported.
- `.xlsx` (Excel files): Parsing not implemented.
- Upload a PDF file.
- Split it into chunks:
  - A 10-page PDF is chunked into 1000-character segments with a 200-character overlap.
- Ask a question:
  - Query: "What is the content of the first page?"
  - The system retrieves relevant chunks and generates an answer using the LLM (an end-to-end sketch follows this list).
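Tying the snippets above together, a condensed end-to-end pass over one PDF could look like the following; every name reuses the hedged examples from earlier sections and remains illustrative rather than the script's actual code path.

```python
# End-to-end pass over one PDF, reusing the pieces sketched above (illustrative).
result = converter.convert("report.pdf")             # 1. Docling conversion
markdown_text = result.document.export_to_markdown()
chunks = splitter.create_documents([markdown_text])  # 2. 1000/200-character chunks
vectorstore = Milvus.from_documents(                 # 3. embed and index in Milvus
    documents=chunks,
    embedding=embeddings,
    connection_args={"uri": "./milvus_demo.db"},
    drop_old=True,
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
# 4. Wire this retriever into the RAG chain (as in the previous sketch) and ask:
print(rag_chain.invoke("What is the content of the first page?"))
```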
- Add support for `.txt` and `.md` file formats.
- Expand compatibility with `.docx` and `.xlsx`.
- Integrate advanced OCR libraries for better image processing.
- Implement additional chunking strategies.
Partha Pratim Ray
GitHub: ParthaPRay