This repository demonstrates how to set up a Retrieval-Augmented Generation (RAG) pipeline using Docling, LangChain, and Colab. The setup covers document processing, embedding generation, vector storage, and querying with a large language model (LLM). The following sections elaborate on the workflow, components, and implementation details.
The pipeline combines:
- Docling: For document loading and conversion into structured formats (Markdown, JSON, YAML, etc.).
- LangChain: For splitting documents, generating embeddings, and querying with a retriever and LLM.
- Colab: For interactive development and leveraging HuggingFace APIs.
The overall workflow is shown below:

```mermaid
flowchart TD
    A[Load PDF Documents] --> B[DoclingPDFLoader]
    B --> C[Convert to Markdown]
    C --> D[Text Splitting with LangChain]
    D --> E[Generate Embeddings with HuggingFace]
    E --> F[Store Vectors in Milvus Vector Store]
    F --> G[Create Retriever]
    G --> H[Define RAG Prompt Template]
    H --> I[Query with HuggingFace LLM]
    I --> J[Retrieve Answers Based on Context]
```
To run the pipeline you need:

- A Colab environment with internet access.
- HuggingFace account for API tokens.
- Python libraries: `docling`, `langchain`, `python-dotenv`, and vector storage tools like `Milvus`.
Install the required libraries:
```python
%pip install -qq docling docling-core python-dotenv langchain-text-splitters langchain-huggingface langchain-milvus
```
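Since the pipeline calls HuggingFace-hosted models, the API token is typically read from a `.env` file via `python-dotenv`. A minimal sketch, assuming the token is stored under the (illustrative) variable name `HF_TOKEN`:

```python
import os

from dotenv import load_dotenv

# Load variables from a local .env file into the process environment.
load_dotenv()

# "HF_TOKEN" is an assumed .env key; rename it to match your own file.
HF_API_KEY = os.environ.get("HF_TOKEN")
```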
The workflow then proceeds through the following steps.

- Documents are loaded using `DoclingPDFLoader`, which supports a single PDF path or a list of paths.
- PDF content is converted into structured text formats (e.g., Markdown).
- Use `RecursiveCharacterTextSplitter` to divide the content into chunks.
- Adjust `chunk_size` and `chunk_overlap` for better granularity.
- Use `HuggingFaceEmbeddings` to generate semantic vector embeddings.
- Select the embedding model (e.g., `sentence-transformers/all-MiniLM-L6-v2`).
- Store document embeddings in a `Milvus` vector database.
- The temporary database is created inside a Python `TemporaryDirectory`.
- Use the retriever to fetch relevant context based on input queries.
- Define a prompt template for the LLM to generate responses.
- Execute queries using the RAG chain to retrieve context-aware answers.
```python
from docling.document_converter import DocumentConverter
from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document as LCDocument


class DoclingPDFLoader(BaseLoader):
    """Load PDFs with Docling and expose them as LangChain documents."""

    def __init__(self, file_path):
        self._file_paths = file_path if isinstance(file_path, list) else [file_path]
        self._converter = DocumentConverter()

    def lazy_load(self):
        for source in self._file_paths:
            dl_doc = self._converter.convert(source).document
            text = dl_doc.export_to_markdown(strict_text=True)
            yield LCDocument(page_content=text)
```
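A short usage sketch of the loader; the path below is a placeholder, and `.load()` simply materializes `lazy_load()` into a list:

```python
# Instantiate the loader with a single path (or a list of paths) and load the documents.
loader = DoclingPDFLoader(file_path="path/to/document.pdf")  # placeholder path
docs = loader.load()  # list of LangChain documents whose page_content is Markdown text
```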
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)
```
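The splitter is then applied to the loaded documents; the resulting `splits` list is what the vector store ingests in the next step (variable names follow the code above and below):

```python
# Split the loaded documents into overlapping chunks for retrieval.
splits = text_splitter.split_documents(docs)
```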
```python
from langchain_huggingface.embeddings import HuggingFaceEmbeddings

HF_EMBED_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"
embeddings = HuggingFaceEmbeddings(model_name=HF_EMBED_MODEL_ID)
```
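As a quick sanity check, the embedding model can be applied to a single string; the query text is illustrative:

```python
# all-MiniLM-L6-v2 produces 384-dimensional sentence embeddings.
query_vector = embeddings.embed_query("What is Docling?")
print(len(query_vector))  # 384
```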
```python
import os
from tempfile import TemporaryDirectory
from langchain_milvus import Milvus

# File-based Milvus database created inside a temporary directory (kept referenced via tmp_dir).
tmp_dir = TemporaryDirectory()
MILVUS_URI = os.path.join(tmp_dir.name, "milvus_demo.db")

vectorstore = Milvus.from_documents(
    splits,
    embeddings,
    connection_args={"uri": MILVUS_URI},
    drop_old=True,
)
```
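The RAG chain below references an `llm` object. One way to define it, sketched here on the assumption that a HuggingFace Inference API endpoint is used through `langchain_huggingface`; the model id is illustrative and `HF_API_KEY` is the token loaded earlier:

```python
from langchain_huggingface import HuggingFaceEndpoint

HF_LLM_MODEL_ID = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # illustrative model id
llm = HuggingFaceEndpoint(
    repo_id=HF_LLM_MODEL_ID,
    huggingfacehub_api_token=HF_API_KEY,
)
```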
```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough


def format_docs(docs):
    # Concatenate the retrieved chunks into a single context string.
    return "\n\n".join(doc.page_content for doc in docs)


retriever = vectorstore.as_retriever()

prompt = PromptTemplate.from_template(
    "Context information is below.\n---------------------\n{context}\n---------------------\n"
    "Given the context information and not prior knowledge, answer the query.\nQuery: {question}\nAnswer:\n"
)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
```
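Before invoking the full chain, the retriever can be queried on its own to inspect which chunks are pulled in as context (a sketch assuming a recent `langchain-core` where retrievers are runnables; the query text is illustrative):

```python
# Fetch the top-matching chunks for a query and preview each one.
retrieved_docs = retriever.invoke("What is Docling?")
for doc in retrieved_docs:
    print(doc.page_content[:200])
```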
- Modular Design: Easily customizable for different documents and queries.
- RAG Pipeline: Combines retrieval and generation for context-aware answers.
- Embedding Flexibility: Supports various HuggingFace embedding models.
- Vector Store Integration: Efficient vector management with Milvus.
- Interactive Queries: Leverages HuggingFace LLM for accurate responses.
Example queries:

```python
rag_chain.invoke("Does Docling implement a linear pipeline of operations?")
rag_chain.invoke("How many pages were human annotated for DocLayNet?")
```
Partha Pratim Ray
GitHub: ParthaPRay