README.md: 46 additions, 25 deletions
# ID-based RAG FastAPI

## Overview

This project integrates Langchain with FastAPI in an asynchronous, scalable manner, providing a framework for document indexing and retrieval using PostgreSQL/pgvector.

Files are organized into embeddings by `file_id`. The primary use case is for integration with [LibreChat](https://librechat.ai), but this simple API can be used for any ID-based use case.
The main reason to use the ID approach is to work with embeddings on a file level.

The API will evolve over time to employ different querying/re-ranking methods, embedding models, and vector stores.
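
For orientation, a hypothetical client sketch against the `/ids` route that appears later in this diff; the base URL assumes the default host and port:

```python
# A minimal sketch, assuming the API is running locally on the default port 8000.
import requests

BASE_URL = "http://localhost:8000"

# List the file_ids that currently have embeddings stored.
response = requests.get(f"{BASE_URL}/ids")
response.raise_for_status()
print(response.json())
```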

## Features

- **Document Management**: Methods for adding, retrieving, and deleting documents.
- **Vector Store**: Utilizes Langchain's vector store for efficient document retrieval.
- **Asynchronous Support**: Offers async operations for enhanced performance.
- Local:
  - Make sure to set `DB_HOST` to the correct database hostname
  - Run the following commands (preferably in a [virtual environment](https://realpython.com/python-virtual-environments-a-primer/))

```bash
pip install -r requirements.txt
uvicorn main:app
```
The following environment variables are required to run the application:

- `RAG_OPENAI_API_KEY`: The API key for OpenAI API Embeddings (if using default settings).
  - Note: `OPENAI_API_KEY` will work but `RAG_OPENAI_API_KEY` will override it in order to not conflict with the LibreChat setting.
- `RAG_OPENAI_BASEURL`: (Optional) The base URL for your OpenAI API Embeddings
- `RAG_OPENAI_PROXY`: (Optional) Proxy for OpenAI API Embeddings
- `VECTOR_DB_TYPE`: (Optional) selects the vector database type; defaults to `pgvector`.
- `RAG_HOST`: (Optional) The hostname or IP address where the API server will run. Defaults to "0.0.0.0"
- `RAG_PORT`: (Optional) The port number where the API server will run. Defaults to port 8000.
- `JWT_SECRET`: (Optional) The secret key used for verifying JWT tokens for requests.
  - The secret is only used for verification. This basic approach assumes a signed JWT from elsewhere (see the signing sketch after this list).
  - Omit to run the API without requiring authentication

- `CONSOLE_JSON`: (Optional) Set to "True" to log as JSON for Cloud Logging aggregations
- `EMBEDDINGS_PROVIDER`: (Optional) either "openai", "bedrock", "azure", "huggingface", "huggingfacetei" or "ollama", where "huggingface" uses sentence_transformers; defaults to "openai"
- `EMBEDDINGS_MODEL`: (Optional) Set a valid embeddings model to use from the configured provider.
  - **Defaults**
    - openai: "text-embedding-3-small"
    - azure: "text-embedding-3-small" (will be used as your Azure Deployment)
    - huggingface: "sentence-transformers/all-MiniLM-L6-v2"
    - huggingfacetei: "http://huggingfacetei:3000". Hugging Face TEI uses the model defined at TEI service launch.
    - ollama: "nomic-embed-text"
    - bedrock: "amazon.titan-embed-text-v1"
- `RAG_AZURE_OPENAI_API_VERSION`: (Optional) Default is `2023-05-15`. The version of the Azure OpenAI API.
- `RAG_AZURE_OPENAI_API_KEY`: (Optional) The API key for Azure OpenAI service.
  - Note: `AZURE_OPENAI_API_KEY` will work but `RAG_AZURE_OPENAI_API_KEY` will override it in order to not conflict with the LibreChat setting.
- `RAG_AZURE_OPENAI_ENDPOINT`: (Optional) The endpoint URL for Azure OpenAI service, including the resource.
  - Example: `https://YOUR_RESOURCE_NAME.openai.azure.com`.
  - Note: `AZURE_OPENAI_ENDPOINT` will work but `RAG_AZURE_OPENAI_ENDPOINT` will override it in order to not conflict with the LibreChat setting.
- `HF_TOKEN`: (Optional) if needed for `huggingface` option.
- `OLLAMA_BASE_URL`: (Optional) defaults to `http://ollama:11434`.
- `ATLAS_SEARCH_INDEX`: (Optional) the name of the vector search index if using Atlas MongoDB, defaults to `vector_index`
- `AWS_ACCESS_KEY_ID`: (Optional) needed for bedrock embeddings
- `AWS_SECRET_ACCESS_KEY`: (Optional) needed for bedrock embeddings

- `DOC_FLTR_ENABLED`: Enables or disables sensitivity label filtering.
  - Type: boolean
  - Accepted values: true, 1, yes (case-insensitive)
  - Default: false if not set
- `DOC_FLTR_ALLOWED_LABELS`: A JSON array of allowed sensitivity labels. If a document's label is not included in this list, it will be rejected when filtering is enabled.
  - Type: JSON list of strings
  - Format: Must be a valid JSON array (e.g., `["public", "confidential"]`)
  - Note: Labels are normalized (trimmed and lowercased). Special characters and spaces are allowed.
  - Default: If unset or an empty array, all labels are allowed
  - Example: `DOC_FLTR_ALLOWED_LABELS=["public", "personal", "confidential", "company name - confidential"]`
- `DOC_FLTR_FILE_TYPES`: A JSON array of allowed file extensions (e.g., "pdf", "docx"). Only these types will be checked for labels.
  - Type: JSON list of strings
  - Note: File extensions should be lowercase and without dots.
  - Default: If unset, defaults to `["pdf", "docx", "xlsx", "pptx"]`
  - Example: `DOC_FLTR_FILE_TYPES=["pdf", "docx"]`
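
As a quick illustration of the matching rule (mirroring `app/utils/sensitivity.py` below): labels are trimmed and lowercased before comparison, and an unset or empty allow-list permits everything. A minimal sketch:

```python
import json
from typing import Optional

def is_label_allowed(label: Optional[str], allowed_raw: Optional[str]) -> bool:
    if label is None:
        return True  # unlabeled documents are always allowed
    allowed = [item.strip().lower() for item in json.loads(allowed_raw)] if allowed_raw else []
    if not allowed:
        return True  # empty allow-list: allow all labels
    return label.strip().lower() in allowed

print(is_label_allowed("  Confidential ", '["public", "confidential"]'))  # True
print(is_label_allowed("secret", '["public", "confidential"]'))           # False
```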

Make sure to set these environment variables before running the application. You can set them in a `.env` file or as system environment variables.
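
Since `JWT_SECRET` is only used for verification, tokens must be signed elsewhere. A minimal signing sketch with PyJWT, assuming the API accepts a standard HS256 bearer token; the claim names are illustrative, not prescribed by the API:

```python
import time

import jwt  # PyJWT

JWT_SECRET = "the-same-secret-configured-for-the-api"  # assumption: shared secret

# Hypothetical claims; adjust to whatever your deployment actually verifies.
token = jwt.encode(
    {"sub": "user-123", "exp": int(time.time()) + 3600},
    JWT_SECRET,
    algorithm="HS256",
)
headers = {"Authorization": f"Bearer {token}"}  # attach to each API request
```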

### Use Atlas MongoDB as Vector Database
```bash
VECTOR_DB_TYPE=atlas-mongo
ATLAS_MONGO_DB_URI=<mongodb uri>
COLLECTION_NAME=<vector collection>
ATLAS_SEARCH_INDEX=<vector search index>
```

The `ATLAS_MONGO_DB_URI` could be the same as or different from what is used by LibreChat. Even if it is the same, the `$COLLECTION_NAME` collection needs to be a completely new one, separate from all collections used by LibreChat. In addition, create a vector search index for the collection above (remember to assign `$ATLAS_SEARCH_INDEX`) with the following JSON:

```json
{
  "fields": [
    {
      "numDimensions": 1536,
      "path": "embedding",
      "similarity": "cosine",
      "type": "vector"
    },
    {
      "path": "file_id",
      "type": "filter"
    }
  ]
}
```

Follow one of the [four documented methods](https://www.mongodb.com/docs/atlas/atlas-vector-search/create-index/#procedure) to create the vector index.
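
As one concrete path, a sketch using pymongo's `create_search_index` (one of the documented methods; available in recent pymongo releases). The URI, database, and collection names are placeholders, and the field layout follows the JSON above:

```python
from pymongo import MongoClient
from pymongo.operations import SearchIndexModel

client = MongoClient("<mongodb uri>")  # same value as ATLAS_MONGO_DB_URI
collection = client["<database>"]["<vector collection>"]

# The index name must match ATLAS_SEARCH_INDEX.
index_model = SearchIndexModel(
    definition={
        "fields": [
            {"numDimensions": 1536, "path": "embedding", "similarity": "cosine", "type": "vector"},
            {"path": "file_id", "type": "filter"},
        ]
    },
    name="vector_index",
    type="vectorSearch",
)
collection.create_search_index(index_model)
```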


### Cloud Installation Settings:

#### AWS:

Make sure your RDS Postgres instance adheres to this requirement:

`The pgvector extension version 0.5.0 is available on database instances in Amazon RDS running PostgreSQL 15.4-R2 and higher, 14.9-R2 and higher, 13.12-R2 and higher, and 12.16-R2 and higher in all applicable AWS Regions, including the AWS GovCloud (US) Regions.`

To set up RDS Postgres with the RAG API, follow these steps:

- Create an RDS instance/cluster using the [AWS documentation](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_CreateDBInstance.html).
- Log in to the RDS cluster using the endpoint connection string from the RDS Console or from your IaC solution output.
  - The login is via the _Master User_.
- Create a dedicated database for rag_api: `create database rag_api;`
- Create a dedicated user/role for that database: `create role rag;`

- Switch to the database you just created: `\c rag_api`
- Enable the vector extension: `create extension vector;`
- Use the documentation provided above to set up the connection string to the RDS Postgres instance/cluster.

Notes:

- Even though you log in with the Master user, it does not have full superuser privileges, which is why you cannot use the command `create role x with superuser;`.
- If you do not enable the extension manually, the rag_api service will fail with an error that it cannot create the extension, because of the privilege restriction above.
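
The same preparation can be scripted. A sketch with psycopg2, assuming the Master user credentials; the host, user, and password are placeholders:

```python
import psycopg2

# Connect as the Master user to the dedicated rag_api database.
conn = psycopg2.connect(
    host="<rds-endpoint>",
    dbname="rag_api",
    user="<master user>",
    password="<master password>",
)
conn.autocommit = True  # apply the DDL immediately
with conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
conn.close()
```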

### Dev notes:

app/routes/document_routes.py: 12 additions, 2 deletions

router = APIRouter()


@router.get("/ids")
async def get_all_ids():
    try:
async def embed_file(
        chunk_size = 64 * 1024  # 64 KB
        while content := await file.read(chunk_size):
            await temp_file.write(content)

        # Run the sensitivity check BEFORE any further processing.
        # Parse the flag the same way app/utils/sensitivity.py does, so that
        # values such as "false" or "0" do not accidentally enable filtering.
        if os.getenv("DOC_FLTR_ENABLED", "").lower() in ("1", "true", "yes"):
            # Lazy import: only load sensitivity functions when filtering is enabled
            from app.utils.sensitivity import (
                assert_sensitivity_allowed,
                detect_sensitivity_label,
            )

            sensitivity_label = await detect_sensitivity_label(temp_file_path, file.filename)
            assert_sensitivity_allowed(sensitivity_label)

            logger.debug("File sensitivity label: %s", sensitivity_label)


    except Exception as e:
        logger.error(
            "Failed to save uploaded file | Path: %s | Error: %s | Traceback: %s",
async def query_embeddings_by_file_ids(body: QueryMultipleBody):
            str(e),
            traceback.format_exc(),
        )
        raise HTTPException(status_code=500, detail=str(e))
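
For a quick sanity check of the new guard, a hypothetical test sketch; the `/embed` route path and the sample file are assumptions, not taken from this diff:

```python
# Hypothetical check, assuming the upload route is POST /embed and that
# DOC_FLTR_ENABLED=true with a restrictive DOC_FLTR_ALLOWED_LABELS.
from fastapi.testclient import TestClient

from main import app

client = TestClient(app)
with open("confidential.docx", "rb") as f:
    response = client.post("/embed", files={"file": ("confidential.docx", f)})

# A disallowed label is rejected before any embedding happens.
assert response.status_code == 400
assert "unauthorized sensitivity level" in response.json()["detail"]
```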
app/utils/sensitivity.py: 132 additions (new file)
import os
import json
import zipfile
from typing import Optional
from xml.etree import ElementTree as ET

import pikepdf
from fastapi import HTTPException
from dotenv import load_dotenv

from app.config import logger

# Load .env
load_dotenv()

def get_env_json_list(key: str) -> list[str]:
    raw_value = os.getenv(key)
    try:
        return [item.strip().lower() for item in json.loads(raw_value)] if raw_value else []
    except json.JSONDecodeError:
        logger.warning(f"Failed to parse {key} as JSON list.")
        return []

def get_env_bool(key: str, default: bool = False) -> bool:
    val = os.getenv(key)
    return val.lower() in ("1", "true", "yes") if val is not None else default

# Configuration
DOC_FLTR_ENABLED = get_env_bool("DOC_FLTR_ENABLED")
DOC_FLTR_ALLOWED_LABELS = get_env_json_list("DOC_FLTR_ALLOWED_LABELS")
DOC_FLTR_FILE_TYPES = get_env_json_list("DOC_FLTR_FILE_TYPES")

SUPPORTED_FILE_TYPES = ["pdf", "docx", "xlsx", "pptx"]

def normalize_label(label: Optional[str]) -> str:
    return label.strip().lower() if label else ""

def is_label_allowed(label: Optional[str]) -> bool:
    if label is None:
        return True  # Always allow files with no label

    if not DOC_FLTR_ENABLED:
        return True

    if not DOC_FLTR_ALLOWED_LABELS:
        return True  # If filtering is on but no labels are defined, allow all

    normalized = normalize_label(label)
    return normalized in DOC_FLTR_ALLOWED_LABELS

def is_doc_type_allowed(filename: str) -> bool:
    file_ext = filename.split(".")[-1].lower()
    if DOC_FLTR_FILE_TYPES:
        return file_ext in DOC_FLTR_FILE_TYPES
    return file_ext in SUPPORTED_FILE_TYPES

def assert_sensitivity_allowed(sensitivity_label: Optional[str]):
    if is_label_allowed(sensitivity_label):
        return

    raise HTTPException(
        status_code=400,
        detail=f"File not processed due to unauthorized sensitivity level: {sensitivity_label}.",
    )

# -------------------------------------------------------
# 📁 Sensitivity Label Extractor
# -------------------------------------------------------

async def detect_sensitivity_label(file_path: str, filename: str) -> Optional[str]:
    if not DOC_FLTR_ENABLED:
        return None

    if not is_doc_type_allowed(filename):
        logger.warning(f"Document type {filename.split('.')[-1]} is not allowed for sensitivity check.")
        return None

    if filename.endswith((".docx", ".xlsx", ".pptx")):
        return extract_office_sensitivity_label(file_path)
    elif filename.endswith(".pdf"):
        return extract_pdf_sensitivity_label(file_path)

    return None

def extract_office_sensitivity_label(file_path: str) -> Optional[str]:
    try:
        with zipfile.ZipFile(file_path, "r") as zipf:
            if "docProps/custom.xml" in zipf.namelist():
                with zipf.open("docProps/custom.xml") as custom_file:
                    xml_content = custom_file.read().decode("utf-8")
                    tree = ET.fromstring(xml_content)

                    ns = {
                        "cp": "http://schemas.openxmlformats.org/officeDocument/2006/custom-properties",
                        "vt": "http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes",
                    }

                    # Sensitivity labels are stored as custom properties whose
                    # names end in "_Name" or contain the watermark text property.
                    for prop in tree.findall("cp:property", ns):
                        name = prop.attrib.get("name", "")
                        if name.endswith("_Name") or "ClassificationWatermarkText" in name:
                            value_elem = prop.find("vt:lpwstr", ns)
                            if value_elem is not None and value_elem.text:
                                return value_elem.text.strip().lower()
    except Exception as e:
        logger.warning("Failed to extract Office label: %s", str(e))

    return None

def extract_pdf_sensitivity_label(file_path: str) -> Optional[str]:
    try:
        with pikepdf.open(file_path) as pdf:
            xmp = pdf.open_metadata()
            xml_content = str(xmp)

            tree = ET.fromstring(xml_content)

            ns = {
                "pdfx": "http://ns.adobe.com/pdfx/1.3/",
                "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
            }

            # Labels appear as pdfx:*_Name elements on rdf:Description nodes.
            for description in tree.findall(".//rdf:Description", ns):
                for elem in description:
                    tag = elem.tag
                    if tag.startswith("{%s}" % ns["pdfx"]) and tag.endswith("_Name") and elem.text:
                        label = elem.text.strip()
                        logger.info(f"Found sensitivity label: {label}")
                        return label

    except Exception as e:
        logger.warning("Failed to extract PDF label: %s", str(e))

    return None
requirements.txt: 5 additions, 1 deletion
cryptography==44.0.1
python-magic==0.4.27
python-pptx==0.6.23
xlrd==2.0.1
pydantic==2.9.2
pikepdf
python-docx
lxml