README.md: 46 additions, 25 deletions
# ID-based RAG FastAPI

## Overview

This project integrates Langchain with FastAPI in an asynchronous, scalable manner, providing a framework for document indexing and retrieval using PostgreSQL/pgvector.

Files are organized into embeddings by `file_id`. The primary use case is for integration with [LibreChat](https://librechat.ai), but this simple API can be used for any ID-based use case.
The main reason to use the ID approach is to work with embeddings on a file level.

The API will evolve over time to employ different querying/re-ranking methods, embedding models, and vector stores.
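
For orientation, a hypothetical client sketch against the `/ids` route that appears later in this diff; the base URL assumes the default host and port:

```python
# A minimal sketch, assuming the API is running locally on the default port 8000.
import requests

BASE_URL = "http://localhost:8000"

# List the file_ids that currently have embeddings stored.
response = requests.get(f"{BASE_URL}/ids")
response.raise_for_status()
print(response.json())
```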

## Features

- **Document Management**: Methods for adding, retrieving, and deleting documents.
- **Vector Store**: Utilizes Langchain's vector store for efficient document retrieval.
- **Asynchronous Support**: Offers async operations for enhanced performance.
- Local:
  - Make sure to set `DB_HOST` to the correct database hostname
  - Run the following commands (preferably in a [virtual environment](https://realpython.com/python-virtual-environments-a-primer/))

```bash
pip install -r requirements.txt
uvicorn main:app
```
The following environment variables are required to run the application:

- `RAG_OPENAI_API_KEY`: The API key for OpenAI API Embeddings (if using default settings).
  - Note: `OPENAI_API_KEY` will work but `RAG_OPENAI_API_KEY` will override it in order to not conflict with the LibreChat setting.
- `RAG_OPENAI_BASEURL`: (Optional) The base URL for your OpenAI API Embeddings
- `RAG_OPENAI_PROXY`: (Optional) Proxy for OpenAI API Embeddings
- `VECTOR_DB_TYPE`: (Optional) selects the vector database type; defaults to `pgvector`.
- `RAG_HOST`: (Optional) The hostname or IP address where the API server will run. Defaults to "0.0.0.0"
- `RAG_PORT`: (Optional) The port number where the API server will run. Defaults to port 8000.
- `JWT_SECRET`: (Optional) The secret key used for verifying JWT tokens for requests.
  - The secret is only used for verification. This basic approach assumes a signed JWT from elsewhere (see the signing sketch after this list).
  - Omit to run the API without requiring authentication

- `CONSOLE_JSON`: (Optional) Set to "True" to log as JSON for Cloud Logging aggregations
- `EMBEDDINGS_PROVIDER`: (Optional) either "openai", "bedrock", "azure", "huggingface", "huggingfacetei" or "ollama", where "huggingface" uses sentence_transformers; defaults to "openai"
- `EMBEDDINGS_MODEL`: (Optional) Set a valid embeddings model to use from the configured provider.
  - **Defaults**
    - openai: "text-embedding-3-small"
    - azure: "text-embedding-3-small" (will be used as your Azure Deployment)
    - huggingface: "sentence-transformers/all-MiniLM-L6-v2"
    - huggingfacetei: "http://huggingfacetei:3000". Hugging Face TEI uses the model defined at TEI service launch.
    - ollama: "nomic-embed-text"
    - bedrock: "amazon.titan-embed-text-v1"
- `RAG_AZURE_OPENAI_API_VERSION`: (Optional) Default is `2023-05-15`. The version of the Azure OpenAI API.
- `RAG_AZURE_OPENAI_API_KEY`: (Optional) The API key for Azure OpenAI service.
  - Note: `AZURE_OPENAI_API_KEY` will work but `RAG_AZURE_OPENAI_API_KEY` will override it in order to not conflict with the LibreChat setting.
- `RAG_AZURE_OPENAI_ENDPOINT`: (Optional) The endpoint URL for Azure OpenAI service, including the resource.
  - Example: `https://YOUR_RESOURCE_NAME.openai.azure.com`.
  - Note: `AZURE_OPENAI_ENDPOINT` will work but `RAG_AZURE_OPENAI_ENDPOINT` will override it in order to not conflict with the LibreChat setting.
- `HF_TOKEN`: (Optional) if needed for `huggingface` option.
- `OLLAMA_BASE_URL`: (Optional) defaults to `http://ollama:11434`.
- `ATLAS_SEARCH_INDEX`: (Optional) the name of the vector search index if using Atlas MongoDB, defaults to `vector_index`
- `AWS_ACCESS_KEY_ID`: (Optional) needed for bedrock embeddings
- `AWS_SECRET_ACCESS_KEY`: (Optional) needed for bedrock embeddings

- `DOC_FLTR_ENABLED`: Enables or disables sensitivity label filtering.
  - Type: boolean
  - Accepted values: true, 1, yes (case-insensitive)
  - Default: false if not set
- `DOC_FLTR_ALLOWED_LABELS`: A JSON array of allowed sensitivity labels. If a document's label is not included in this list, it will be rejected when filtering is enabled.
  - Type: JSON list of strings
  - Format: Must be a valid JSON array (e.g., `["public", "confidential"]`)
  - Note: Labels are normalized (trimmed and lowercased). Special characters and spaces are allowed.
  - Default: If unset or an empty array, all labels are allowed
  - Example: `DOC_FLTR_ALLOWED_LABELS=["public", "personal", "confidential", "company name - confidential"]`
- `DOC_FLTR_FILE_TYPES`: A JSON array of allowed file extensions (e.g., "pdf", "docx"). Only these types will be checked for labels.
  - Type: JSON list of strings
  - Note: File extensions should be lowercase and without dots.
  - Default: If unset, defaults to `["pdf", "docx", "xlsx", "pptx"]`
  - Example: `DOC_FLTR_FILE_TYPES=["pdf", "docx"]`
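
As a quick illustration of the matching rule (mirroring `app/utils/sensitivity.py` below): labels are trimmed and lowercased before comparison, and an unset or empty allow-list permits everything. A minimal sketch:

```python
import json
from typing import Optional

def is_label_allowed(label: Optional[str], allowed_raw: Optional[str]) -> bool:
    if label is None:
        return True  # unlabeled documents are always allowed
    allowed = [item.strip().lower() for item in json.loads(allowed_raw)] if allowed_raw else []
    if not allowed:
        return True  # empty allow-list: allow all labels
    return label.strip().lower() in allowed

print(is_label_allowed("  Confidential ", '["public", "confidential"]'))  # True
print(is_label_allowed("secret", '["public", "confidential"]'))           # False
```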

Make sure to set these environment variables before running the application. You can set them in a `.env` file or as system environment variables.
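
Since `JWT_SECRET` is only used for verification, tokens must be signed elsewhere. A minimal signing sketch with PyJWT, assuming the API accepts a standard HS256 bearer token; the claim names are illustrative, not prescribed by the API:

```python
import time

import jwt  # PyJWT

JWT_SECRET = "the-same-secret-configured-for-the-api"  # assumption: shared secret

# Hypothetical claims; adjust to whatever your deployment actually verifies.
token = jwt.encode(
    {"sub": "user-123", "exp": int(time.time()) + 3600},
    JWT_SECRET,
    algorithm="HS256",
)
headers = {"Authorization": f"Bearer {token}"}  # attach to each API request
```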

### Use Atlas MongoDB as Vector Database
```bash
VECTOR_DB_TYPE=atlas-mongo
ATLAS_MONGO_DB_URI=<mongodb uri>
COLLECTION_NAME=<vector collection>
ATLAS_SEARCH_INDEX=<vector search index>
```

The `ATLAS_MONGO_DB_URI` could be the same as or different from what is used by LibreChat. Even if it is the same, the `$COLLECTION_NAME` collection needs to be a completely new one, separate from all collections used by LibreChat. In addition, create a vector search index for the collection above (remember to assign `$ATLAS_SEARCH_INDEX`) with the following JSON:

```json
{
  "fields": [
    {
      "numDimensions": 1536,
      "path": "embedding",
      "similarity": "cosine",
      "type": "vector"
    },
    {
      "path": "file_id",
      "type": "filter"
    }
  ]
}
```

Follow one of the [four documented methods](https://www.mongodb.com/docs/atlas/atlas-vector-search/create-index/#procedure) to create the vector index.
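
As one concrete path, a sketch using pymongo's `create_search_index` (one of the documented methods; available in recent pymongo releases). The URI, database, and collection names are placeholders, and the field layout follows the JSON above:

```python
from pymongo import MongoClient
from pymongo.operations import SearchIndexModel

client = MongoClient("<mongodb uri>")  # same value as ATLAS_MONGO_DB_URI
collection = client["<database>"]["<vector collection>"]

# The index name must match ATLAS_SEARCH_INDEX.
index_model = SearchIndexModel(
    definition={
        "fields": [
            {"numDimensions": 1536, "path": "embedding", "similarity": "cosine", "type": "vector"},
            {"path": "file_id", "type": "filter"},
        ]
    },
    name="vector_index",
    type="vectorSearch",
)
collection.create_search_index(index_model)
```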


### Cloud Installation Settings:

#### AWS:

Make sure your RDS Postgres instance adheres to this requirement:

`The pgvector extension version 0.5.0 is available on database instances in Amazon RDS running PostgreSQL 15.4-R2 and higher, 14.9-R2 and higher, 13.12-R2 and higher, and 12.16-R2 and higher in all applicable AWS Regions, including the AWS GovCloud (US) Regions.`

To set up RDS Postgres with the RAG API, follow these steps:

- Create an RDS instance/cluster using the [AWS documentation](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_CreateDBInstance.html).
- Log in to the RDS cluster using the endpoint connection string from the RDS Console or from your IaC solution output.
  - The login is via the _Master User_.
- Create a dedicated database for rag_api: `create database rag_api;`
- Create a dedicated user/role for that database: `create role rag;`

- Switch to the database you just created: `\c rag_api`
- Enable the vector extension: `create extension vector;`
- Use the documentation provided above to set up the connection string to the RDS Postgres instance/cluster.

Notes:

- Even though you log in with the Master user, it does not have full superuser privileges, which is why you cannot use the command `create role x with superuser;`.
- If you do not enable the extension manually, the rag_api service will fail with an error that it cannot create the extension, because of the privilege restriction above.
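
The same preparation can be scripted. A sketch with psycopg2, assuming the Master user credentials; the host, user, and password are placeholders:

```python
import psycopg2

# Connect as the Master user to the dedicated rag_api database.
conn = psycopg2.connect(
    host="<rds-endpoint>",
    dbname="rag_api",
    user="<master user>",
    password="<master password>",
)
conn.autocommit = True  # apply the DDL immediately
with conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
conn.close()
```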

### Dev notes:

app/routes/document_routes.py: 12 additions, 2 deletions

router = APIRouter()


@router.get("/ids")
async def get_all_ids():
    try:
async def embed_file(
        chunk_size = 64 * 1024  # 64 KB
        while content := await file.read(chunk_size):
            await temp_file.write(content)

        # Run the sensitivity check BEFORE any further processing.
        # Parse the flag the same way app/utils/sensitivity.py does, so that
        # values such as "false" or "0" do not accidentally enable filtering.
        if os.getenv("DOC_FLTR_ENABLED", "").lower() in ("1", "true", "yes"):
            # Lazy import: only load sensitivity functions when filtering is enabled
            from app.utils.sensitivity import (
                assert_sensitivity_allowed,
                detect_sensitivity_label,
            )

            sensitivity_label = await detect_sensitivity_label(temp_file_path, file.filename)
            assert_sensitivity_allowed(sensitivity_label)

            logger.debug("File sensitivity label: %s", sensitivity_label)


    except Exception as e:
        logger.error(
            "Failed to save uploaded file | Path: %s | Error: %s | Traceback: %s",
async def query_embeddings_by_file_ids(body: QueryMultipleBody):
            str(e),
            traceback.format_exc(),
        )
        raise HTTPException(status_code=500, detail=str(e))
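
For a quick sanity check of the new guard, a hypothetical test sketch; the `/embed` route path and the sample file are assumptions, not taken from this diff:

```python
# Hypothetical check, assuming the upload route is POST /embed and that
# DOC_FLTR_ENABLED=true with a restrictive DOC_FLTR_ALLOWED_LABELS.
from fastapi.testclient import TestClient

from main import app

client = TestClient(app)
with open("confidential.docx", "rb") as f:
    response = client.post("/embed", files={"file": ("confidential.docx", f)})

# A disallowed label is rejected before any embedding happens.
assert response.status_code == 400
assert "unauthorized sensitivity level" in response.json()["detail"]
```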
app/utils/sensitivity.py: 132 additions (new file)
import os
import json
import zipfile
from typing import Optional
from xml.etree import ElementTree as ET

import pikepdf
from fastapi import HTTPException
from dotenv import load_dotenv

from app.config import logger

# Load .env
load_dotenv()

def get_env_json_list(key: str) -> list[str]:
    raw_value = os.getenv(key)
    try:
        return [item.strip().lower() for item in json.loads(raw_value)] if raw_value else []
    except json.JSONDecodeError:
        logger.warning(f"Failed to parse {key} as JSON list.")
        return []

def get_env_bool(key: str, default: bool = False) -> bool:
    val = os.getenv(key)
    return val.lower() in ("1", "true", "yes") if val is not None else default

# Configuration
DOC_FLTR_ENABLED = get_env_bool("DOC_FLTR_ENABLED")
DOC_FLTR_ALLOWED_LABELS = get_env_json_list("DOC_FLTR_ALLOWED_LABELS")
DOC_FLTR_FILE_TYPES = get_env_json_list("DOC_FLTR_FILE_TYPES")

SUPPORTED_FILE_TYPES = ["pdf", "docx", "xlsx", "pptx"]

def normalize_label(label: Optional[str]) -> str:
    return label.strip().lower() if label else ""

def is_label_allowed(label: Optional[str]) -> bool:
    if label is None:
        return True  # Always allow files with no label

    if not DOC_FLTR_ENABLED:
        return True

    if not DOC_FLTR_ALLOWED_LABELS:
        return True  # If filtering is on but no labels are defined, allow all

    normalized = normalize_label(label)
    return normalized in DOC_FLTR_ALLOWED_LABELS

def is_doc_type_allowed(filename: str) -> bool:
    file_ext = filename.split(".")[-1].lower()
    if DOC_FLTR_FILE_TYPES:
        return file_ext in DOC_FLTR_FILE_TYPES
    return file_ext in SUPPORTED_FILE_TYPES

def assert_sensitivity_allowed(sensitivity_label: Optional[str]):
    if is_label_allowed(sensitivity_label):
        return

    raise HTTPException(
        status_code=400,
        detail=f"File not processed due to unauthorized sensitivity level: {sensitivity_label}.",
    )

# -------------------------------------------------------
# 📁 Sensitivity Label Extractor
# -------------------------------------------------------

async def detect_sensitivity_label(file_path: str, filename: str) -> Optional[str]:
    if not DOC_FLTR_ENABLED:
        return None

    if not is_doc_type_allowed(filename):
        logger.warning(f"Document type {filename.split('.')[-1]} is not allowed for sensitivity check.")
        return None

    if filename.endswith((".docx", ".xlsx", ".pptx")):
        return extract_office_sensitivity_label(file_path)
    elif filename.endswith(".pdf"):
        return extract_pdf_sensitivity_label(file_path)

    return None

def extract_office_sensitivity_label(file_path: str) -> Optional[str]:
    try:
        with zipfile.ZipFile(file_path, "r") as zipf:
            if "docProps/custom.xml" in zipf.namelist():
                with zipf.open("docProps/custom.xml") as custom_file:
                    xml_content = custom_file.read().decode("utf-8")
                    tree = ET.fromstring(xml_content)

                    ns = {
                        "cp": "http://schemas.openxmlformats.org/officeDocument/2006/custom-properties",
                        "vt": "http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes",
                    }

                    # Sensitivity labels are stored as custom properties whose
                    # names end in "_Name" or contain the watermark text property.
                    for prop in tree.findall("cp:property", ns):
                        name = prop.attrib.get("name", "")
                        if name.endswith("_Name") or "ClassificationWatermarkText" in name:
                            value_elem = prop.find("vt:lpwstr", ns)
                            if value_elem is not None and value_elem.text:
                                return value_elem.text.strip().lower()
    except Exception as e:
        logger.warning("Failed to extract Office label: %s", str(e))

    return None

def extract_pdf_sensitivity_label(file_path: str) -> Optional[str]:
    try:
        with pikepdf.open(file_path) as pdf:
            xmp = pdf.open_metadata()
            xml_content = str(xmp)

            tree = ET.fromstring(xml_content)

            ns = {
                "pdfx": "http://ns.adobe.com/pdfx/1.3/",
                "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
            }

            # Labels appear as pdfx:*_Name elements on rdf:Description nodes.
            for description in tree.findall(".//rdf:Description", ns):
                for elem in description:
                    tag = elem.tag
                    if tag.startswith("{%s}" % ns["pdfx"]) and tag.endswith("_Name") and elem.text:
                        label = elem.text.strip()
                        logger.info(f"Found sensitivity label: {label}")
                        return label

    except Exception as e:
        logger.warning("Failed to extract PDF label: %s", str(e))

    return None
requirements.txt: 5 additions, 1 deletion
cryptography==44.0.1
python-magic==0.4.27
python-pptx==0.6.23
xlrd==2.0.1
pydantic==2.9.2
pikepdf
python-docx
lxml