Document Ingestion and Text Extraction Service

A document analysis tool that combines AWS Textract, AWS Bedrock, and blur detection to extract text and structured data from documents. It is available as both a local CLI and a serverless Lambda API.

🚀 Quick Start

Prerequisites

Python 3.10+
uv package manager
AWS CLI configured with appropriate permissions

Installation

# Clone the repository first, then set up from its root
cd document-ingestion-and-text-extraction
uv sync

Useful UV Commands for Environment Management

# Activate the virtual environment (Windows)
.venv\Scripts\activate

# Activate the virtual environment (macOS/Linux)
source .venv/bin/activate

# Check which Python is being used
uv run python --version

# Show environment information
uv run python -c "import sys; print(sys.executable)"

# Create a new virtual environment (if needed)
uv venv

# Sync dependencies
uv sync

# Sync into the currently active virtual environment (the one in VIRTUAL_ENV) instead of the project's .venv
uv sync --active

Basic Usage

# Analyze a driver's license locally
uv run python cli.py --file media/license.jpeg --mode tfbq --category license

# Test Lambda function locally
uv run python local_test.py

# Test deployed Lambda API
uv run python test_lambda.py --api-url YOUR_API_URL --file media/license.jpeg --mode tfbq --category license

# Test deployed Lambda API with category auto-detection
uv run python test_lambda.py --api-url YOUR_API_URL --file media/license.jpeg

📋 Table of Contents

Local CLI Usage
Lambda API
Features
Project Structure
Quick Reference & Commands
API Usage Examples
Blur Analysis API Field
Auto-Detection Features
API Reference
Configuration
Advanced Usage
Developer Guide: Custom Queries and Prompts
Troubleshooting
Comparison: Local vs Lambda
Contributing

💻 Local CLI Usage

Command Line Interface

# Basic syntax
uv run python cli.py --file <path> --mode <mode> [options]

Arguments

Argument	Description	Default	Required
`--file`	Path to input file (JPEG/PNG/PDF)	-	✅
`--mode`	Analysis mode letters, combinable (e.g. `fb`): `t` (text), `f` (forms), `b` (tables), `q` (queries)	`tfbq`	❌
`--category`	Document type: `idcard`, `license`, `license-front`, `license-back`, `tnb`, `receipt` (auto-detected if not provided)	-	❌
`--queries`	Custom queries separated by semicolons or newlines	-	❌
`--prompt`	Custom prompt for Bedrock AI extraction	-	❌
`--custom`	Use custom queries/prompts even if category files exist	`False`	❌
`--region`	AWS region	`us-east-1`	❌
`--profile`	AWS profile name	`default`	❌

Common Local Commands

# Full analysis with auto-detection (no category needed)
uv run python cli.py --file media/license.jpeg --mode tfbq --region us-east-1

# Full analysis of a driver's license (explicit category)
uv run python cli.py --file media/license.jpeg --mode tfbq --category license --region us-east-1

# TNB utility bill analysis
uv run python cli.py --file media/tnb-bill.pdf --mode tfbq --category tnb --region us-east-1

# License front side analysis
uv run python cli.py --file media/license-front.jpeg --mode tfbq --category license-front --region us-east-1

# Text extraction only with blur detection
uv run python cli.py --file media/license.jpeg --mode t --region us-east-1

# Forms and tables analysis
uv run python cli.py --file media/license.jpeg --mode fb --region us-east-1

# Auto-detection with custom queries/prompts
uv run python cli.py --file media/license.jpeg --mode tfbq --custom --queries "What is the issuing authority?" --region us-east-1

☁️ Lambda API

Deployment

Option 1: Serverless Framework (Recommended)

# Install dependencies
npm install -g serverless serverless-python-requirements

# Set the Bedrock model (Windows cmd shown; on macOS/Linux use `export` instead of `set`)
set BEDROCK_MODEL=amazon.nova-lite-v1:0

# Deploy to AWS
serverless deploy --region us-east-1

Option 2: Manual Deployment

# Create deployment package
python deploy_lambda.py --function-name document-ingestion-and-text-extraction-api --region us-east-1

Testing Lambda

Local Testing

# Test Lambda function locally (simulates Lambda environment)
uv run python local_test.py

# Test health endpoint
uv run python local_test.py --health

Remote Testing

# Test deployed API
uv run python test_lambda.py --api-url https://your-api-id.execute-api.us-east-1.amazonaws.com/dev/analyze --file media/receipt.pdf --mode tfbq --category receipt

# Create web test interface
uv run python test_lambda.py --create-html
# Then open test_lambda.html in browser

Lambda Runtime

By default (LAMBDA_RUNTIME=false), the CLI runs in local mode with OpenCV support for blur detection:

set LAMBDA_RUNTIME=false

To simulate the Lambda environment locally, where OpenCV is unavailable and blur detection relies on Textract confidence analysis only, set:

set LAMBDA_RUNTIME=true

Do not put spaces around `=`; in Windows cmd they become part of the variable name and value. On macOS/Linux, use `export` instead of `set`.

Lambda API Usage

JSON Request Format

curl -X POST https://your-api-id.execute-api.us-east-1.amazonaws.com/dev/analyze \
  -H "Content-Type: application/json" \
  -d '{
    "file_content": "<base64-encoded-file-content>",
    "filename": "document.pdf",
    "mode": "tfbq",
    "custom": false,
    "region": "us-east-1"
  }'

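Note: On Windows, a large base64 payload may cause the shell to report "The input line is too long." In that case, send the request from a script instead.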

OR using Python script:

uv run python test_api.py

Note: Update base64_content in test_api.py with your base64-encoded file content.

Health Check

curl https://your-api-id.execute-api.us-east-1.amazonaws.com/dev/health

🎯 Features

1. AWS Textract Integration

Text Detection: Extract text with confidence scores
Form Analysis: Key-value pair extraction
Table Analysis: Structured table data extraction
Query Analysis: Answer specific questions about documents
Auto Category Detection: Automatically detect document type using AI

2. Intelligent Blur Detection

Local: OpenCV Laplacian variance + Textract confidence analysis
Lambda: Enhanced Textract confidence analysis with statistical metrics
Metrics: Average, median, and standard deviation of OCR confidence
Quality Assessment: Excellent, good, fair, or poor ratings
API Integration: Structured blur_analysis field in Lambda responses

3. AWS Bedrock Integration

Structured Extraction: Convert documents to structured JSON
Document Categories: Specialized prompts for different document types
Auto Category Detection: AI-powered document classification
Custom Mode: Override category-based prompts and queries
AI-Powered: Uses a Bedrock model (selected via BEDROCK_MODEL) for intelligent data extraction

4. Dual Deployment Options

Local CLI: Full-featured command-line interface
Lambda API: Serverless REST API with automatic scaling
Consistent Results: Same analysis quality in both environments

📁 Project Structure

document-ingestion-and-text-extraction/
├── src/                          # Core source code
│   ├── __init__.py               # Package initialization
│   ├── main.py                   # Main CLI logic
│   ├── textract_enhanced.py      # Textract integration
│   ├── bedrock_mapper.py         # Bedrock integration
│   ├── category_detector.py      # Auto-detection logic
│   ├── blur_detection.py         # Blur detection logic
│   ├── logger.py                 # Logging utilities
│   ├── sample_response.json      # Sample API response for reference
│   ├── prompts/                  # Bedrock prompts
│   │   ├── idcard.txt
│   │   ├── license.txt
│   │   ├── license-front.txt
│   │   ├── license-back.txt
│   │   ├── receipt.txt
│   │   └── tnb.txt
│   └── queries/                  # Textract queries
│       ├── idcard.txt
│       ├── license.txt
│       ├── license-front.txt
│       ├── license-back.txt
│       ├── receipt.txt
│       └── tnb.txt
├── media/                        # Sample test files
│   ├── blur.jpg                  # Blurry test image
│   ├── exceed-5mb.pdf            # Large file test
│   ├── exceed-pages.pdf          # Multi-page test
│   ├── half-blur.jpg             # Partially blurred image
│   ├── license.jpeg              # Driver's license sample
│   ├── mingjia-license.jpg       # License sample
│   ├── receipt.pdf               # Receipt sample
│   ├── tnb.png                   # TNB utility bill sample
│   └── unsupported-file-type.xlsx # Unsupported format test
├── log/                          # Local analysis results
│   └── {filename}_{timestamp}/   # Individual analysis logs
│       ├── textract.log          # Complete processing log
│       ├── text.json             # Text detection results
│       ├── forms.json            # Form analysis results
│       ├── tables.json           # Table analysis results
│       ├── queries.json          # Query analysis results
│       ├── blur_analysis.json    # Blur detection results
│       └── category_detection.json # Auto-detection results
├── output/                       # Extracted structured data
├── .env                          # Environment variables (not tracked)
├── .gitignore                    # Git ignore patterns
├── api.py                        # Standalone API server
├── cli.py                        # CLI entry point
├── lambda_handler.py             # Lambda function handler
├── local_test.py                 # Local Lambda testing
├── test_api.py                   # API testing with requests
├── test_lambda.py                # Lambda API testing
├── test_lambda.html              # Web-based Lambda testing interface (Generate with `python test_lambda.py --create-html`)
├── deploy_lambda.py              # Deployment script
├── serverless.yml                # Serverless Framework config
├── package.json                  # Node.js dependencies for Serverless
├── package-lock.json             # Node.js lock file
├── pyproject.toml                # Project dependencies
├── requirements.txt              # Lambda-specific dependencies
├── uv.lock                       # UV package manager lock file
└── README.md                     # This file

🚀 Quick Reference & Commands

Essential Commands

# Basic Setup
uv sync                                    # Install/update dependencies

# Local Analysis (Recommended)
uv run python cli.py --file media/license.jpeg --mode tfbq    # Auto-detect + full analysis
uv run python cli.py --file media/receipt.pdf --mode tfbq     # Receipt analysis
uv run python cli.py --file media/document.pdf --mode t       # Text extraction only

# Lambda Testing
uv run python local_test.py                                   # Test Lambda locally
serverless deploy --region us-east-1                          # Deploy to AWS
uv run python test_lambda.py --api-url YOUR_URL --file media/license.jpeg

Analysis Modes

Mode	Description	Use Case	Speed
`t`	Text detection only	Quick text extraction	⚡⚡⚡
`f`	Forms analysis	Key-value pairs	⚡⚡
`b`	Tables analysis	Structured table data	⚡⚡
`q`	Query analysis	Answer specific questions	⚡
`tfbq`	All analysis types	Complete document analysis (recommended)	⚡
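
As a rough illustration of how a mode string relates to Textract, the sketch below maps each letter to the work it implies. The `plan_textract_calls` helper is hypothetical and not taken from this project. `FORMS`, `TABLES`, and `QUERIES` are the Textract `AnalyzeDocument` feature types, while plain text detection corresponds to `DetectDocumentText`.

```python
# Hypothetical helper: the project's real mode parsing may differ.
MODE_FEATURES = {"f": "FORMS", "b": "TABLES", "q": "QUERIES"}
VALID_MODES = set("tfbq")


def plan_textract_calls(mode: str) -> dict:
    """Translate a mode string such as 'tfbq' into the Textract work it implies."""
    letters = list(dict.fromkeys(mode.lower()))  # de-duplicate, keep order
    unknown = [c for c in letters if c not in VALID_MODES]
    if not letters or unknown:
        raise ValueError(f"invalid mode {mode!r}; use letters from 'tfbq'")
    return {
        # 't' means plain text detection (DetectDocumentText)
        "detect_text": "t" in letters,
        # the remaining letters become AnalyzeDocument feature types
        "feature_types": [MODE_FEATURES[c] for c in letters if c in MODE_FEATURES],
    }


print(plan_textract_calls("tfbq"))
# {'detect_text': True, 'feature_types': ['FORMS', 'TABLES', 'QUERIES']}
```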

Document Categories

Category	Documents	Auto-Detect	Manual Specify
Auto	All supported documents	✅ Recommended	Omit `--category` (e.g. `--mode tfbq`)
`license`	Driver's licenses (any side)	✅	`--category license`
`license-front`	License front side only	✅	`--category license-front`
`license-back`	License back side only	✅	`--category license-back`
`idcard`	ID cards, national IDs	✅	`--category idcard`
`receipt`	Purchase receipts	✅	`--category receipt`
`tnb`	TNB utility bills	✅	`--category tnb`

Command Examples by Use Case

# Quick Analysis (Most Common)
uv run python cli.py --file document.pdf --mode tfbq          # Full auto-analysis
uv run python cli.py --file document.pdf --mode t            # Text only (fastest)

# Specific Document Types
uv run python cli.py --file license.jpg --mode tfbq --category license
uv run python cli.py --file receipt.pdf --mode tfbq --category receipt
uv run python cli.py --file bill.pdf --mode tfbq --category tnb

# Custom Analysis
uv run python cli.py --file document.pdf --mode q --queries "What is the date?;What is the amount?"
uv run python cli.py --file document.pdf --mode tfb --prompt "Extract as JSON: name, date, amount"

# API Testing
uv run python local_test.py                                   # Test locally
uv run python test_lambda.py --create-html                    # Create web interface

🌐 API Usage Examples

JavaScript (Browser)

// File upload and analysis
const fileInput = document.getElementById('file');
const file = fileInput.files[0];
const reader = new FileReader();

reader.onload = async function(e) {
  const fileContent = e.target.result.split(',')[1];

  const response = await fetch('https://your-api-id.execute-api.us-east-1.amazonaws.com/dev/analyze', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      file_content: fileContent,
      filename: file.name,
      mode: 'tfbq',
      custom: false,
      region: 'us-east-1'
    })
  });

  const result = await response.json();

  // Access blur analysis
  if (result.blur_analysis) {
    const blur = result.blur_analysis;
    const textract = blur.textract_analysis;
    const overall = blur.overall_assessment;

    console.log(`Quality: ${textract.quality_assessment}`);
    console.log(`Is Blurry: ${overall.is_blurry}`);
    console.log(`Confidence: ${overall.confidence_level}`);
    console.log(`Median Confidence: ${textract.median_confidence.toFixed(2)}%`);

    // Quality-based processing
    if (textract.quality_assessment === 'excellent') {
      console.log('High quality image - proceed with confidence');
    } else if (overall.is_blurry) {
      console.log('Blurry image detected - results may be less accurate');
    }
  }
};

reader.readAsDataURL(file);

Python (Requests)

import requests
import base64

# Read and encode file
with open('media/license.jpeg', 'rb') as f:
    file_content = base64.b64encode(f.read()).decode('utf-8')

# Make API call
response = requests.post(
    'https://your-api-id.execute-api.us-east-1.amazonaws.com/dev/analyze',
    json={
        'file_content': file_content,
        'filename': 'license.jpeg',
        'mode': 'tfbq',
        'custom': False,
        'region': 'us-east-1'
    }
)

result = response.json()

# Access blur analysis
if 'blur_analysis' in result:
    blur_info = result['blur_analysis']
    print(f"Quality: {blur_info['textract_analysis']['quality_assessment']}")
    print(f"Is Blurry: {blur_info['overall_assessment']['is_blurry']}")
    print(f"Confidence: {blur_info['overall_assessment']['confidence_level']}")

cURL

# Encode file to base64 without line wrapping (GNU coreutils; on macOS use: base64 -i media/license.jpeg)
FILE_CONTENT=$(base64 -w 0 media/license.jpeg)

# Make API call
curl -X POST https://your-api-id.execute-api.us-east-1.amazonaws.com/dev/analyze \
  -H "Content-Type: application/json" \
  -d "{
    \"file_content\": \"$FILE_CONTENT\",
    \"filename\": \"license.jpeg\",
    \"mode\": \"tfbq\",
    \"custom\": false,
    \"region\": \"us-east-1\"
  }"

📊 Blur Analysis API Field

The Lambda API now includes a dedicated blur_analysis field that provides comprehensive image quality assessment:

{
  "blur_analysis": {
    "laplacian": {
      "method": "laplacian",
      "score": 4743.317898724083,
      "is_blurry": false,
      "quality": "sharp"
    },
    "textract_analysis": {
      "total_items": 53,
      "min_confidence": 34.21656036376953,
      "max_confidence": 99.990234375,
      "median_confidence": 96.84188079833984,
      "average_confidence": 95.69545777638753,
      "std_confidence": 11.519045489650455,
      "low_confidence_count": 22,
      "low_confidence_percentage": 41.509433962264154,
      "likely_blurry": false,
      "quality_assessment": "excellent"
    },
    "overall_assessment": {
      "is_blurry": false,
      "blur_indicators": [],
      "confidence_level": "high"
    }
  }
}

Blur Analysis Complete Structure

Laplacian Analysis (computed locally only; Lambda returns default values)

Field	Type	Description	Values
`method`	string	Analysis method used	`"laplacian"`
`score`	float	Laplacian variance score	`0.0+` (higher = sharper)
`is_blurry`	boolean	Laplacian-based blur detection	`true`, `false`
`quality`	string	Laplacian-based quality assessment	`"sharp"`, `"moderate"`, `"blurry"`
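
To make the `score` field more concrete, here is a minimal pure-Python sketch of Laplacian variance on a grayscale grid. It is an illustration only: the function name is hypothetical, only interior pixels are evaluated, and the project itself relies on OpenCV, where the same idea is commonly written as `cv2.Laplacian(gray, cv2.CV_64F).var()`. The thresholds that separate `"sharp"`, `"moderate"`, and `"blurry"` are not documented here, so none are assumed.

```python
from statistics import pvariance


def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over the interior pixels of a 2D grid."""
    height, width = len(gray), len(gray[0])
    if height < 3 or width < 3:
        raise ValueError("need at least a 3x3 image")
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Kernel [[0, 1, 0], [1, -4, 1], [0, 1, 0]] applied at (x, y)
            responses.append(
                gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
                - 4 * gray[y][x]
            )
    return pvariance(responses)


flat = [[128] * 5 for _ in range(5)]  # no edges at all
checker = [[255 * ((x + y) % 2) for x in range(5)] for y in range(5)]  # many sharp edges
print(laplacian_variance(flat), laplacian_variance(checker))
```

A uniform image yields a variance of zero, while an image full of hard edges yields a large value, which is why higher scores indicate sharper input.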

Textract Analysis

Field	Type	Description	Values
`total_items`	integer	Number of text items detected	`0+`
`min_confidence`	float	Lowest confidence score	`0.0 - 100.0`
`max_confidence`	float	Highest confidence score	`0.0 - 100.0`
`median_confidence`	float	Median confidence score	`0.0 - 100.0`
`average_confidence`	float	Average confidence score	`0.0 - 100.0`
`std_confidence`	float	Standard deviation of confidence scores	`0.0+`
`low_confidence_count`	integer	Number of items below 85% confidence	`0+`
`low_confidence_percentage`	float	Percentage of low-confidence items	`0.0 - 100.0`
`likely_blurry`	boolean	Textract-based blur assessment	`true`, `false`
`quality_assessment`	string	Overall quality rating	`"excellent"`, `"good"`, `"fair"`, `"poor"`
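
The statistical fields above can be reproduced from a list of Textract confidence scores. The sketch below is an assumption about how they might be computed, not the project's code: the function name is hypothetical, and it uses the population standard deviation, which may differ from the implementation. The 85% threshold comes from the `low_confidence_count` description.

```python
from statistics import mean, median, pstdev

LOW_CONFIDENCE_THRESHOLD = 85.0  # from the low_confidence_count description


def summarize_confidences(confidences):
    """Compute the textract_analysis statistics from per-item confidence scores."""
    if not confidences:
        raise ValueError("no text items detected")
    low = sum(1 for c in confidences if c < LOW_CONFIDENCE_THRESHOLD)
    return {
        "total_items": len(confidences),
        "min_confidence": min(confidences),
        "max_confidence": max(confidences),
        "median_confidence": median(confidences),
        "average_confidence": mean(confidences),
        "std_confidence": pstdev(confidences),  # assumption: population std
        "low_confidence_count": low,
        "low_confidence_percentage": 100.0 * low / len(confidences),
    }


print(summarize_confidences([99.0, 97.0, 80.0, 60.0]))
```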

Overall Assessment

Field	Type	Description	Values
`is_blurry`	boolean	Final blur detection result	`true`, `false`
`blur_indicators`	array	Methods that detected blur	`[]`, `["textract"]`, `["laplacian"]`, `["laplacian", "textract"]`
`confidence_level`	string	Confidence in the assessment	`"high"`, `"medium"`, `"low"`

Blur Detection Algorithm

Quality Assessment Criteria

Quality	Median Confidence	Average Confidence	Description
Excellent	> 95%	> 90%	Very high quality, clear text
Good	> 90%	> 85%	Good quality, readable text
Fair	> 85%	> 80%	Acceptable quality, mostly readable
Poor	≤ 85%	≤ 80%	Poor quality, difficult to read
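
The criteria table can be read as a cascade of threshold checks. The sketch below assumes that both the median and the average condition must hold for a rating, which the table does not state explicitly; the function name is hypothetical.

```python
def quality_assessment(median_confidence: float, average_confidence: float) -> str:
    """Rate quality from the table above (assumes both thresholds must be met)."""
    if median_confidence > 95 and average_confidence > 90:
        return "excellent"
    if median_confidence > 90 and average_confidence > 85:
        return "good"
    if median_confidence > 85 and average_confidence > 80:
        return "fair"
    return "poor"


# Values from the sample blur_analysis response
print(quality_assessment(96.84, 95.70))  # excellent
```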

Blur Detection Logic

An image is considered blurry if ANY of these conditions are met:

Very low median confidence: median_confidence < 80.0
Very low average confidence: average_confidence < 75.0
High percentage of poor items: low_confidence_percentage > 50.0 (>50% below 85%)
Extreme inconsistency with poor quality: std_confidence > 20.0 AND median_confidence < 85.0
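
Written as code, the four conditions above combine with a logical OR. This is a minimal sketch using the thresholds listed; the function name is hypothetical and the project's implementation may be organized differently.

```python
def likely_blurry(
    median_confidence: float,
    average_confidence: float,
    low_confidence_percentage: float,
    std_confidence: float,
) -> bool:
    """Return True if any of the documented blur conditions is met."""
    return (
        median_confidence < 80.0
        or average_confidence < 75.0
        or low_confidence_percentage > 50.0
        or (std_confidence > 20.0 and median_confidence < 85.0)
    )


# Values from the sample blur_analysis response
print(likely_blurry(96.84, 95.70, 41.51, 11.52))  # False
```

Note that high variance alone does not flag an image: the last condition only applies when the median confidence is also below 85.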

Confidence Level Determination

Level	Median	Average	Low Conf %	Description
High	> 95%	> 90%	< 20%	Very confident assessment
High	> 90%	> 85%	< 35%	Confident assessment
Medium	> 85%	> 80%	< 50%	Moderately confident
Low	≤ 85%	≤ 80%	≥ 50%	Low confidence assessment

Blur Indicators Interpretation

Indicators	Meaning	Confidence
`[]`	No blur detected by any method	High
`["textract"]`	Only confidence analysis detected blur	Medium
`["laplacian"]`	Only image analysis detected blur (local only)	Medium
`["laplacian", "textract"]`	Both methods detected blur (local only)	Very High
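
The interpretation table can be expressed as a small lookup. In the sketch below the helper names are hypothetical, and treating the image as blurry whenever at least one method flags it is an assumption rather than documented behavior.

```python
# Interpretation table copied from the document; helper names are hypothetical.
INTERPRETATION = {
    (): ("No blur detected by any method", "High"),
    ("textract",): ("Only confidence analysis detected blur", "Medium"),
    ("laplacian",): ("Only image analysis detected blur (local only)", "Medium"),
    ("laplacian", "textract"): ("Both methods detected blur (local only)", "Very High"),
}


def collect_indicators(laplacian_blurry, textract_blurry):
    """Build blur_indicators; pass laplacian_blurry=None when OpenCV is unavailable."""
    indicators = []
    if laplacian_blurry:
        indicators.append("laplacian")
    if textract_blurry:
        indicators.append("textract")
    return indicators


def interpret(indicators):
    """Look up the meaning and confidence for a blur_indicators list."""
    meaning, confidence = INTERPRETATION[tuple(sorted(indicators))]
    # Assumption: any indicator at all means the image is treated as blurry.
    return {"is_blurry": bool(indicators), "meaning": meaning, "confidence": confidence}


print(interpret(collect_indicators(None, True)))
```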

🔄 Auto-Detection Features

Document Category Detection

The system automatically detects document categories using AI analysis of extracted text, forms, and tables:

# Auto-detection (recommended - no --category needed)
uv run python cli.py --file document.pdf --mode tfbq

How it works:

Initial Analysis: Extracts text, forms, and tables using Textract
AI Classification: Uses a Bedrock model to analyze content and classify document type
Category Assignment: Applies detected category for queries and prompts
Results Saved: Detection results saved to category_detection.json

Supported Categories:

idcard - Identity cards, national IDs, employee IDs
license - Driver's license, driving permits (combined/single-sided)
license-front - Front side of driver's license specifically
license-back - Back side of driver's license specifically
tnb - TNB utility bills, electricity bills
receipt - Purchase receipts, invoices from retail stores

Detection Confidence:

High confidence (0.7-1.0): Very reliable classification
Medium confidence (0.4-0.7): Moderately reliable
Low confidence (0.0-0.4): Less reliable, may need manual verification
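
A caller could turn the `confidence` value from `category_detection` into one of these bands as sketched below. The listed ranges overlap at 0.4 and 0.7, so this example assumes the boundary values belong to the higher band; the function name is hypothetical.

```python
def detection_reliability(confidence: float) -> str:
    """Map a detection confidence in [0, 1] to a reliability band."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0.0 and 1.0")
    if confidence >= 0.7:  # assumption: 0.7 counts as high
        return "high"
    if confidence >= 0.4:  # assumption: 0.4 counts as medium
        return "medium"
    return "low"


print(detection_reliability(0.95))  # high
```

A low band is a reasonable trigger for falling back to an explicit `--category`.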

Manual Category Override

# Specify category explicitly (skips auto-detection)
uv run python cli.py --file document.pdf --mode tfbq --category tnb

Custom Mode

Use --custom to override category-based files with your own queries/prompts:

# Custom mode with explicit queries (ignores category query files)
uv run python cli.py --file document.pdf --mode q --custom --queries "What is the date?;What is the amount?"

# Custom mode with explicit prompt (ignores category prompt files)
uv run python cli.py --file document.pdf --mode tfb --custom --prompt "Extract all dates as JSON"

# Custom mode for categories without extensive default files
uv run python cli.py --file license-back.jpeg --mode q --category license-back --queries "What is the license number?"

Custom Mode Rules:

If --custom is used but no custom queries or prompts are provided, the system checks for category files
All categories have supporting files, but --custom can override them
Use custom mode to test new queries or prompts for existing categories

📚 API Reference

Local CLI Response

=== TEXT DETECTION ===
text = "Sample Text" | confidence = 99.89

=== FORM ANALYSIS ===
Key: Value pairs extracted from forms

=== TABLE ANALYSIS ===
Structured table data with rows and columns

=== QUERY ANALYSIS ===
Q: What is the transaction amount?
A: $100.00

=== BLUR DETECTION ===
Textract confidence - Median: 99.89, Avg: 99.75, Std: 0.51
Quality assessment: excellent
Overall: CLEAR (confidence: high)

=== BEDROCK EXTRACTION ===
{
  "transaction_date": "2025-09-15",
  "transaction_amount": "$100.00",
  "beneficiary_name": "John Doe"
}

Lambda API Response

{
  "status": "success",
  "console_output": "Processing log...",
  "text": [{ "text": "Sample Text", "confidence": 99.89 }],
  "forms": {
    "Key": ["Value"]
  },
  "tables": {
    "tables": [{ "table_id": 1, "rows": [["Cell1", "Cell2"]] }]
  },
  "queries": {
    "What is the amount?": "$100.00"
  },
  "category_detection": {
    "detected_category": "receipt",
    "confidence": 0.95,
    "timestamp": "2025-09-15T12:34:56+00:00"
  },
  "blur_analysis": {
    "textract_analysis": {
      "median_confidence": 99.89,
      "average_confidence": 99.62,
      "std_confidence": 0.51,
      "quality_assessment": "excellent"
    },
    "overall_assessment": {
      "is_blurry": false,
      "confidence_level": "high"
    }
  },
  "extracted_data": {
    "transaction_date": "2025-09-15",
    "transaction_amount": "$100.00"
  }
}

Error Response

{
  "error": "Error description",
  "returncode": 1,
  "stdout": "...",
  "stderr": "..."
}
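
Client code has to distinguish the success and error shapes shown above. The following sketch assumes only the fields that appear in those two examples; the `summarize_response` helper is hypothetical.

```python
def summarize_response(payload: dict) -> str:
    """Produce a one-line summary for either a success or an error payload."""
    if "error" in payload:
        return f"failed (returncode={payload.get('returncode')}): {payload['error']}"
    category = payload.get("category_detection", {}).get("detected_category", "unknown")
    blurry = (
        payload.get("blur_analysis", {})
        .get("overall_assessment", {})
        .get("is_blurry")
    )
    return f"{payload.get('status', 'unknown')}: category={category}, blurry={blurry}"


error_payload = {"error": "Error description", "returncode": 1, "stdout": "...", "stderr": "..."}
print(summarize_response(error_payload))
# failed (returncode=1): Error description
```

Using `.get` with defaults keeps the helper tolerant of responses where optional sections such as `category_detection` are absent.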

🔧 Configuration

Environment Variables (.env)

Create a .env file in the project root to configure AWS credentials and API endpoints:

# AWS Credentials (choose one method)
# Method 1: Direct credentials (for development)
AWS_ACCESS_KEY_ID=your_access_key_here
AWS_SECRET_ACCESS_KEY=your_secret_key_here

# Method 2: Use AWS CLI profile (recommended for production)
# Leave AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY empty to use AWS CLI profile
# Configure with: aws configure --profile your-profile-name

# Lambda API Endpoints (optional - for testing deployed APIs)
ANALYZE_API_URL=https://your-api-id.execute-api.us-east-1.amazonaws.com/dev/analyze
ANALYZE_HEALTH_API_URL=https://your-api-id.execute-api.us-east-1.amazonaws.com/dev/health

# Runtime Configuration
BEDROCK_MODEL=amazon.nova-lite-v1:0  # Change to desired Bedrock model
LAMBDA_RUNTIME=false                 # Set to 'true' to simulate Lambda environment locally
AWS_REGION=us-east-1                 # Default AWS region
# AWS_PROFILE=default                # AWS CLI profile to use

Environment Variable Details:

Variable	Required	Description	Example
`AWS_ACCESS_KEY_ID`	✅ (unless using an AWS CLI profile)	AWS access key for authentication	`AKIAIOSFODNN7EXAMPLE`
`AWS_SECRET_ACCESS_KEY`	✅ (unless using an AWS CLI profile)	AWS secret key for authentication	`wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY`
`ANALYZE_API_URL`	❌	Lambda API endpoint for document analysis	`https://abc123.execute-api.us-east-1.amazonaws.com/dev/analyze`
`ANALYZE_HEALTH_API_URL`	❌	Lambda API health check endpoint	`https://abc123.execute-api.us-east-1.amazonaws.com/dev/health`
`LAMBDA_RUNTIME`	❌	Simulate Lambda environment locally	`true` or `false` (default: `false`)
`AWS_REGION`	❌	AWS region for services	`us-east-1`, `us-west-2`, etc.
`AWS_PROFILE`	❌	AWS CLI profile name	`default`, `dev`, `prod`
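
For orientation, this is one way a script could read these variables with the defaults listed above. It is a sketch, not the project's loader: the `load_settings` name and the returned keys are hypothetical, and the way `LAMBDA_RUNTIME` is parsed (case-insensitive `true`) is an assumption.

```python
import os


def load_settings(env=None) -> dict:
    """Read the documented environment variables, applying the documented defaults."""
    env = os.environ if env is None else env
    return {
        "region": env.get("AWS_REGION", "us-east-1"),
        "profile": env.get("AWS_PROFILE", "default"),
        "bedrock_model": env.get("BEDROCK_MODEL"),  # no default assumed here
        # Assumption: only the literal value "true" enables Lambda simulation
        "lambda_runtime": env.get("LAMBDA_RUNTIME", "false").strip().lower() == "true",
        # Static keys are used only when both are present; otherwise the profile applies
        "use_static_credentials": bool(
            env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY")
        ),
    }


print(load_settings({"LAMBDA_RUNTIME": "true"}))
```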

Setup Instructions:

Copy the template above to create your .env file
Choose authentication method:
- Direct credentials: Fill in AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
- AWS CLI profile: Leave credentials empty, set AWS_PROFILE to your profile name
Configure API URLs (only needed for testing deployed Lambda functions)
Set region and other optional variables as needed

Security Notes:

Never commit .env to version control (it's already in .gitignore)
Use AWS CLI profiles for production environments
Rotate credentials regularly and use IAM roles when possible
Use least-privilege permissions (see AWS Permissions Required below)

AWS Permissions Required

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "textract:DetectDocumentText",
        "textract:AnalyzeDocument",
        "bedrock:InvokeModel"
      ],
      "Resource": "*"
    }
  ]
}

Supported File Types

PDF: Up to 11 pages, max 5 MB
JPEG/JPG: Max 5 MB
PNG: Max 5 MB
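
A quick pre-flight check can catch unsupported files before any AWS call is made. The sketch below is hypothetical: it assumes 5 MB means 5 × 1024 × 1024 bytes, and it does not check the PDF page limit because that requires a PDF parser.

```python
from pathlib import Path

MAX_BYTES = 5 * 1024 * 1024  # assumption: "5 MB" interpreted as 5 MiB
SUPPORTED_SUFFIXES = {".pdf", ".jpg", ".jpeg", ".png"}


def check_upload(filename: str, size_bytes: int) -> list:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    suffix = Path(filename).suffix.lower()
    if suffix not in SUPPORTED_SUFFIXES:
        problems.append(f"unsupported file type: {suffix or '(none)'}")
    if size_bytes > MAX_BYTES:
        problems.append(f"file is {size_bytes} bytes, limit is {MAX_BYTES}")
    return problems


print(check_upload("unsupported-file-type.xlsx", 1000))
# ['unsupported file type: .xlsx']
```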

Lambda Limitations

Request Size: 6 MB (affects base64 file uploads)
Timeout: 5 minutes maximum
Memory: Configurable up to 10 GB
Blur Detection: Uses Textract confidence analysis (no OpenCV)
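
The request size limit matters because base64 encoding inflates a file by roughly one third. The arithmetic below illustrates this; it assumes the 6 MB limit means 6 × 1024 × 1024 bytes and ignores the small overhead of the surrounding JSON fields.

```python
import base64
import math

LAMBDA_REQUEST_LIMIT = 6 * 1024 * 1024  # assumption: 6 MB interpreted as 6 MiB


def base64_length(raw_bytes: int) -> int:
    """Length of the padded base64 encoding of raw_bytes bytes."""
    return 4 * math.ceil(raw_bytes / 3)


# Sanity check against the standard library on a small buffer
assert base64_length(1000) == len(base64.b64encode(b"x" * 1000))

five_mb = 5 * 1024 * 1024
encoded = base64_length(five_mb)
print(encoded, encoded > LAMBDA_REQUEST_LIMIT)  # 6990508 True

# Largest raw file whose encoding alone still fits in the limit
max_raw = (LAMBDA_REQUEST_LIMIT // 4) * 3
print(max_raw)  # 4718592 bytes, i.e. 4.5 MiB
```

Under these assumptions, a file at the 5 MB ceiling cannot be sent through the Lambda API as base64; the practical ceiling is about 4.5 MB.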

📝 Advanced Usage

Custom Queries (`--queries`)

Provide custom questions for Textract to answer about the document:

# Single query
uv run python cli.py --file document.pdf --mode q --queries "What is the total amount?"

# Multiple queries (semicolon or newline separated)
uv run python cli.py --file document.pdf --mode q --queries "What is the date?;What is the amount?;Who is the recipient?"

# Multiline format
uv run python cli.py --file document.pdf --mode q --queries "What is the transaction date?
What is the reference number?
What is the beneficiary name?"

Query Best Practices:

Ask specific, direct questions
Use clear, simple language
Questions should be answerable from visible text
Avoid overly complex or interpretive questions
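
The semicolon and newline formats are equivalent. The sketch below shows one way such a string could be split into individual queries; `parse_queries` is a hypothetical name and the CLI's actual parsing may differ.

```python
import re


def parse_queries(raw: str) -> list:
    """Split a --queries value on semicolons or newlines, dropping empty entries."""
    return [part.strip() for part in re.split(r"[;\n]", raw) if part.strip()]


print(parse_queries("What is the date?;What is the amount?\nWho is the recipient?"))
# ['What is the date?', 'What is the amount?', 'Who is the recipient?']
```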

Custom Prompts (`--prompt`)

Provide custom instructions for Bedrock AI to extract structured data:

# Basic extraction
uv run python cli.py --file document.pdf --mode tfb --prompt "Extract all monetary amounts and dates as JSON"

# Structured JSON extraction
uv run python cli.py --file receipt.pdf --mode tfb --prompt "Extract: {\"merchant\": \"store name\", \"total\": \"amount as number\", \"date\": \"YYYY-MM-DD format\"}"

# Bank receipt extraction
uv run python cli.py --file bank-receipt.pdf --mode tfb --prompt "Extract transaction details: amount, date, beneficiary name, reference ID as JSON"

Prompt Guidelines:

Specify desired output format (JSON recommended)
Define field names and data types
Include formatting instructions (date formats, etc.)
Specify how to handle missing data (use null)

Argument Combinations

# Auto-detection with custom queries
uv run python cli.py --file document.pdf --mode tfbq --queries "Additional question?"

# Explicit category with custom prompt
uv run python cli.py --file document.pdf --mode tfb --category receipt --prompt "Custom extraction prompt"

# Custom mode overriding category files
uv run python cli.py --file document.pdf --mode tfbq --category license --custom --queries "Custom questions" --prompt "Custom prompt"

# Bank receipt with required custom content
uv run python cli.py --file bank-receipt.pdf --mode q --category bank-receipt --custom --queries "What is the transaction amount?"

API Response Structure

{
  "status": "success",
  "console_output": "Processing log...",
  "text": [{"text": "...", "confidence": 99.5}],
  "forms": {"key": "value"},
  "tables": [{"headers": [], "rows": []}],
  "queries": {"question": "answer"},
  "blur_analysis": {
    "laplacian": {"score": 4743.32, "is_blurry": false, "quality": "sharp"},
    "textract_analysis": {"median_confidence": 96.84, "quality_assessment": "excellent"},
    "overall_assessment": {"is_blurry": false, "confidence_level": "high"}
  },
  "extracted_data": {"field": "value"}
}

Environment Variables

export AWS_REGION=us-east-1
export AWS_PROFILE=default

Testing Auto-Detection

# Run the test script
uv run python test_auto_detection.py

# Manual testing
uv run python cli.py --file media/license.jpeg --mode tfbq
# Check log/{filename}_{timestamp}/category_detection.json for results

📝 Developer Guide: Custom Queries and Prompts

Writing Custom Queries

Queries are questions that Textract will attempt to answer based on the document content.

Query Best Practices:

Be Specific: Ask for exact information you need

✅ Good: "What is the expiry date?"
❌ Avoid: "What are the dates?"

Use Clear Language: Simple, direct questions work best

✅ Good: "What is the full name?"
✅ Good: "What is the license class?"
✅ Good: "What is the address?"

Avoid Duplicates: Don't repeat queries from category files

# Check existing queries first
cat src/queries/license.txt

Format Correctly: Separate multiple queries with semicolons or new lines

# Using semicolons
--queries "What is the license number?;What is the transaction amount?;What is the account number?"

# Using new lines (in scripts or multi-line input)
--queries "What is the license number?
What is the transaction amount?
What is the account number?"

Actual Query Examples by Document Type:

Driver's License (src/queries/license.txt):

What is the date of birth?
What is the expiry date?
What is the license validity period?
What is the license number?
What is the string below license number?
What is the license class?
What is the address?

License Front (src/queries/license-front.txt):

what is the identity No.?
What is the date of birth?
what is the nationality?
What is the license class?
What is the license validity period?
What is the address?

License Back (src/queries/license-back.txt):

What is the license number?

Receipt (src/queries/receipt.txt):

Who is the beneficiary?
What is the beneficiary account number?
Which bank is receiving the payment?
What is the recipient reference?
What is the reference ID?
What are the payment details / description?
What is the transaction amount?
When was the transaction successfully completed?

ID Card (src/queries/idcard.txt):

What is the full name?
What is the ID number?
What is the address?
What is the gender?

TNB Bill (src/queries/tnb.txt):

What is the No. Akaun (account number)?
What is the No. Invois (invoice number)?

Writing Custom Prompts

Prompts are used by Bedrock AI for structured data extraction in src/prompts/{category}.txt.

Actual Prompt Examples:

Driver's License (src/prompts/license.txt):

Extract Malaysian driving license fields from the provided data.

Return STRICTLY valid JSON matching this schema:
{
  "full_name": string|null,
  "identity_no": string|null,
  "license_number": string|null,
}

CRITICAL RULES:
- ONLY extract data that is EXPLICITLY present in the input
- DO NOT make up or guess any values
- If a field is not found, use null
- Full name: ONLY if explicitly found in the data
- Identity number: ONLY from "No. Pengenalan / Identity No." field
- license number: ONLY a combination of 2 parts
  * first part, 7-digit numeric codes that are clearly license numbers (NOT dates, NOT identity numbers), e.g. "1234567"
  * second part, 8-digit alphanumeric codes that are a randomised mix of upper/lowercase letters and/or numbers, e.g. "AbC12xYz"
  * join the two parts with a space in between, e.g. "1234567 AbC12xYz"
- Return only valid JSON, no explanations

Receipt (src/prompts/receipt.txt):

Extract the following information from this Payment Receipt text and return as JSON:

{
  "transaction_amount": "Transaction amount as displayed",
  "transaction_type": "Transaction type as printed",
  "merchant": "Merchant name as printed",
  "payment_method": "Payment method as printed",
  "date_time": "Date and time of transaction",
  "wallet_reference": "Wallet reference number as printed",
  "transaction_status": "Transaction status as printed",
  "transaction_number": "Transaction number as printed"
}

Rules:
- Return only valid JSON.
- Use the exact text as printed, don't interpret.
- If any field is missing, return null instead of skipping.
- For "transaction_amount", always return a positive value.

ID Card (src/prompts/idcard.txt):

Extract the following information from this ID card text and return as JSON:

{
  "full_name": "Full name of the ID holder",
  "userId": "ID card number",
  "gender": "Gender if available",
  "address": "Full address if available",
}

Rules:
- Use null for missing information
- Extract exact text, don't interpret
- Return only valid JSON

TNB Bill (src/prompts/tnb.txt):

Extract the following information from this TNB Bill text and return as JSON:

{
  "account_number": "Exact value of No. Akaun as printed on the bill",
  "invoice_number": "Exact value of No. Invois as printed on the bill"
}

Rules:
- Use null for any field that is not found
- Extract exact text, don't interpret or reformat
- Return only valid JSON

Testing Custom Queries and Prompts:

# Custom queries only
uv run python cli.py --file document.pdf --mode q --queries "Your question?"

# Category + custom queries
uv run python cli.py --file document.pdf --mode tfbq --category license --queries "Additional question?"

# Custom prompt for AI extraction
uv run python cli.py --file document.pdf --mode tfb --prompt "Extract specific fields as JSON"

# Via Lambda API with custom prompt
uv run python test_lambda.py --file document.pdf --prompt "Your custom prompt" --api-url YOUR_URL

Custom Prompt Engineering

The --prompt parameter allows you to override category-based prompts for Bedrock AI extraction, enabling rapid prototyping and adaptation to new document types.

Custom Prompt Examples:

Simple Extraction:

--prompt "Extract the name, date, and amount from this document and return as JSON."

Structured JSON Output:

--prompt "Extract the following information and return as JSON:
{
  \"document_type\": \"type of document\",
  \"issuer\": \"issuing organization\",
  \"recipient\": \"recipient name\",
  \"date_issued\": \"date in YYYY-MM-DD format\",
  \"amount\": \"monetary amount as number\",
  \"reference_number\": \"reference or ID number\"
}"

Receipt Analysis:

--prompt "Analyze this receipt and extract:
{
  \"merchant\": \"store name\",
  \"date\": \"transaction date\",
  \"total\": \"total amount\",
  \"tax\": \"tax amount\",
  \"items\": [\"list of purchased items\"]
}
Return only valid JSON."

Invoice Processing:

--prompt "Extract invoice details as JSON:
{
  \"invoice_number\": \"invoice ID\",
  \"vendor\": \"vendor name\",
  \"customer\": \"customer name\",
  \"date\": \"invoice date\",
  \"due_date\": \"payment due date\",
  \"subtotal\": \"subtotal amount\",
  \"tax_rate\": \"tax percentage\",
  \"total\": \"total amount\"
}"

Prompt Best Practices:

Specify Output Format: Always request JSON for structured data
Define Field Names: Use clear, consistent field names
Handle Missing Data: Instruct to use null for missing information
Format Guidelines: Specify date formats, number formats, etc.
Validation Rules: Add constraints for better accuracy
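Writing JSON schemas inline on the command line requires escaping every quote, which is easy to get wrong. One possible approach is to build the prompt text in Python from a mapping of field names to descriptions. The sketch below assumes nothing about the project's internals, and `build_prompt` is a hypothetical helper.

```python
import json


def build_prompt(document_label, fields):
    """Build an extraction prompt from a mapping of field name -> description.

    The wording mirrors the category prompts shown earlier: a JSON schema
    followed by rules for missing data and output format.
    """
    schema = json.dumps(fields, indent=2, ensure_ascii=False)
    return (
        f"Extract the following information from this {document_label} "
        f"and return as JSON:\n{schema}\n"
        "Rules:\n"
        "- Use null for missing information\n"
        "- Extract exact text, don't interpret\n"
        "- Return only valid JSON"
    )


prompt = build_prompt(
    "invoice",
    {"invoice_number": "invoice ID", "date": "invoice date in YYYY-MM-DD format"},
)
```

The resulting string can be passed as the value of --prompt. If the CLI is launched from Python, passing the arguments as a list to subprocess.run avoids shell quoting altogether.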

🐛 Troubleshooting

Common Issues

Local Development

# Module not found errors
uv sync

# AWS credentials not configured
aws configure

# Permission denied errors: confirm which identity and account your credentials resolve to
aws sts get-caller-identity

Lambda Deployment

# Serverless deployment fails
npm install -g serverless serverless-python-requirements

# Function timeout
# Increase timeout in serverless.yml or use smaller files

# Memory errors
# Increase memory allocation in serverless.yml

API Testing

# Test local Lambda function
uv run python local_test.py

# Test deployed API
uv run python test_lambda.py --api-url YOUR_URL --file media/license.jpeg --mode t

# List the CloudWatch log groups for the Lambda function
aws logs describe-log-groups --log-group-name-prefix /aws/lambda/document-ingestion-and-text-extraction-api

File Size Issues

Local: No limit imposed by the CLI itself (AWS service limits still apply)
Lambda: 6 MB request limit for base64 encoded files (~4.5 MB original file)
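The ~4.5 MB figure follows from base64 turning every 3 input bytes into 4 output characters. A small sketch of a pre-upload check is shown here. The function names and the `overhead` allowance for the surrounding JSON fields are assumptions, not project code.

```python
import os

# 6 MB request limit quoted above, expressed in bytes.
LAMBDA_REQUEST_LIMIT = 6 * 1024 * 1024


def base64_size(raw_bytes):
    """Length in characters of the padded base64 encoding of raw_bytes bytes."""
    return 4 * ((raw_bytes + 2) // 3)


def fits_lambda_limit(path, overhead=1024):
    """Return True if the file, once base64 encoded, should fit in one request.

    overhead is a rough, assumed allowance for the rest of the JSON body.
    """
    return base64_size(os.path.getsize(path)) + overhead <= LAMBDA_REQUEST_LIMIT
```

Files that fail this check can be processed with the local CLI instead.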

Performance Tips

Use --mode t for fastest processing (text only)
Smaller files process faster
Lambda has cold start delay (~1-3 seconds)

📊 Comparison: Local vs Lambda

Feature	Local CLI	Lambda API
Deployment	No deployment needed	Serverless deployment
Scaling	Single instance	Auto-scaling
File Upload	Direct file path	Base64 in JSON
AWS Credentials	Local AWS config	IAM role
Blur Detection	Full OpenCV analysis	Textract confidence analysis + API field
Timeout	No limit	5 minutes
File Size	AWS service limits	6 MB request limit
Cost	Compute + AWS services	Lambda + AWS services
Cold Start	None	1-3 seconds

🤝 Contributing

Fork the repository
Create a feature branch
Make your changes
Test both local and Lambda versions
Submit a pull request

📄 License

This project is licensed under the MIT License.

Happy Document Analysis! 🎉
