AI-powered document analysis service combining AWS Textract, Bedrock, and intelligent blur detection. Supports CLI and serverless Lambda API for Malaysian documents (licenses, receipts, ID cards, utility bills).

MyGovHub-Goodbye-World/document-ingestion-and-text-extraction

Document Ingestion and Text Extraction Service

A comprehensive document analysis tool that combines AWS Textract, Bedrock, and intelligent blur detection. Available as both CLI and serverless Lambda API.

© 2025 Goodbye World team, for use in the Great AI Hackathon Malaysia 2025.

🚀 Quick Start

Prerequisites

  • Python 3.10+
  • uv package manager
  • AWS CLI configured with appropriate permissions

Installation

# Clone and setup
cd document-ingestion-and-text-extraction
uv sync

Useful UV Commands for Environment Management

# Activate the virtual environment (Windows; on macOS/Linux use: source .venv/bin/activate)
.venv\Scripts\activate

# Check which Python is being used
uv run python --version

# Show environment information
uv run python -c "import sys; print(sys.executable)"

# Create a new virtual environment (if needed)
uv venv

# Sync dependencies
uv sync

# Sync with active environment (ignores conflicting VIRTUAL_ENV)
uv sync --active

Basic Usage

# Analyze a driver's license locally
uv run python cli.py --file media/license.jpeg --mode tfbq --category license

# Test Lambda function locally
uv run python local_test.py

# Test deployed Lambda API
uv run python test_lambda.py --api-url YOUR_API_URL --file media/license.jpeg --mode tfbq --category license

# Test deployed Lambda API with category auto-detection
uv run python test_lambda.py --api-url YOUR_API_URL --file media/license.jpeg

💻 Local CLI Usage

Command Line Interface

# Basic syntax
uv run python cli.py --file <path> --mode <mode> [options]

Arguments

| Argument | Description | Default | Required |
|---|---|---|---|
| --file | Path to input file (JPEG/PNG/PDF) | - | ✅ |
| --mode | Analysis mode: t(ext), f(orms), b(tables), q(uery) | tfbq | ❌ |
| --category | Document type: idcard, license, license-front, license-back, tnb, receipt (auto-detected if not provided) | - | ❌ |
| --queries | Custom queries separated by semicolons or newlines | - | ❌ |
| --prompt | Custom prompt for Bedrock AI extraction | - | ❌ |
| --custom | Use custom queries/prompts even if category files exist | False | ❌ |
| --region | AWS region | us-east-1 | ❌ |
| --profile | AWS profile name | default | ❌ |

Common Local Commands

# Full analysis with auto-detection (no category needed)
uv run python cli.py --file media/license.jpeg --mode tfbq --region us-east-1

# Full analysis of a driver's license (explicit category)
uv run python cli.py --file media/license.jpeg --mode tfbq --category license --region us-east-1

# TNB utility bill analysis
uv run python cli.py --file media/tnb-bill.pdf --mode tfbq --category tnb --region us-east-1

# License front side analysis
uv run python cli.py --file media/license-front.jpeg --mode tfbq --category license-front --region us-east-1

# Text extraction only with blur detection
uv run python cli.py --file media/license.jpeg --mode t --region us-east-1

# Forms and tables analysis
uv run python cli.py --file media/license.jpeg --mode fb --region us-east-1

# Auto-detection with custom queries/prompts
uv run python cli.py --file media/license.jpeg --mode tfbq --custom --queries "What is the issuing authority?" --region us-east-1

☁️ Lambda API

Deployment

Option 1: Serverless Framework (Recommended)

# Install dependencies
npm install -g serverless serverless-python-requirements

# Select the Bedrock model (change to your desired model)
set BEDROCK_MODEL=amazon.nova-lite-v1:0

# Deploy to AWS
serverless deploy --region us-east-1

Option 2: Manual Deployment

# Create deployment package
python deploy_lambda.py --function-name document-ingestion-and-text-extraction-api --region us-east-1

Testing Lambda

Local Testing

# Test Lambda function locally (simulates Lambda environment)
uv run python local_test.py

# Test health endpoint
uv run python local_test.py --health

Remote Testing

# Test deployed API
uv run python test_lambda.py --api-url https://your-api-id.execute-api.us-east-1.amazonaws.com/dev/analyze --file media/receipt.pdf --mode tfbq --category receipt

# Create web test interface
uv run python test_lambda.py --create-html
# Then open test_lambda.html in browser

LAMBDA RUNTIME

By default, the CLI runs in local mode, with OpenCV support for blur detection. This is equivalent to setting the following environment variable (Windows cmd syntax; note that there are no spaces around the = sign):

SET LAMBDA_RUNTIME=false

To simulate the Lambda environment locally (Textract confidence-based blur detection only, no OpenCV), set:

SET LAMBDA_RUNTIME=true
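
As a rough illustration of how this flag could gate the blur checks, here is a minimal sketch. The helper names `use_lambda_runtime` and `blur_methods` are hypothetical, and the case-insensitive comparison against `true` is an assumption rather than the project's actual parsing.

```python
import os


def use_lambda_runtime(env=None):
    """Return True when LAMBDA_RUNTIME is set to 'true' (assumed case-insensitive)."""
    env = os.environ if env is None else env
    return env.get("LAMBDA_RUNTIME", "false").strip().lower() == "true"


def blur_methods(env=None):
    """List the blur checks that would run in the selected mode."""
    if use_lambda_runtime(env):
        # Lambda-like behavior: Textract confidence analysis only
        return ["textract"]
    # Local mode: OpenCV Laplacian variance plus Textract confidence analysis
    return ["laplacian", "textract"]
```

Passing a plain dictionary instead of the real environment keeps the helper easy to try out without changing shell variables.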

Lambda API Usage

JSON Request Format

curl -X POST https://your-api-id.execute-api.us-east-1.amazonaws.com/dev/analyze \
  -H "Content-Type: application/json" \
  -d '{
    "file_content": "<base64-encoded-file-content>",
    "filename": "document.pdf",
    "mode": "tfbq",
    "custom": false,
    "region": "us-east-1"
  }'

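Note: With a large base64 payload, the shell may reject this command with "The input line is too long." (a Windows Command Prompt line-length limit).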

OR using Python script:

uv run python test_api.py

Note: Update base64_content in test_api.py with your base64-encoded file content.

Health Check

curl https://your-api-id.execute-api.us-east-1.amazonaws.com/dev/health

🎯 Features

1. AWS Textract Integration

  • Text Detection: Extract text with confidence scores
  • Form Analysis: Key-value pair extraction
  • Table Analysis: Structured table data extraction
  • Query Analysis: Answer specific questions about documents
  • Auto Category Detection: Automatically detect document type using AI

2. Intelligent Blur Detection

  • Local: OpenCV Laplacian variance + Textract confidence analysis
  • Lambda: Enhanced Textract confidence analysis with statistical metrics
  • Metrics: Average, median, and standard deviation of OCR confidence
  • Quality Assessment: Excellent, good, fair, or poor ratings
  • API Integration: Structured blur_analysis field in Lambda responses

3. AWS Bedrock Integration

  • Structured Extraction: Convert documents to structured JSON
  • Document Categories: Specialized prompts for different document types
  • Auto Category Detection: AI-powered document classification
  • Custom Mode: Override category-based prompts and queries
  • AI-Powered: Uses Claude AI for intelligent data extraction

4. Dual Deployment Options

  • Local CLI: Full-featured command-line interface
  • Lambda API: Serverless REST API with automatic scaling
  • Consistent Results: Same analysis quality in both environments

📁 Project Structure

document-ingestion-and-text-extraction/
├── src/                          # Core source code
│   ├── __init__.py               # Package initialization
│   ├── main.py                   # Main CLI logic
│   ├── textract_enhanced.py      # Textract integration
│   ├── bedrock_mapper.py         # Bedrock integration
│   ├── category_detector.py      # Auto-detection logic
│   ├── blur_detection.py         # Blur detection logic
│   ├── logger.py                 # Logging utilities
│   ├── sample_response.json      # Sample API response for reference
│   ├── prompts/                  # Bedrock prompts
│   │   ├── idcard.txt
│   │   ├── license.txt
│   │   ├── license-front.txt
│   │   ├── license-back.txt
│   │   ├── receipt.txt
│   │   └── tnb.txt
│   └── queries/                  # Textract queries
│       ├── idcard.txt
│       ├── license.txt
│       ├── license-front.txt
│       ├── license-back.txt
│       ├── receipt.txt
│       └── tnb.txt
├── media/                        # Sample test files
│   ├── blur.jpg                  # Blurry test image
│   ├── exceed-5mb.pdf            # Large file test
│   ├── exceed-pages.pdf          # Multi-page test
│   ├── half-blur.jpg             # Partially blurred image
│   ├── license.jpeg              # Driver's license sample
│   ├── mingjia-license.jpg       # License sample
│   ├── receipt.pdf               # Receipt sample
│   ├── tnb.png                   # TNB utility bill sample
│   └── unsupported-file-type.xlsx # Unsupported format test
├── log/                          # Local analysis results
│   └── {filename}_{timestamp}/   # Individual analysis logs
│       ├── textract.log          # Complete processing log
│       ├── text.json             # Text detection results
│       ├── forms.json            # Form analysis results
│       ├── tables.json           # Table analysis results
│       ├── queries.json          # Query analysis results
│       ├── blur_analysis.json    # Blur detection results
│       └── category_detection.json # Auto-detection results
├── output/                       # Extracted structured data
├── .env                          # Environment variables (not tracked)
├── .gitignore                    # Git ignore patterns
├── api.py                        # Standalone API server
├── cli.py                        # CLI entry point
├── lambda_handler.py             # Lambda function handler
├── local_test.py                 # Local Lambda testing
├── test_api.py                   # API testing with requests
├── test_lambda.py                # Lambda API testing
├── test_lambda.html              # Web-based Lambda testing interface (generate with `python test_lambda.py --create-html`)
├── deploy_lambda.py              # Deployment script
├── serverless.yml                # Serverless Framework config
├── package.json                  # Node.js dependencies for Serverless
├── package-lock.json             # Node.js lock file
├── pyproject.toml                # Project dependencies
├── requirements.txt              # Lambda-specific dependencies
├── uv.lock                       # UV package manager lock file
└── README.md                     # This file

🚀 Quick Reference & Commands

Essential Commands

# Basic Setup
uv sync                                    # Install/update dependencies

# Local Analysis (Recommended)
uv run python cli.py --file media/license.jpeg --mode tfbq    # Auto-detect + full analysis
uv run python cli.py --file media/receipt.pdf --mode tfbq     # Receipt analysis
uv run python cli.py --file media/document.pdf --mode t       # Text extraction only

# Lambda Testing
uv run python local_test.py                                   # Test Lambda locally
serverless deploy --region us-east-1                          # Deploy to AWS
uv run python test_lambda.py --api-url YOUR_URL --file media/license.jpeg

Analysis Modes

| Mode | Description | Use Case | Speed |
|---|---|---|---|
| t | Text detection only | Quick text extraction | ⚡⚡⚡ |
| f | Forms analysis | Key-value pairs | ⚡⚡ |
| b | Tables analysis | Structured table data | ⚡⚡ |
| q | Query analysis | Answer specific questions | ⚡ |
| tfbq | All analysis types | Complete document analysis (recommended) | ⚡ |
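
To make the mode string concrete, the sketch below expands it one flag at a time. `MODE_FLAGS` and `parse_mode` are illustrative names, not functions from this repository, and rejecting unknown letters is an assumption about how invalid input might be handled.

```python
# Letter-to-analysis mapping taken from the table above
MODE_FLAGS = {"t": "text", "f": "forms", "b": "tables", "q": "queries"}


def parse_mode(mode="tfbq"):
    """Expand a mode string such as 'tfb' into an ordered list of analysis names."""
    selected = []
    for flag in mode.lower():
        if flag not in MODE_FLAGS:
            raise ValueError(f"Unknown mode flag: {flag!r}")
        name = MODE_FLAGS[flag]
        if name not in selected:  # ignore repeated flags
            selected.append(name)
    return selected
```

For example, `parse_mode("fb")` selects forms and tables only, matching the "Forms and tables analysis" command shown earlier.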

Document Categories

| Category | Documents | Auto-Detect | Manual Specify |
|---|---|---|---|
| Auto | All supported documents | ✅ Recommended | --mode tfbq |
| license | Driver's licenses (any side) | ✅ | --category license |
| license-front | License front side only | ✅ | --category license-front |
| license-back | License back side only | ✅ | --category license-back |
| idcard | ID cards, national IDs | ✅ | --category idcard |
| receipt | Purchase receipts | ✅ | --category receipt |
| tnb | TNB utility bills | ✅ | --category tnb |

Command Examples by Use Case

# Quick Analysis (Most Common)
uv run python cli.py --file document.pdf --mode tfbq          # Full auto-analysis
uv run python cli.py --file document.pdf --mode t            # Text only (fastest)

# Specific Document Types
uv run python cli.py --file license.jpg --mode tfbq --category license
uv run python cli.py --file receipt.pdf --mode tfbq --category receipt
uv run python cli.py --file bill.pdf --mode tfbq --category tnb

# Custom Analysis
uv run python cli.py --file document.pdf --mode q --queries "What is the date?;What is the amount?"
uv run python cli.py --file document.pdf --mode tfb --prompt "Extract as JSON: name, date, amount"

# API Testing
uv run python local_test.py                                   # Test locally
uv run python test_lambda.py --create-html                    # Create web interface

🌐 API Usage Examples

JavaScript (Browser)

// File upload and analysis
const fileInput = document.getElementById('file');
const file = fileInput.files[0];
const reader = new FileReader();

reader.onload = async function(e) {
  const fileContent = e.target.result.split(',')[1];

  const response = await fetch('https://your-api-id.execute-api.us-east-1.amazonaws.com/dev/analyze', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      file_content: fileContent,
      filename: file.name,
      mode: 'tfbq',
      custom: false,
      region: 'us-east-1'
    })
  });

  const result = await response.json();

  // Access blur analysis
  if (result.blur_analysis) {
    const blur = result.blur_analysis;
    const textract = blur.textract_analysis;
    const overall = blur.overall_assessment;

    console.log(`Quality: ${textract.quality_assessment}`);
    console.log(`Is Blurry: ${overall.is_blurry}`);
    console.log(`Confidence: ${overall.confidence_level}`);
    console.log(`Median Confidence: ${textract.median_confidence.toFixed(2)}%`);

    // Quality-based processing
    if (textract.quality_assessment === 'excellent') {
      console.log('High quality image - proceed with confidence');
    } else if (overall.is_blurry) {
      console.log('Blurry image detected - results may be less accurate');
    }
  }
};

reader.readAsDataURL(file);

Python (Requests)

import requests
import base64

# Read and encode file
with open('media/license.jpeg', 'rb') as f:
    file_content = base64.b64encode(f.read()).decode('utf-8')

# Make API call
response = requests.post(
    'https://your-api-id.execute-api.us-east-1.amazonaws.com/dev/analyze',
    json={
        'file_content': file_content,
        'filename': 'license.jpeg',
        'mode': 'tfbq',
        'custom': False,
        'region': 'us-east-1'
    }
)

result = response.json()

# Access blur analysis
if 'blur_analysis' in result:
    blur_info = result['blur_analysis']
    print(f"Quality: {blur_info['textract_analysis']['quality_assessment']}")
    print(f"Is Blurry: {blur_info['overall_assessment']['is_blurry']}")
    print(f"Confidence: {blur_info['overall_assessment']['confidence_level']}")

cURL

# Encode file to base64
FILE_CONTENT=$(base64 -w 0 media/license.jpeg)

# Make API call
curl -X POST https://your-api-id.execute-api.us-east-1.amazonaws.com/dev/analyze \
  -H "Content-Type: application/json" \
  -d "{
    \"file_content\": \"$FILE_CONTENT\",
    \"filename\": \"license.jpeg\",
    \"mode\": \"tfbq\",
    \"custom\": false,
    \"region\": \"us-east-1\"
  }"

📊 Blur Analysis API Field

The Lambda API now includes a dedicated blur_analysis field that provides comprehensive image quality assessment:

{
  "blur_analysis": {
    "laplacian": {
      "method": "laplacian",
      "score": 4743.317898724083,
      "is_blurry": false,
      "quality": "sharp"
    },
    "textract_analysis": {
      "total_items": 53,
      "min_confidence": 34.21656036376953,
      "max_confidence": 99.990234375,
      "median_confidence": 96.84188079833984,
      "average_confidence": 95.69545777638753,
      "std_confidence": 11.519045489650455,
      "low_confidence_count": 22,
      "low_confidence_percentage": 41.509433962264154,
      "likely_blurry": false,
      "quality_assessment": "excellent"
    },
    "overall_assessment": {
      "is_blurry": false,
      "blur_indicators": [],
      "confidence_level": "high"
    }
  }
}

Blur Analysis Complete Structure

Laplacian Analysis (Local only, defaults in Lambda)

| Field | Type | Description | Values |
|---|---|---|---|
| method | string | Analysis method used | "laplacian" |
| score | float | Laplacian variance score | 0.0+ (higher = sharper) |
| is_blurry | boolean | Laplacian-based blur detection | true, false |
| quality | string | Laplacian-based quality assessment | "sharp", "moderate", "blurry" |
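
For readers unfamiliar with the Laplacian score, the sketch below computes the variance of a 4-neighbour Laplacian in plain Python over a grid of grayscale values. It is only an illustration: the project uses OpenCV locally (a common idiom is `cv2.Laplacian(gray, cv2.CV_64F).var()`), border handling differs from OpenCV, and `laplacian_variance` is a hypothetical name. No blur threshold is shown because none is documented here.

```python
import statistics


def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over the interior pixels of a 2-D grid."""
    rows, cols = len(gray), len(gray[0])
    responses = []
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            # Sum of the four neighbours minus four times the centre pixel
            responses.append(
                gray[y - 1][x] + gray[y + 1][x]
                + gray[y][x - 1] + gray[y][x + 1]
                - 4 * gray[y][x]
            )
    # Population variance; an image with no interior pixels scores 0.0
    return statistics.pvariance(responses) if responses else 0.0
```

A uniform image produces a score of zero, while an image with strong edges produces a large score, which is why a higher value is read as sharper.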

Textract Analysis

| Field | Type | Description | Values |
|---|---|---|---|
| total_items | integer | Number of text items detected | 0+ |
| min_confidence | float | Lowest confidence score | 0.0 - 100.0 |
| max_confidence | float | Highest confidence score | 0.0 - 100.0 |
| median_confidence | float | Median confidence score | 0.0 - 100.0 |
| average_confidence | float | Average confidence score | 0.0 - 100.0 |
| std_confidence | float | Standard deviation of confidence scores | 0.0+ |
| low_confidence_count | integer | Number of items below 85% confidence | 0+ |
| low_confidence_percentage | float | Percentage of low-confidence items | 0.0 - 100.0 |
| likely_blurry | boolean | Textract-based blur assessment | true, false |
| quality_assessment | string | Overall quality rating | "excellent", "good", "fair", "poor" |
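
The numeric fields in this table can be derived from the per-item confidence scores alone. The sketch below shows one way to do that with the standard library; `summarize_confidences` is a hypothetical helper, and the use of the population standard deviation is an assumption (the service may use the sample form).

```python
import statistics

LOW_CONFIDENCE_THRESHOLD = 85.0  # items below 85% count as low confidence


def summarize_confidences(confidences):
    """Compute the numeric textract_analysis fields from per-item confidence scores."""
    if not confidences:
        return {"total_items": 0}
    low = [c for c in confidences if c < LOW_CONFIDENCE_THRESHOLD]
    return {
        "total_items": len(confidences),
        "min_confidence": min(confidences),
        "max_confidence": max(confidences),
        "median_confidence": statistics.median(confidences),
        "average_confidence": statistics.fmean(confidences),
        # Assumption: population standard deviation
        "std_confidence": statistics.pstdev(confidences),
        "low_confidence_count": len(low),
        "low_confidence_percentage": 100.0 * len(low) / len(confidences),
    }
```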

Overall Assessment

| Field | Type | Description | Values |
|---|---|---|---|
| is_blurry | boolean | Final blur detection result | true, false |
| blur_indicators | array | Methods that detected blur | [], ["textract"], ["laplacian"], ["laplacian", "textract"] |
| confidence_level | string | Confidence in the assessment | "high", "medium", "low" |

Blur Detection Algorithm

Quality Assessment Criteria

| Quality | Median Confidence | Average Confidence | Description |
|---|---|---|---|
| Excellent | > 95% | > 90% | Very high quality, clear text |
| Good | > 90% | > 85% | Good quality, readable text |
| Fair | > 85% | > 80% | Acceptable quality, mostly readable |
| Poor | ≤ 85% | ≤ 80% | Poor quality, difficult to read |
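
Read as code, the criteria might look like the following sketch. The table does not state how the median and average columns combine, so requiring both thresholds in a row to hold is an assumption, and `quality_assessment` here is an illustrative function rather than the project's implementation.

```python
def quality_assessment(median_confidence, average_confidence):
    """Map confidence statistics to a rating; both thresholds must hold (assumed)."""
    if median_confidence > 95 and average_confidence > 90:
        return "excellent"
    if median_confidence > 90 and average_confidence > 85:
        return "good"
    if median_confidence > 85 and average_confidence > 80:
        return "fair"
    return "poor"
```

Under this reading, a document with a high median but a weak average drops to the first tier whose two thresholds are both satisfied.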

Blur Detection Logic

An image is considered blurry if ANY of these conditions are met:

  1. Very low median confidence: median_confidence < 80.0
  2. Very low average confidence: average_confidence < 75.0
  3. High percentage of poor items: low_confidence_percentage > 50.0 (>50% below 85%)
  4. Extreme inconsistency with poor quality: std_confidence > 20.0 AND median_confidence < 85.0
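
The four conditions above translate directly into a single boolean expression. The function name `likely_blurry` mirrors the response field but is otherwise a hypothetical helper written for this illustration.

```python
def likely_blurry(median_confidence, average_confidence,
                  low_confidence_percentage, std_confidence):
    """Return True if any of the four documented blur conditions holds."""
    return (
        median_confidence < 80.0                                   # condition 1
        or average_confidence < 75.0                               # condition 2
        or low_confidence_percentage > 50.0                        # condition 3
        or (std_confidence > 20.0 and median_confidence < 85.0)    # condition 4
    )
```

Note that a high standard deviation alone does not flag an image; it only counts when the median confidence is also below 85.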

Confidence Level Determination

| Level | Median | Average | Low Conf % | Description |
|---|---|---|---|---|
| High | > 95% | > 90% | < 20% | Very confident assessment |
| High | > 90% | > 85% | < 35% | Confident assessment |
| Medium | > 85% | > 80% | < 50% | Moderately confident |
| Low | ≤ 85% | ≤ 80% | ≥ 50% | Low confidence assessment |
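
One possible reading of this table is a list of tiers checked from top to bottom. The table does not say how the three columns combine, so the sketch assumes all three conditions in a row must hold and that anything else is "low"; the real service may combine them differently. `CONFIDENCE_TIERS` and `confidence_level` are illustrative names.

```python
CONFIDENCE_TIERS = [
    # (median must exceed, average must exceed, low-confidence % must be below, level)
    (95.0, 90.0, 20.0, "high"),
    (90.0, 85.0, 35.0, "high"),
    (85.0, 80.0, 50.0, "medium"),
]


def confidence_level(median, average, low_pct):
    """Return the first tier whose three conditions all hold, else 'low' (assumed rule)."""
    for min_median, min_average, max_low_pct, level in CONFIDENCE_TIERS:
        if median > min_median and average > min_average and low_pct < max_low_pct:
            return level
    return "low"
```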

Blur Indicators Interpretation

| Indicators | Meaning | Confidence |
|---|---|---|
| [] | No blur detected by any method | High |
| ["textract"] | Only confidence analysis detected blur | Medium |
| ["laplacian"] | Only image analysis detected blur (local only) | Medium |
| ["laplacian", "textract"] | Both methods detected blur (local only) | Very High |
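
The sketch below shows how the two per-method results could be merged into the indicator list. It assumes that any single indicator is enough to set `is_blurry`, which the table implies but does not state; `combine_blur_results` and the `interpretation` key are hypothetical and not part of the API response.

```python
# Confidence column of the table, keyed by how many methods flagged blur
CONFIDENCE_BY_COUNT = {0: "High", 1: "Medium", 2: "Very High"}


def combine_blur_results(laplacian_blurry, textract_blurry):
    """Merge per-method results; pass laplacian_blurry=None when OpenCV is unavailable."""
    indicators = []
    if laplacian_blurry:
        indicators.append("laplacian")
    if textract_blurry:
        indicators.append("textract")
    return {
        "is_blurry": bool(indicators),  # assumption: any indicator means blurry
        "blur_indicators": indicators,
        "interpretation": CONFIDENCE_BY_COUNT[len(indicators)],
    }
```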

🔄 Auto-Detection Features

Document Category Detection

The system automatically detects document categories using AI analysis of extracted text, forms, and tables:

# Auto-detection (recommended - no --category needed)
uv run python cli.py --file document.pdf --mode tfbq

How it works:

  1. Initial Analysis: Extracts text, forms, and tables using Textract
  2. AI Classification: Uses Bedrock Claude AI to analyze content and classify document type
  3. Category Assignment: Applies detected category for queries and prompts
  4. Results Saved: Detection results saved to category_detection.json

Supported Categories:

  • idcard - Identity cards, national IDs, employee IDs
  • license - Driver's license, driving permits (combined/single-sided)
  • license-front - Front side of driver's license specifically
  • license-back - Back side of driver's license specifically
  • tnb - TNB utility bills, electricity bills
  • receipt - Purchase receipts, invoices from retail stores

Detection Confidence:

  • High confidence (0.7-1.0): Very reliable classification
  • Medium confidence (0.4-0.7): Moderately reliable
  • Low confidence (0.0-0.4): Less reliable, may need manual verification
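
When consuming `category_detection.json`, these bands can be applied with a small helper like the one sketched here. The ranges above share their boundary values, so treating 0.7 and 0.4 as belonging to the higher band is an assumption, and `detection_reliability` is a hypothetical name.

```python
def detection_reliability(confidence):
    """Bucket a detection confidence in [0.0, 1.0] into high, medium, or low."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    if confidence >= 0.7:   # boundary assigned to the higher band (assumed)
        return "high"
    if confidence >= 0.4:
        return "medium"
    return "low"
```

A caller might fall back to an explicit `--category` whenever the result is "low".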

Manual Category Override

# Specify category explicitly (skips auto-detection)
uv run python cli.py --file document.pdf --mode tfbq --category tnb

Custom Mode

Use --custom to override category-based files with your own queries/prompts:

# Custom mode with explicit queries (ignores category query files)
uv run python cli.py --file document.pdf --mode q --custom --queries "What is the date?;What is the amount?"

# Custom mode with explicit prompt (ignores category prompt files)
uv run python cli.py --file document.pdf --mode tfb --custom --prompt "Extract all dates as JSON"

# Custom mode for categories without extensive default files
uv run python cli.py --file license-back.jpeg --mode q --category license-back --queries "What is the license number?"

Custom Mode Rules:

  • If --custom is used but no custom queries/prompts are provided, the system checks for category files
  • All new categories have supporting files, but --custom can override them
  • Use custom mode to test new queries or prompts for existing categories

📚 API Reference

Local CLI Response

=== TEXT DETECTION ===
text = "Sample Text" | confidence = 99.89

=== FORM ANALYSIS ===
Key: Value pairs extracted from forms

=== TABLE ANALYSIS ===
Structured table data with rows and columns

=== QUERY ANALYSIS ===
Q: What is the transaction amount?
A: $100.00

=== BLUR DETECTION ===
Textract confidence - Median: 99.89, Avg: 99.75, Std: 0.51
Quality assessment: excellent
Overall: CLEAR (confidence: high)

=== BEDROCK EXTRACTION ===
{
  "transaction_date": "2025-09-15",
  "transaction_amount": "$100.00",
  "beneficiary_name": "John Doe"
}

Lambda API Response

{
  "status": "success",
  "console_output": "Processing log...",
  "text": [{ "text": "Sample Text", "confidence": 99.89 }],
  "forms": {
    "Key": ["Value"]
  },
  "tables": {
    "tables": [{ "table_id": 1, "rows": [["Cell1", "Cell2"]] }]
  },
  "queries": {
    "What is the amount?": "$100.00"
  },
  "category_detection": {
    "detected_category": "receipt",
    "confidence": 0.95,
    "timestamp": "2025-09-15T12:34:56+00:00"
  },
  "blur_analysis": {
    "textract_analysis": {
      "median_confidence": 99.89,
      "average_confidence": 99.62,
      "std_confidence": 0.51,
      "quality_assessment": "excellent"
    },
    "overall_assessment": {
      "is_blurry": false,
      "confidence_level": "high"
    }
  },
  "extracted_data": {
    "transaction_date": "2025-09-15",
    "transaction_amount": "$100.00"
  }
}

Error Response

{
  "error": "Error description",
  "returncode": 1,
  "stdout": "...",
  "stderr": "..."
}
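
Client code can tell the two response shapes apart by checking for the `error` key. The following is a minimal sketch under that assumption; `unwrap_response` is a hypothetical helper, and raising `RuntimeError` is just one way to surface the failure.

```python
def unwrap_response(payload):
    """Return extracted_data from a success payload, or raise with the error details."""
    if "error" in payload:
        parts = [str(payload["error"]), f"returncode={payload.get('returncode')}"]
        if payload.get("stderr"):
            parts.append(payload["stderr"])
        raise RuntimeError(" | ".join(parts))
    # Success payloads may omit extracted_data when Bedrock extraction did not run
    return payload.get("extracted_data", {})
```

With the Python example shown earlier, this would be called as `unwrap_response(response.json())`.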

🔧 Configuration

Environment Variables (.env)

Create a .env file in the project root to configure AWS credentials and API endpoints:

# AWS Credentials (choose one method)
# Method 1: Direct credentials (for development)
AWS_ACCESS_KEY_ID=your_access_key_here
AWS_SECRET_ACCESS_KEY=your_secret_key_here

# Method 2: Use AWS CLI profile (recommended for production)
# Leave AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY empty to use AWS CLI profile
# Configure with: aws configure --profile your-profile-name

# Lambda API Endpoints (optional - for testing deployed APIs)
ANALYZE_API_URL=https://your-api-id.execute-api.us-east-1.amazonaws.com/dev/analyze
ANALYZE_HEALTH_API_URL=https://your-api-id.execute-api.us-east-1.amazonaws.com/dev/health

# Runtime Configuration
BEDROCK_MODEL=amazon.nova-lite-v1:0  # Change to desired Bedrock model
LAMBDA_RUNTIME=false                 # Set to 'true' to simulate Lambda environment locally
AWS_REGION=us-east-1                 # Default AWS region
# AWS_PROFILE=default                # AWS CLI profile to use

Environment Variable Details:

| Variable | Required | Description | Example |
|---|---|---|---|
| AWS_ACCESS_KEY_ID | ✅ (unless using an AWS CLI profile) | AWS access key for authentication | AKIAIOSFODNN7EXAMPLE |
| AWS_SECRET_ACCESS_KEY | ✅ (unless using an AWS CLI profile) | AWS secret key for authentication | wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY |
| ANALYZE_API_URL | ❌ | Lambda API endpoint for document analysis | https://abc123.execute-api.us-east-1.amazonaws.com/dev/analyze |
| ANALYZE_HEALTH_API_URL | ❌ | Lambda API health check endpoint | https://abc123.execute-api.us-east-1.amazonaws.com/dev/health |
| LAMBDA_RUNTIME | ❌ | Simulate Lambda environment locally | true or false (default: false) |
| AWS_REGION | ❌ | AWS region for services | us-east-1, us-west-2, etc. |
| AWS_PROFILE | ❌ | AWS CLI profile name | default, dev, prod |

Setup Instructions:

  1. Copy the template above to create your .env file
  2. Choose authentication method:
    • Direct credentials: Fill in AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
    • AWS CLI profile: Leave credentials empty, set AWS_PROFILE to your profile name
  3. Configure API URLs (only needed for testing deployed Lambda functions)
  4. Set region and other optional variables as needed

Security Notes:

  • Never commit .env to version control (it's already in .gitignore)
  • Use AWS CLI profiles for production environments
  • Rotate credentials regularly and use IAM roles when possible
  • Use least-privilege permissions (see AWS Permissions Required below)

AWS Permissions Required

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "textract:DetectDocumentText",
        "textract:AnalyzeDocument",
        "bedrock:InvokeModel"
      ],
      "Resource": "*"
    }
  ]
}

Supported File Types

  • PDF: Up to 11 pages, max 5 MB
  • JPEG/JPG: Max 5 MB
  • PNG: Max 5 MB
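
A pre-flight check before uploading could look like the sketch below. It covers only the extension and size limits listed above (not the PDF page count, which needs a PDF parser), it reads "5 MB" as 5 MiB, and `validate_upload` is a hypothetical helper rather than part of this project.

```python
from pathlib import Path

MAX_FILE_BYTES = 5 * 1024 * 1024  # "5 MB" read as 5 MiB; an assumption
SUPPORTED_EXTENSIONS = {".pdf", ".jpeg", ".jpg", ".png"}


def validate_upload(filename, size_bytes):
    """Return a list of problems; an empty list means the file passes these checks."""
    problems = []
    ext = Path(filename).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported file type: {ext or '(none)'}")
    if size_bytes > MAX_FILE_BYTES:
        problems.append(f"file exceeds 5 MB: {size_bytes} bytes")
    return problems
```

In practice the size would come from `Path(path).stat().st_size` for a local file.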

Lambda Limitations

  • Request Size: 6 MB (affects base64 file uploads)
  • Timeout: 5 minutes maximum
  • Memory: Configurable up to 10 GB
  • Blur Detection: Uses Textract confidence analysis (no OpenCV)
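
The request size limit matters because base64 encoding inflates a file by roughly one third. The sketch below does the arithmetic; the 1 KB allowance for the rest of the JSON body is an arbitrary assumption, the limit is read as 6 MiB, and both function names are hypothetical.

```python
REQUEST_LIMIT_BYTES = 6 * 1024 * 1024  # 6 MB request limit, read as 6 MiB (assumption)


def base64_length(raw_bytes):
    """Length of the padded base64 encoding of raw_bytes bytes."""
    # Every 3 input bytes (rounded up) become 4 output characters
    return 4 * ((raw_bytes + 2) // 3)


def fits_in_request(raw_bytes, overhead_bytes=1024):
    """True if the encoded file plus a rough JSON-envelope allowance fits the limit."""
    return base64_length(raw_bytes) + overhead_bytes <= REQUEST_LIMIT_BYTES
```

Under these assumptions a 4 MiB file fits, but a file at the 5 MB per-file limit encodes to roughly 6.7 MiB and does not, so the effective upload ceiling through the API is lower than the per-file limit.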

📝 Advanced Usage

Custom Queries (--queries)

Provide custom questions for Textract to answer about the document:

# Single query
uv run python cli.py --file document.pdf --mode q --queries "What is the total amount?"

# Multiple queries (semicolon or newline separated)
uv run python cli.py --file document.pdf --mode q --queries "What is the date?;What is the amount?;Who is the recipient?"

# Multiline format
uv run python cli.py --file document.pdf --mode q --queries "What is the transaction date?
What is the reference number?
What is the beneficiary name?"
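
The separator rule for --queries (semicolons or newlines) can be sketched in a few lines. `split_queries` is a hypothetical helper shown for illustration; dropping empty entries and trimming whitespace are assumptions about how the CLI normalizes its input.

```python
import re


def split_queries(raw):
    """Split a --queries value on semicolons or newlines, dropping empty entries."""
    return [q.strip() for q in re.split(r"[;\n]", raw) if q.strip()]
```

Both formats shown above yield the same list, so they can be mixed freely in one value.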

Query Best Practices:

  • Ask specific, direct questions
  • Use clear, simple language
  • Questions should be answerable from visible text
  • Avoid overly complex or interpretive questions

Custom Prompts (--prompt)

Provide custom instructions for Bedrock AI to extract structured data:

# Basic extraction
uv run python cli.py --file document.pdf --mode tfb --prompt "Extract all monetary amounts and dates as JSON"

# Structured JSON extraction
uv run python cli.py --file receipt.pdf --mode tfb --prompt "Extract: {\"merchant\": \"store name\", \"total\": \"amount as number\", \"date\": \"YYYY-MM-DD format\"}"

# Bank receipt extraction
uv run python cli.py --file bank-receipt.pdf --mode tfb --prompt "Extract transaction details: amount, date, beneficiary name, reference ID as JSON"

Prompt Guidelines:

  • Specify desired output format (JSON recommended)
  • Define field names and data types
  • Include formatting instructions (date formats, etc.)
  • Specify how to handle missing data (use null)

Argument Combinations

# Auto-detection with custom queries
uv run python cli.py --file document.pdf --mode tfbq --queries "Additional question?"

# Explicit category with custom prompt
uv run python cli.py --file document.pdf --mode tfb --category receipt --prompt "Custom extraction prompt"

# Custom mode overriding category files
uv run python cli.py --file document.pdf --mode tfbq --category license --custom --queries "Custom questions" --prompt "Custom prompt"

# Bank receipt with required custom content
uv run python cli.py --file bank-receipt.pdf --mode q --category receipt --custom --queries "What is the transaction amount?"

API Response Structure

{
  "status": "success",
  "console_output": "Processing log...",
  "text": [{"text": "...", "confidence": 99.5}],
  "forms": {"key": "value"},
  "tables": [{"headers": [], "rows": []}],
  "queries": {"question": "answer"},
  "blur_analysis": {
    "laplacian": {"score": 4743.32, "is_blurry": false, "quality": "sharp"},
    "textract_analysis": {"median_confidence": 96.84, "quality_assessment": "excellent"},
    "overall_assessment": {"is_blurry": false, "confidence_level": "high"}
  },
  "extracted_data": {"field": "value"}
}

Environment Variables

export AWS_REGION=us-east-1
export AWS_PROFILE=default

Testing Auto-Detection

# Run the test script
uv run python test_auto_detection.py

# Manual testing
uv run python cli.py --file media/license.jpeg --mode tfbq
# Check log/{filename}_{timestamp}/category_detection.json for results

📝 Developer Guide: Custom Queries and Prompts

Writing Custom Queries

Queries are questions that Textract will attempt to answer based on the document content.

Query Best Practices:

  1. Be Specific: Ask for exact information you need

    ✅ Good: "What is the expiry date?"
    ❌ Avoid: "What are the dates?"

  2. Use Clear Language: Simple, direct questions work best

    ✅ Good: "What is the full name?"
    ✅ Good: "What is the license class?"
    ✅ Good: "What is the address?"

  3. Avoid Duplicates: Don't repeat queries from category files

    # Check existing queries first
    cat src/queries/license.txt

  4. Format Correctly: Separate multiple queries with semicolons or new lines

    # Using semicolons
    --queries "What is the license number?;What is the transaction amount?;What is the account number?"

    # Using new lines (in scripts or multi-line input)
    --queries "What is the license number?
    What is the transaction amount?
    What is the account number?"

Actual Query Examples by Document Type:

Driver's License (src/queries/license.txt):

What is the date of birth?
What is the expiry date?
What is the license validity period?
What is the license number?
What is the string below license number?
What is the license class?
What is the address?

License Front (src/queries/license-front.txt):

what is the identity No.?
What is the date of birth?
what is the nationality?
What is the license class?
What is the license validity period?
What is the address?

License Back (src/queries/license-back.txt):

What is the license number?

Receipt (src/queries/receipt.txt):

Who is the beneficiary?
What is the beneficiary account number?
Which bank is receiving the payment?
What is the recipient reference?
What is the reference ID?
What are the payment details / description?
What is the transaction amount?
When was the transaction successfully completed?

ID Card (src/queries/idcard.txt):

What is the full name?
What is the ID number?
What is the address?
What is the gender?

TNB Bill (src/queries/tnb.txt):

What is the No. Akaun (account number)?
What is the No. Invois (invoice number)?

Writing Custom Prompts

Prompts are used by Bedrock AI for structured data extraction in src/prompts/{category}.txt.

Actual Prompt Examples:

Driver's License (src/prompts/license.txt):

Extract Malaysian driving license fields from the provided data.

Return STRICTLY valid JSON matching this schema:
{
  "full_name": string|null,
  "identity_no": string|null,
  "license_number": string|null,
}

CRITICAL RULES:
- ONLY extract data that is EXPLICITLY present in the input
- DO NOT make up or guess any values
- If a field is not found, use null
- Full name: ONLY if explicitly found in the data
- Identity number: ONLY from "No. Pengenalan / Identity No." field
- license number: ONLY a combination of 2 parts
  * first part, 7-digit numeric codes that are clearly license numbers (NOT dates, NOT identity numbers), e.g. "1234567"
  * second part, 8-digit alphanumeric codes that are a randomised mix of upper/lowercase letters and/or numbers, e.g. "AbC12xYz"
  * join the two parts with a space in between, e.g. "1234567 AbC12xYz"
- Return only valid JSON, no explanations

Receipt (src/prompts/receipt.txt):

Extract the following information from this Payment Receipt text and return as JSON:

{
  "transaction_amount": "Transaction amount as displayed",
  "transaction_type": "Transaction type as printed",
  "merchant": "Merchant name as printed",
  "payment_method": "Payment method as printed",
  "date_time": "Date and time of transaction",
  "wallet_reference": "Wallet reference number as printed",
  "transaction_status": "Transaction status as printed",
  "transaction_number": "Transaction number as printed"
}

Rules:
- Return only valid JSON.
- Use the exact text as printed, don't interpret.
- If any field is missing, return null instead of skipping.
- For "transaction_amount", always return a positive value.

ID Card (src/prompts/idcard.txt):

Extract the following information from this ID card text and return as JSON:

{
  "full_name": "Full name of the ID holder",
  "userId": "ID card number",
  "gender": "Gender if available",
  "address": "Full address if available"
}

Rules:
- Use null for missing information
- Extract exact text, don't interpret
- Return only valid JSON

TNB Bill (src/prompts/tnb.txt):

Extract the following information from this TNB Bill text and return as JSON:

{
  "account_number": "Exact value of No. Akaun as printed on the bill",
  "invoice_number": "Exact value of No. Invois as printed on the bill"
}

Rules:
- Use null if No. Akaun or No. Invois is not found
- Extract exact text, don't interpret or reformat
- Return only valid JSON
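
Because every prompt above asks for JSON with null for missing fields, the model's reply can be normalised before use. The sketch below is one possible approach under that assumption; parse_extraction is a hypothetical helper, not the repository's actual parser.

```python
import json


def parse_extraction(reply: str, expected_keys: list[str]) -> dict:
    """Parse a model reply into a dict with exactly the expected keys (missing -> None)."""
    # Take the span from the first '{' to the last '}' in case the model adds extra text.
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    data = json.loads(reply[start:end + 1])
    # Drop unexpected keys and fill absent ones with None (JSON null).
    return {key: data.get(key) for key in expected_keys}
```

For the TNB prompt, expected_keys would be ["account_number", "invoice_number"].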

Testing Custom Queries and Prompts:

# Custom queries only
uv run python cli.py --file document.pdf --mode q --queries "Your question?"

# Category + custom queries
uv run python cli.py --file document.pdf --mode tfbq --category license --queries "Additional question?"

# Custom prompt for AI extraction
uv run python cli.py --file document.pdf --mode tfb --prompt "Extract specific fields as JSON"

# Via Lambda API with custom prompt
uv run python test_lambda.py --file document.pdf --prompt "Your custom prompt" --api-url YOUR_URL

Custom Prompt Engineering

The --prompt parameter allows you to override category-based prompts for Bedrock AI extraction, enabling rapid prototyping and adaptation to new document types.

Custom Prompt Examples:

Simple Extraction:

--prompt "Extract the name, date, and amount from this document and return as JSON."

Structured JSON Output:

--prompt "Extract the following information and return as JSON:
{
  \"document_type\": \"type of document\",
  \"issuer\": \"issuing organization\",
  \"recipient\": \"recipient name\",
  \"date_issued\": \"date in YYYY-MM-DD format\",
  \"amount\": \"monetary amount as number\",
  \"reference_number\": \"reference or ID number\"
}"

Receipt Analysis:

--prompt "Analyze this receipt and extract:
{
  \"merchant\": \"store name\",
  \"date\": \"transaction date\",
  \"total\": \"total amount\",
  \"tax\": \"tax amount\",
  \"items\": [\"list of purchased items\"]
}
Return only valid JSON."

Invoice Processing:

--prompt "Extract invoice details as JSON:
{
  \"invoice_number\": \"invoice ID\",
  \"vendor\": \"vendor name\",
  \"customer\": \"customer name\",
  \"date\": \"invoice date\",
  \"due_date\": \"payment due date\",
  \"subtotal\": \"subtotal amount\",
  \"tax_rate\": \"tax percentage\",
  \"total\": \"total amount\"
}"

Prompt Best Practices:

  1. Specify Output Format: Always request JSON for structured data
  2. Define Field Names: Use clear, consistent field names
  3. Handle Missing Data: Instruct to use null for missing information
  4. Format Guidelines: Specify date formats, number formats, etc.
  5. Validation Rules: Add constraints for better accuracy
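
The practices above can be applied mechanically when prototyping prompts for new document types. The sketch below assembles a prompt from a field-to-description mapping; build_prompt and its fixed rule list are illustrative assumptions, not something the CLI provides.

```python
import json


def build_prompt(document_label: str, fields: dict[str, str]) -> str:
    """Compose an extraction prompt from a mapping of field name -> description."""
    # json.dumps guarantees the template itself is valid JSON (no trailing commas).
    template = json.dumps(fields, indent=2, ensure_ascii=False)
    rules = [
        "Return only valid JSON, no explanations.",
        "Use null for any field that is not found.",
        "Extract exact text as printed; do not guess.",
    ]
    header = f"Extract the following information from this {document_label} and return as JSON:"
    return f"{header}\n\n{template}\n\nRules:\n" + "\n".join(f"- {rule}" for rule in rules)
```

The resulting string can be supplied as the value of --prompt.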

🐛 Troubleshooting

Common Issues

Local Development

# Module not found errors
uv sync

# AWS credentials not configured
aws configure

# Permission denied errors
aws sts get-caller-identity

Lambda Deployment

# Serverless deployment fails
npm install -g serverless serverless-python-requirements

# Function timeout
# Increase timeout in serverless.yml or use smaller files

# Memory errors
# Increase memory allocation in serverless.yml

API Testing

# Test local Lambda function
uv run python local_test.py

# Test deployed API
uv run python test_lambda.py --api-url YOUR_URL --file media/license.jpeg --mode t

# Find the Lambda function's CloudWatch log groups
aws logs describe-log-groups --log-group-name-prefix /aws/lambda/document-ingestion-and-text-extraction-api

File Size Issues

  • Local: No limit imposed by the CLI; AWS service limits still apply
  • Lambda: 6 MB request limit, which applies to the base64-encoded payload (~4.5 MB original file)
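
The ~4.5 MB figure follows from base64 turning every 3 input bytes into 4 output characters. A small pre-flight check along these lines can flag oversized files before upload; it is a sketch that assumes the limit means 6 × 1024 × 1024 bytes and reserves a guessed 1 KB for the surrounding JSON fields.

```python
LAMBDA_REQUEST_LIMIT = 6 * 1024 * 1024  # assumed byte value of the 6 MB request limit


def base64_size(raw_bytes: int) -> int:
    """Length in bytes of the padded base64 encoding of raw_bytes bytes of input."""
    return (raw_bytes + 2) // 3 * 4


def fits_lambda(raw_bytes: int, overhead: int = 1024) -> bool:
    """True if the encoded file plus an assumed JSON overhead stays within the limit."""
    return base64_size(raw_bytes) + overhead <= LAMBDA_REQUEST_LIMIT
```

The raw size can be obtained with os.path.getsize(path) before encoding the file.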

Performance Tips

  • Use --mode t for fastest processing (text only)
  • Smaller files process faster
  • Lambda has a cold-start delay (~1-3 seconds)

📊 Comparison: Local vs Lambda

| Feature | Local CLI | Lambda API |
|---|---|---|
| Deployment | No deployment needed | Serverless deployment |
| Scaling | Single instance | Auto-scaling |
| File Upload | Direct file path | Base64 in JSON |
| AWS Credentials | Local AWS config | IAM role |
| Blur Detection | Full OpenCV analysis | Textract confidence analysis + API field |
| Timeout | No limit | 5 minutes |
| File Size | AWS service limits | 6 MB request limit |
| Cost | Compute + AWS services | Lambda + AWS services |
| Cold Start | None | 1-3 seconds |

🀝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Test both local and Lambda versions
  5. Submit a pull request

📄 License

This project is licensed under the MIT License.


Happy Document Analysis! 🎉
