A comprehensive document analysis tool that combines AWS Textract, Bedrock, and intelligent blur detection. Available as both CLI and serverless Lambda API.
© 2025 Goodbye World team, built for the Great AI Hackathon Malaysia 2025.
- Python 3.10+
- uv package manager
- AWS CLI configured with appropriate permissions
# Clone and setup
cd document-ingestion-and-text-extraction
uv sync

# Activate the virtual environment
.venv\Scripts\activate
# Check which Python is being used
uv run python --version
# Show environment information
uv run python -c "import sys; print(sys.executable)"
# Create a new virtual environment (if needed)
uv venv
# Sync dependencies
uv sync
# Sync with active environment (ignores conflicting VIRTUAL_ENV)
uv sync --active

# Analyze a driver's license locally
uv run python cli.py --file media/license.jpeg --mode tfbq --category license
# Test Lambda function locally
uv run python local_test.py
# Test deployed Lambda API
uv run python test_lambda.py --api-url YOUR_API_URL --file media/license.jpeg --mode tfbq --category license
# Test deployed Lambda API with category auto-detection
uv run python test_lambda.py --api-url YOUR_API_URL --file media/license.jpeg

- Local CLI Usage
- Lambda API
- Features
- Project Structure
- Quick Reference & Commands
- API Usage Examples
- Blur Analysis API Field
- Auto-Detection Features
- API Reference
- Configuration
- Advanced Usage
- Developer Guide: Custom Queries and Prompts
- Troubleshooting
- Comparison: Local vs Lambda
- Contributing
# Basic syntax
uv run python cli.py --file <path> --mode <mode> [options]

| Argument | Description | Default | Required |
|---|---|---|---|
| `--file` | Path to input file (JPEG/PNG/PDF) | - | ✅ |
| `--mode` | Analysis mode: t(ext), f(orms), b(tables), q(uery) | `tfbq` | ❌ |
| `--category` | Document type: `idcard`, `license`, `license-front`, `license-back`, `tnb`, `receipt` (auto-detected if not provided) | - | ❌ |
| `--queries` | Custom queries separated by semicolons or newlines | - | ❌ |
| `--prompt` | Custom prompt for Bedrock AI extraction | - | ❌ |
| `--custom` | Use custom queries/prompts even if category files exist | `False` | ❌ |
| `--region` | AWS region | `us-east-1` | ❌ |
| `--profile` | AWS profile name | `default` | ❌ |
# Full analysis with auto-detection (no category needed)
uv run python cli.py --file media/license.jpeg --mode tfbq --region us-east-1
# Full analysis of a driver's license (explicit category)
uv run python cli.py --file media/license.jpeg --mode tfbq --category license --region us-east-1
# TNB utility bill analysis
uv run python cli.py --file media/tnb-bill.pdf --mode tfbq --category tnb --region us-east-1
# License front side analysis
uv run python cli.py --file media/license-front.jpeg --mode tfbq --category license-front --region us-east-1
# Text extraction only with blur detection
uv run python cli.py --file media/license.jpeg --mode t --region us-east-1
# Forms and tables analysis
uv run python cli.py --file media/license.jpeg --mode fb --region us-east-1
# Auto-detection with custom queries/prompts
uv run python cli.py --file media/license.jpeg --mode tfbq --custom --queries "What is the issuing authority?" --region us-east-1

# Install dependencies
npm install -g serverless serverless-python-requirements
# Change to your desired model
set BEDROCK_MODEL=amazon.nova-lite-v1:0
# Deploy to AWS
serverless deploy --region us-east-1

# Create deployment package
python deploy_lambda.py --function-name document-ingestion-and-text-extraction-api --region us-east-1

# Test Lambda function locally (simulates Lambda environment)
uv run python local_test.py
# Test health endpoint
uv run python local_test.py --health

# Test deployed API
uv run python test_lambda.py --api-url https://your-api-id.execute-api.us-east-1.amazonaws.com/dev/analyze --file media/receipt.pdf --mode tfbq --category receipt
# Create web test interface
uv run python test_lambda.py --create-html
# Then open test_lambda.html in a browser

By default, the CLI runs in local mode with OpenCV support for blur detection. To simulate the Lambda environment without OpenCV, set the following environment variable:

set LAMBDA_RUNTIME=false

For Lambda-like behavior (Textract confidence-based blur detection only), set:

set LAMBDA_RUNTIME=true

curl -X POST https://your-api-id.execute-api.us-east-1.amazonaws.com/dev/analyze \
-H "Content-Type: application/json" \
-d '{
"file_content": "<base64-encoded-file-content>",
"filename": "document.pdf",
"mode": "tfbq",
"custom": false,
"region": "us-east-1"
}'

Note: On Windows, this may fail with `The input line is too long.` if the base64 payload exceeds the command-line length limit.

Or use the Python script:

uv run python test_api.py

Note: Update `base64_content` in `test_api.py` with your base64-encoded file content.

curl https://your-api-id.execute-api.us-east-1.amazonaws.com/dev/health

- Text Detection: Extract text with confidence scores
- Form Analysis: Key-value pair extraction
- Table Analysis: Structured table data extraction
- Query Analysis: Answer specific questions about documents
- Auto Category Detection: Automatically detect document type using AI
- Local: OpenCV Laplacian variance + Textract confidence analysis
- Lambda: Enhanced Textract confidence analysis with statistical metrics
- Metrics: Average, median, and standard deviation of OCR confidence
- Quality Assessment: Excellent, good, fair, or poor ratings
- API Integration: Structured `blur_analysis` field in Lambda responses
- Structured Extraction: Convert documents to structured JSON
- Document Categories: Specialized prompts for different document types
- Auto Category Detection: AI-powered document classification
- Custom Mode: Override category-based prompts and queries
- AI-Powered: Uses Claude AI for intelligent data extraction
- Local CLI: Full-featured command-line interface
- Lambda API: Serverless REST API with automatic scaling
- Consistent Results: Same analysis quality in both environments
document-ingestion-and-text-extraction/
├── src/                          # Core source code
│   ├── __init__.py               # Package initialization
│   ├── main.py                   # Main CLI logic
│   ├── textract_enhanced.py      # Textract integration
│   ├── bedrock_mapper.py         # Bedrock integration
│   ├── category_detector.py      # Auto-detection logic
│   ├── blur_detection.py         # Blur detection logic
│   ├── logger.py                 # Logging utilities
│   ├── sample_response.json      # Sample API response for reference
│   ├── prompts/                  # Bedrock prompts
│   │   ├── idcard.txt
│   │   ├── license.txt
│   │   ├── license-front.txt
│   │   ├── license-back.txt
│   │   ├── receipt.txt
│   │   └── tnb.txt
│   └── queries/                  # Textract queries
│       ├── idcard.txt
│       ├── license.txt
│       ├── license-front.txt
│       ├── license-back.txt
│       ├── receipt.txt
│       └── tnb.txt
├── media/                        # Sample test files
│   ├── blur.jpg                  # Blurry test image
│   ├── exceed-5mb.pdf            # Large file test
│   ├── exceed-pages.pdf          # Multi-page test
│   ├── half-blur.jpg             # Partially blurred image
│   ├── license.jpeg              # Driver's license sample
│   ├── mingjia-license.jpg       # License sample
│   ├── receipt.pdf               # Receipt sample
│   ├── tnb.png                   # TNB utility bill sample
│   └── unsupported-file-type.xlsx # Unsupported format test
├── log/                          # Local analysis results
│   └── {filename}_{timestamp}/   # Individual analysis logs
│       ├── textract.log          # Complete processing log
│       ├── text.json             # Text detection results
│       ├── forms.json            # Form analysis results
│       ├── tables.json           # Table analysis results
│       ├── queries.json          # Query analysis results
│       ├── blur_analysis.json    # Blur detection results
│       └── category_detection.json # Auto-detection results
├── output/                       # Extracted structured data
├── .env                          # Environment variables (not tracked)
├── .gitignore                    # Git ignore patterns
├── api.py                        # Standalone API server
├── cli.py                        # CLI entry point
├── lambda_handler.py             # Lambda function handler
├── local_test.py                 # Local Lambda testing
├── test_api.py                   # API testing with requests
├── test_lambda.py                # Lambda API testing
├── test_lambda.html              # Web-based Lambda testing interface (generate with `python test_lambda.py --create-html`)
├── deploy_lambda.py              # Deployment script
├── serverless.yml                # Serverless Framework config
├── package.json                  # Node.js dependencies for Serverless
├── package-lock.json             # Node.js lock file
├── pyproject.toml                # Project dependencies
├── requirements.txt              # Lambda-specific dependencies
├── uv.lock                       # UV package manager lock file
└── README.md                     # This file
# Basic Setup
uv sync # Install/update dependencies
# Local Analysis (Recommended)
uv run python cli.py --file media/license.jpeg --mode tfbq # Auto-detect + full analysis
uv run python cli.py --file media/receipt.pdf --mode tfbq # Receipt analysis
uv run python cli.py --file media/document.pdf --mode t # Text extraction only
# Lambda Testing
uv run python local_test.py # Test Lambda locally
serverless deploy --region us-east-1 # Deploy to AWS
uv run python test_lambda.py --api-url YOUR_URL --file media/license.jpeg

| Mode | Description | Use Case | Speed |
|---|---|---|---|
| `t` | Text detection only | Quick text extraction | ⚡⚡⚡ |
| `f` | Forms analysis | Key-value pairs | ⚡⚡ |
| `b` | Tables analysis | Structured table data | ⚡⚡ |
| `q` | Query analysis | Answer specific questions | ⚡ |
| `tfbq` | All analysis types | Complete document analysis (recommended) | ⚡ |
| Category | Documents | Auto-Detect | Manual Specify |
|---|---|---|---|
| Auto | All supported documents | ✅ Recommended | `--mode tfbq` |
| `license` | Driver's licenses (any side) | ✅ | `--category license` |
| `license-front` | License front side only | ✅ | `--category license-front` |
| `license-back` | License back side only | ✅ | `--category license-back` |
| `idcard` | ID cards, national IDs | ✅ | `--category idcard` |
| `receipt` | Purchase receipts | ✅ | `--category receipt` |
| `tnb` | TNB utility bills | ✅ | `--category tnb` |
# Quick Analysis (Most Common)
uv run python cli.py --file document.pdf --mode tfbq # Full auto-analysis
uv run python cli.py --file document.pdf --mode t # Text only (fastest)
# Specific Document Types
uv run python cli.py --file license.jpg --mode tfbq --category license
uv run python cli.py --file receipt.pdf --mode tfbq --category receipt
uv run python cli.py --file bill.pdf --mode tfbq --category tnb
# Custom Analysis
uv run python cli.py --file document.pdf --mode q --queries "What is the date?;What is the amount?"
uv run python cli.py --file document.pdf --mode tfb --prompt "Extract as JSON: name, date, amount"
# API Testing
uv run python local_test.py # Test locally
uv run python test_lambda.py --create-html                 # Create web interface

// File upload and analysis
const fileInput = document.getElementById('file');
const file = fileInput.files[0];
const reader = new FileReader();
reader.onload = async function(e) {
const fileContent = e.target.result.split(',')[1];
const response = await fetch('https://your-api-id.execute-api.us-east-1.amazonaws.com/dev/analyze', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
file_content: fileContent,
filename: file.name,
mode: 'tfbq',
custom: false,
region: 'us-east-1'
})
});
const result = await response.json();
// Access blur analysis
if (result.blur_analysis) {
const blur = result.blur_analysis;
const textract = blur.textract_analysis;
const overall = blur.overall_assessment;
console.log(`Quality: ${textract.quality_assessment}`);
console.log(`Is Blurry: ${overall.is_blurry}`);
console.log(`Confidence: ${overall.confidence_level}`);
console.log(`Median Confidence: ${textract.median_confidence.toFixed(2)}%`);
// Quality-based processing
if (textract.quality_assessment === 'excellent') {
console.log('High quality image - proceed with confidence');
} else if (overall.is_blurry) {
console.log('Blurry image detected - results may be less accurate');
}
}
};
reader.readAsDataURL(file);

import requests
import base64
# Read and encode file
with open('media/license.jpeg', 'rb') as f:
file_content = base64.b64encode(f.read()).decode('utf-8')
# Make API call
response = requests.post(
'https://your-api-id.execute-api.us-east-1.amazonaws.com/dev/analyze',
json={
'file_content': file_content,
'filename': 'license.jpeg',
'mode': 'tfbq',
'custom': False,
'region': 'us-east-1'
}
)
result = response.json()
# Access blur analysis
if 'blur_analysis' in result:
blur_info = result['blur_analysis']
print(f"Quality: {blur_info['textract_analysis']['quality_assessment']}")
print(f"Is Blurry: {blur_info['overall_assessment']['is_blurry']}")
print(f"Confidence: {blur_info['overall_assessment']['confidence_level']}")

# Encode file to base64
FILE_CONTENT=$(base64 -w 0 media/license.jpeg)
# Make API call
curl -X POST https://your-api-id.execute-api.us-east-1.amazonaws.com/dev/analyze \
-H "Content-Type: application/json" \
-d "{
\"file_content\": \"$FILE_CONTENT\",
\"filename\": \"license.jpeg\",
\"mode\": \"tfbq\",
\"custom\": false,
\"region\": \"us-east-1\"
}"

The Lambda API now includes a dedicated `blur_analysis` field that provides comprehensive image quality assessment:
{
"blur_analysis": {
"laplacian": {
"method": "laplacian",
"score": 4743.317898724083,
"is_blurry": false,
"quality": "sharp"
},
"textract_analysis": {
"total_items": 53,
"min_confidence": 34.21656036376953,
"max_confidence": 99.990234375,
"median_confidence": 96.84188079833984,
"average_confidence": 95.69545777638753,
"std_confidence": 11.519045489650455,
"low_confidence_count": 22,
"low_confidence_percentage": 41.509433962264154,
"likely_blurry": false,
"quality_assessment": "excellent"
},
"overall_assessment": {
"is_blurry": false,
"blur_indicators": [],
"confidence_level": "high"
}
}
}

| Field | Type | Description | Values |
|---|---|---|---|
| `method` | string | Analysis method used | `"laplacian"` |
| `score` | float | Laplacian variance score | 0.0+ (higher = sharper) |
| `is_blurry` | boolean | Laplacian-based blur detection | `true`, `false` |
| `quality` | string | Laplacian-based quality assessment | `"sharp"`, `"moderate"`, `"blurry"` |
| Field | Type | Description | Values |
|---|---|---|---|
| `total_items` | integer | Number of text items detected | 0+ |
| `min_confidence` | float | Lowest confidence score | 0.0 - 100.0 |
| `max_confidence` | float | Highest confidence score | 0.0 - 100.0 |
| `median_confidence` | float | Median confidence score | 0.0 - 100.0 |
| `average_confidence` | float | Average confidence score | 0.0 - 100.0 |
| `std_confidence` | float | Standard deviation of confidence scores | 0.0+ |
| `low_confidence_count` | integer | Number of items below 85% confidence | 0+ |
| `low_confidence_percentage` | float | Percentage of low-confidence items | 0.0 - 100.0 |
| `likely_blurry` | boolean | Textract-based blur assessment | `true`, `false` |
| `quality_assessment` | string | Overall quality rating | `"excellent"`, `"good"`, `"fair"`, `"poor"` |
| Field | Type | Description | Values |
|---|---|---|---|
| `is_blurry` | boolean | Final blur detection result | `true`, `false` |
| `blur_indicators` | array | Methods that detected blur | `[]`, `["textract"]`, `["laplacian"]`, `["laplacian", "textract"]` |
| `confidence_level` | string | Confidence in the assessment | `"high"`, `"medium"`, `"low"` |
| Quality | Median Confidence | Average Confidence | Description |
|---|---|---|---|
| Excellent | > 95% | > 90% | Very high quality, clear text |
| Good | > 90% | > 85% | Good quality, readable text |
| Fair | > 85% | > 80% | Acceptable quality, mostly readable |
| Poor | ≤ 85% | ≤ 80% | Poor quality, difficult to read |
An image is considered blurry if ANY of these conditions are met:
- Very low median confidence: `median_confidence < 80.0`
- Very low average confidence: `average_confidence < 75.0`
- High percentage of poor items: `low_confidence_percentage > 50.0` (more than 50% of items below 85% confidence)
- Extreme inconsistency with poor quality: `std_confidence > 20.0 AND median_confidence < 85.0`
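For reference, the quality ratings and blur criteria above boil down to a few comparisons. The sketch below re-implements them exactly as documented; it is illustrative only, not the project's actual code (which lives in `src/blur_detection.py`):

```python
import statistics

def assess_textract_quality(confidences, low_threshold=85.0):
    """Recompute quality_assessment and likely_blurry from a list of Textract confidence scores."""
    median = statistics.median(confidences)
    average = statistics.mean(confidences)
    std = statistics.pstdev(confidences)
    low_pct = 100.0 * sum(c < low_threshold for c in confidences) / len(confidences)

    # Quality rating thresholds from the table above
    if median > 95 and average > 90:
        quality = "excellent"
    elif median > 90 and average > 85:
        quality = "good"
    elif median > 85 and average > 80:
        quality = "fair"
    else:
        quality = "poor"

    # Blur criteria: any one condition marks the image as likely blurry
    likely_blurry = (
        median < 80.0
        or average < 75.0
        or low_pct > 50.0
        or (std > 20.0 and median < 85.0)
    )
    return {"quality_assessment": quality, "likely_blurry": likely_blurry}
```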
| Level | Median | Average | Low Conf % | Description |
|---|---|---|---|---|
| High | > 95% | > 90% | < 20% | Very confident assessment |
| High | > 90% | > 85% | < 35% | Confident assessment |
| Medium | > 85% | > 80% | < 50% | Moderately confident |
| Low | ≤ 85% | ≤ 80% | ≥ 50% | Low confidence assessment |
| Indicators | Meaning | Confidence |
|---|---|---|
| `[]` | No blur detected by any method | High |
| `["textract"]` | Only confidence analysis detected blur | Medium |
| `["laplacian"]` | Only image analysis detected blur (local only) | Medium |
| `["laplacian", "textract"]` | Both methods detected blur (local only) | Very High |
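The `laplacian` indicator comes from OpenCV's variance-of-Laplacian measure, which is only available in local mode. A minimal sketch is shown below; the threshold is illustrative, not the project's tuned value:

```python
import cv2

def laplacian_blur_check(image_path: str, threshold: float = 100.0) -> dict:
    """Low variance of the Laplacian means few sharp edges, i.e. the image is likely blurry."""
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    score = float(cv2.Laplacian(image, cv2.CV_64F).var())
    return {
        "method": "laplacian",
        "score": score,
        "is_blurry": score < threshold,
        "quality": "sharp" if score >= threshold else "blurry",
    }
```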
The system automatically detects document categories using AI analysis of extracted text, forms, and tables:
# Auto-detection (recommended - no --category needed)
uv run python cli.py --file document.pdf --mode tfbq

How it works:
- Initial Analysis: Extracts text, forms, and tables using Textract
- AI Classification: Uses Bedrock Claude AI to analyze content and classify document type
- Category Assignment: Applies detected category for queries and prompts
- Results Saved: Detection results saved to `category_detection.json`
Supported Categories:
- `idcard` - Identity cards, national IDs, employee IDs
- `license` - Driver's license, driving permits (combined/single-sided)
- `license-front` - Front side of driver's license specifically
- `license-back` - Back side of driver's license specifically
- `tnb` - TNB utility bills, electricity bills
- `receipt` - Purchase receipts, invoices from retail stores
Detection Confidence:
- High confidence (0.7-1.0): Very reliable classification
- Medium confidence (0.4-0.7): Moderately reliable
- Low confidence (0.0-0.4): Less reliable, may need manual verification
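For orientation, a classification call of this kind might look like the sketch below. The prompt wording and the use of the Bedrock Converse API are assumptions for illustration; the real logic lives in `src/category_detector.py`:

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def detect_category(document_text: str, model_id: str = "amazon.nova-lite-v1:0") -> dict:
    """Ask the Bedrock model to classify extracted text into one of the supported categories."""
    prompt = (
        "Classify this document as one of: idcard, license, license-front, "
        "license-back, tnb, receipt. Return JSON: "
        '{"detected_category": "...", "confidence": 0.0}\n\n' + document_text
    )
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    # Parse the model's JSON reply into a dict with detected_category and confidence
    return json.loads(response["output"]["message"]["content"][0]["text"])
```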
# Specify category explicitly (skips auto-detection)
uv run python cli.py --file document.pdf --mode tfbq --category tnb

Use `--custom` to override category-based files with your own queries/prompts:
# Custom mode with explicit queries (ignores category query files)
uv run python cli.py --file document.pdf --mode q --custom --queries "What is the date?;What is the amount?"
# Custom mode with explicit prompt (ignores category prompt files)
uv run python cli.py --file document.pdf --mode tfb --custom --prompt "Extract all dates as JSON"
# Custom mode for categories without extensive default files
uv run python cli.py --file license-back.jpeg --mode q --category license-back --queries "What is the license number?"

Custom Mode Rules:
- If `--custom` is used and no custom queries/prompts are provided, the system checks for category files
- All new categories have supporting files, but `--custom` can override them
- Use custom mode to test new queries or prompts for existing categories
=== TEXT DETECTION ===
text = "Sample Text" | confidence = 99.89
=== FORM ANALYSIS ===
Key: Value pairs extracted from forms
=== TABLE ANALYSIS ===
Structured table data with rows and columns
=== QUERY ANALYSIS ===
Q: What is the transaction amount?
A: $100.00
=== BLUR DETECTION ===
Textract confidence - Median: 99.89, Avg: 99.75, Std: 0.51
Quality assessment: excellent
Overall: CLEAR (confidence: high)
=== BEDROCK EXTRACTION ===
{
"transaction_date": "2025-09-15",
"transaction_amount": "$100.00",
"beneficiary_name": "John Doe"
}
{
"status": "success",
"console_output": "Processing log...",
"text": [{ "text": "Sample Text", "confidence": 99.89 }],
"forms": {
"Key": ["Value"]
},
"tables": {
"tables": [{ "table_id": 1, "rows": [["Cell1", "Cell2"]] }]
},
"queries": {
"What is the amount?": "$100.00"
},
"category_detection": {
"detected_category": "receipt",
"confidence": 0.95,
"timestamp": "2025-09-15T12:34:56+00:00"
},
"blur_analysis": {
"textract_analysis": {
"median_confidence": 99.89,
"average_confidence": 99.62,
"std_confidence": 0.51,
"quality_assessment": "excellent"
},
"overall_assessment": {
"is_blurry": false,
"confidence_level": "high"
}
},
"extracted_data": {
"transaction_date": "2025-09-15",
"transaction_amount": "$100.00"
}
}

{
"error": "Error description",
"returncode": 1,
"stdout": "...",
"stderr": "..."
}

Create a `.env` file in the project root to configure AWS credentials and API endpoints:
# AWS Credentials (choose one method)
# Method 1: Direct credentials (for development)
AWS_ACCESS_KEY_ID=your_access_key_here
AWS_SECRET_ACCESS_KEY=your_secret_key_here
# Method 2: Use AWS CLI profile (recommended for production)
# Leave AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY empty to use AWS CLI profile
# Configure with: aws configure --profile your-profile-name
# Lambda API Endpoints (optional - for testing deployed APIs)
ANALYZE_API_URL=https://your-api-id.execute-api.us-east-1.amazonaws.com/dev/analyze
ANALYZE_HEALTH_API_URL=https://your-api-id.execute-api.us-east-1.amazonaws.com/dev/health
# Runtime Configuration
BEDROCK_MODEL=amazon.nova-lite-v1:0 # Change to desired Bedrock model
LAMBDA_RUNTIME=false # Set to 'true' to simulate Lambda environment locally
AWS_REGION=us-east-1 # Default AWS region
# AWS_PROFILE=default                  # AWS CLI profile to use

| Variable | Required | Description | Example |
|---|---|---|---|
| `AWS_ACCESS_KEY_ID` | ❌ | AWS access key for authentication | `AKIAIOSFODNN7EXAMPLE` |
| `AWS_SECRET_ACCESS_KEY` | ❌ | AWS secret key for authentication | `wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY` |
| `ANALYZE_API_URL` | ❌ | Lambda API endpoint for document analysis | `https://abc123.execute-api.us-east-1.amazonaws.com/dev/analyze` |
| `ANALYZE_HEALTH_API_URL` | ❌ | Lambda API health check endpoint | `https://abc123.execute-api.us-east-1.amazonaws.com/dev/health` |
| `LAMBDA_RUNTIME` | ❌ | Simulate Lambda environment locally | `true` or `false` (default: `false`) |
| `AWS_REGION` | ❌ | AWS region for services | `us-east-1`, `us-west-2`, etc. |
| `AWS_PROFILE` | ❌ | AWS CLI profile name | `default`, `dev`, `prod` |
- Copy the template above to create your `.env` file
- Choose authentication method:
  - Direct credentials: Fill in `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`
  - AWS CLI profile: Leave credentials empty, set `AWS_PROFILE` to your profile name
- Configure API URLs (only needed for testing deployed Lambda functions)
- Set region and other optional variables as needed
- Never commit `.env` to version control (it's already in `.gitignore`)
- Use AWS CLI profiles for production environments
- Rotate credentials regularly and use IAM roles when possible
- Use least-privilege permissions (see AWS Permissions Required below)
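As a sketch of how these settings might be consumed at runtime (this assumes the `python-dotenv` package; the project's own loading code may differ):

```python
import os
import boto3
from dotenv import load_dotenv  # assumption: python-dotenv is installed

load_dotenv()  # read .env from the project root

region = os.getenv("AWS_REGION", "us-east-1")
access_key = os.getenv("AWS_ACCESS_KEY_ID")
secret_key = os.getenv("AWS_SECRET_ACCESS_KEY")

if access_key and secret_key:
    # Method 1: direct credentials from .env
    session = boto3.Session(
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
        region_name=region,
    )
else:
    # Method 2: fall back to an AWS CLI profile
    session = boto3.Session(
        profile_name=os.getenv("AWS_PROFILE", "default"),
        region_name=region,
    )

textract = session.client("textract")
```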
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"textract:DetectDocumentText",
"textract:AnalyzeDocument",
"bedrock:InvokeModel"
],
"Resource": "*"
}
]
}

- PDF: Up to 11 pages, max 5 MB
- JPEG/JPG: Max 5 MB
- PNG: Max 5 MB
- Request Size: 6 MB (affects base64 file uploads)
- Timeout: 5 minutes maximum
- Memory: Configurable up to 10 GB
- Blur Detection: Uses Textract confidence analysis (no OpenCV)
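Because base64 encoding inflates a file by roughly a third, it is worth checking the request size before calling the Lambda API. A minimal pre-flight check might look like this (illustrative helper, not part of the project):

```python
import base64
import json

MAX_REQUEST_BYTES = 6 * 1024 * 1024  # Lambda/API Gateway request limit (~4.5 MB original file)

def build_payload(path: str, mode: str = "tfbq") -> str:
    """Base64-encode a file and refuse to build a request body that exceeds the 6 MB limit."""
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    body = json.dumps({"file_content": encoded, "filename": path, "mode": mode})
    if len(body.encode("utf-8")) > MAX_REQUEST_BYTES:
        raise ValueError("Request body exceeds the ~6 MB Lambda limit; use the local CLI instead")
    return body
```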
Provide custom questions for Textract to answer about the document:
# Single query
uv run python cli.py --file document.pdf --mode q --queries "What is the total amount?"
# Multiple queries (semicolon or newline separated)
uv run python cli.py --file document.pdf --mode q --queries "What is the date?;What is the amount?;Who is the recipient?"
# Multiline format
uv run python cli.py --file document.pdf --mode q --queries "What is the transaction date?
What is the reference number?
What is the beneficiary name?"

Query Best Practices:
- Ask specific, direct questions
- Use clear, simple language
- Questions should be answerable from visible text
- Avoid overly complex or interpretive questions
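Under the hood, custom queries map onto Textract's `AnalyzeDocument` QUERIES feature. A bare-bones boto3 call would look roughly like the sketch below; the project's own wrapper lives in `src/textract_enhanced.py`:

```python
import boto3

textract = boto3.client("textract", region_name="us-east-1")

with open("media/license.jpeg", "rb") as f:
    document_bytes = f.read()

response = textract.analyze_document(
    Document={"Bytes": document_bytes},
    FeatureTypes=["QUERIES"],
    QueriesConfig={"Queries": [
        {"Text": "What is the license number?"},
        {"Text": "What is the expiry date?"},
    ]},
)

# QUERY blocks carry the question; QUERY_RESULT blocks carry the detected answer text
answers = [b["Text"] for b in response["Blocks"] if b["BlockType"] == "QUERY_RESULT"]
print(answers)
```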
Provide custom instructions for Bedrock AI to extract structured data:
# Basic extraction
uv run python cli.py --file document.pdf --mode tfb --prompt "Extract all monetary amounts and dates as JSON"
# Structured JSON extraction
uv run python cli.py --file receipt.pdf --mode tfb --prompt "Extract: {\"merchant\": \"store name\", \"total\": \"amount as number\", \"date\": \"YYYY-MM-DD format\"}"
# Bank receipt extraction
uv run python cli.py --file bank-receipt.pdf --mode tfb --prompt "Extract transaction details: amount, date, beneficiary name, reference ID as JSON"

Prompt Guidelines:
- Specify desired output format (JSON recommended)
- Define field names and data types
- Include formatting instructions (date formats, etc.)
- Specify how to handle missing data (use null)
# Auto-detection with custom queries
uv run python cli.py --file document.pdf --mode tfbq --queries "Additional question?"
# Explicit category with custom prompt
uv run python cli.py --file document.pdf --mode tfb --category receipt --prompt "Custom extraction prompt"
# Custom mode overriding category files
uv run python cli.py --file document.pdf --mode tfbq --category license --custom --queries "Custom questions" --prompt "Custom prompt"
# Bank receipt with required custom content
uv run python cli.py --file bank-receipt.pdf --mode q --category bank-receipt --custom --queries "What is the transaction amount?"

{
"status": "success",
"console_output": "Processing log...",
"text": [{"text": "...", "confidence": 99.5}],
"forms": {"key": "value"},
"tables": [{"headers": [], "rows": []}],
"queries": {"question": "answer"},
"blur_analysis": {
"laplacian": {"score": 4743.32, "is_blurry": false, "quality": "sharp"},
"textract_analysis": {"median_confidence": 96.84, "quality_assessment": "excellent"},
"overall_assessment": {"is_blurry": false, "confidence_level": "high"}
},
"extracted_data": {"field": "value"}
}

export AWS_REGION=us-east-1
export AWS_PROFILE=default

# Run the test script
uv run python test_auto_detection.py
# Manual testing
uv run python cli.py --file media/license.jpeg --mode tfbq
# Check log/{filename}_{timestamp}/category_detection.json for results

Queries are questions that Textract will attempt to answer based on the document content.
- Be Specific: Ask for exact information you need

  ✅ Good: "What is the expiry date?"
  ❌ Avoid: "What are the dates?"

- Use Clear Language: Simple, direct questions work best

  ✅ Good: "What is the full name?"
  ✅ Good: "What is the license class?"
  ✅ Good: "What is the address?"

- Avoid Duplicates: Don't repeat queries from category files

  # Check existing queries first
  cat src/queries/license.txt

- Format Correctly: Separate multiple queries with semicolons or new lines (see the parsing sketch below)

  # Using semicolons
  --queries "What is the license number?;What is the transaction amount?;What is the account number?"

  # Using new lines (in scripts or multi-line input)
  --queries "What is the license number?
  What is the transaction amount?
  What is the account number?"
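Splitting on both delimiters takes only a couple of lines; here is a sketch (the CLI's own parsing may differ):

```python
import re

def parse_queries(raw: str) -> list[str]:
    """Split a --queries string on semicolons or newlines and drop empty entries."""
    return [q.strip() for q in re.split(r"[;\n]", raw) if q.strip()]

print(parse_queries("What is the license number?;What is the transaction amount?"))
# ['What is the license number?', 'What is the transaction amount?']
```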
Driver's License (src/queries/license.txt):
What is the date of birth?
What is the expiry date?
What is the license validity period?
What is the license number?
What is the string below license number?
What is the license class?
What is the address?
License Front (src/queries/license-front.txt):
what is the identity No.?
What is the date of birth?
what is the nationality?
What is the license class?
What is the license validity period?
What is the address?
License Back (src/queries/license-back.txt):
What is the license number?
Receipt (src/queries/receipt.txt):
Who is the beneficiary?
What is the beneficiary account number?
Which bank is receiving the payment?
What is the recipient reference?
What is the reference ID?
What are the payment details / description?
What is the transaction amount?
When was the transaction successfully completed?
ID Card (src/queries/idcard.txt):
What is the full name?
What is the ID number?
What is the address?
What is the gender?
TNB Bill (src/queries/tnb.txt):
What is the No. Akaun (account number)?
What is the No. Invois (invoice number)?
Prompts are used by Bedrock AI for structured data extraction in src/prompts/{category}.txt.
Driver's License (src/prompts/license.txt):
Extract Malaysian driving license fields from the provided data.
Return STRICTLY valid JSON matching this schema:
{
"full_name": string|null,
"identity_no": string|null,
"license_number": string|null,
}
CRITICAL RULES:
- ONLY extract data that is EXPLICITLY present in the input
- DO NOT make up or guess any values
- If a field is not found, use null
- Full name: ONLY if explicitly found in the data
- Identity number: ONLY from "No. Pengenalan / Identity No." field
- license number: ONLY a combination of 2 parts
* first part, 7-digit numeric codes that are clearly license numbers (NOT dates, NOT identity numbers), e.g. "1234567"
* second part, 8-digit alphanumeric codes that are a randomised mix of upper/lowercase letters and/or numbers, e.g. "AbC12xYz"
* join the two parts with a space in between, e.g. "1234567 AbC12xYz"
- Return only valid JSON, no explanations
Receipt (src/prompts/receipt.txt):
Extract the following information from this Payment Receipt text and return as JSON:
{
"transaction_amount": "Transaction amount as displayed",
"transaction_type": "Transaction type as printed",
"merchant": "Merchant name as printed",
"payment_method": "Payment method as printed",
"date_time": "Date and time of transaction",
"wallet_reference": "Wallet reference number as printed",
"transaction_status": "Transaction status as printed",
"transaction_number": "Transaction number as printed"
}
Rules:
- Return only valid JSON.
- Use the exact text as printed, don't interpret.
- If any field is missing, return null instead of skipping.
- For "transaction_amount", always return a positive value.
ID Card (src/prompts/idcard.txt):
Extract the following information from this ID card text and return as JSON:
{
"full_name": "Full name of the ID holder",
"userId": "ID card number",
"gender": "Gender if available",
"address": "Full address if available",
}
Rules:
- Use null for missing information
- Extract exact text, don't interpret
- Return only valid JSON
TNB Bill (src/prompts/tnb.txt):
Extract the following information from this TNB Bill text and return as JSON:
{
"account_number": "Exact value of No. Akaun as printed on the bill",
"invoice_number": "Exact value of No. Invois as printed on the bill",
}
Rules:
- Use null if No. Akaun is not found
- Extract exact text, don't interpret or reformat
- Return only valid JSON
# Custom queries only
uv run python cli.py --file document.pdf --mode q --queries "Your question?"
# Category + custom queries
uv run python cli.py --file document.pdf --mode tfbq --category license --queries "Additional question?"
# Custom prompt for AI extraction
uv run python cli.py --file document.pdf --mode tfb --prompt "Extract specific fields as JSON"
# Via Lambda API with custom prompt
uv run python test_lambda.py --file document.pdf --prompt "Your custom prompt" --api-url YOUR_URL

The `--prompt` parameter allows you to override category-based prompts for Bedrock AI extraction, enabling rapid prototyping and adaptation to new document types.
Simple Extraction:
--prompt "Extract the name, date, and amount from this document and return as JSON."

Structured JSON Output:
--prompt "Extract the following information and return as JSON:
{
\"document_type\": \"type of document\",
\"issuer\": \"issuing organization\",
\"recipient\": \"recipient name\",
\"date_issued\": \"date in YYYY-MM-DD format\",
\"amount\": \"monetary amount as number\",
\"reference_number\": \"reference or ID number\"
}"

Receipt Analysis:
--prompt "Analyze this receipt and extract:
{
\"merchant\": \"store name\",
\"date\": \"transaction date\",
\"total\": \"total amount\",
\"tax\": \"tax amount\",
\"items\": [\"list of purchased items\"]
}
Return only valid JSON."

Invoice Processing:
--prompt "Extract invoice details as JSON:
{
\"invoice_number\": \"invoice ID\",
\"vendor\": \"vendor name\",
\"customer\": \"customer name\",
\"date\": \"invoice date\",
\"due_date\": \"payment due date\",
\"subtotal\": \"subtotal amount\",
\"tax_rate\": \"tax percentage\",
\"total\": \"total amount\"
}"

- Specify Output Format: Always request JSON for structured data
- Define Field Names: Use clear, consistent field names
- Handle Missing Data: Instruct to use null for missing information
- Format Guidelines: Specify date formats, number formats, etc.
- Validation Rules: Add constraints for better accuracy
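Model replies are not always perfectly clean JSON, so callers may want a defensive parse that also enforces the "use null for missing data" rule. A small illustrative helper (the field list is an example, not a project schema):

```python
import json

EXPECTED_FIELDS = ["merchant", "date", "total", "tax"]  # example schema only

def parse_model_output(raw: str) -> dict:
    """Extract the first JSON object from the model reply; missing fields become None (null)."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        return {field: None for field in EXPECTED_FIELDS}
    data = json.loads(raw[start:end + 1])
    return {field: data.get(field) for field in EXPECTED_FIELDS}
```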
# Module not found errors
uv sync
# AWS credentials not configured
aws configure
# Permission denied errors
aws sts get-caller-identity

# Serverless deployment fails
npm install -g serverless serverless-python-requirements
# Function timeout
# Increase timeout in serverless.yml or use smaller files
# Memory errors
# Increase memory allocation in serverless.yml

# Test local Lambda function
uv run python local_test.py
# Test deployed API
uv run python test_lambda.py --api-url YOUR_URL --file media/license.jpeg --mode t
# Check API Gateway logs
aws logs describe-log-groups --log-group-name-prefix /aws/lambda/document-ingestion-and-text-extraction-api

- Local: No size limit (within AWS service limits)
- Lambda: 6 MB request limit for base64 encoded files (~4.5 MB original file)
- Use `--mode t` for fastest processing (text only)
- Smaller files process faster
- Lambda has cold start delay (~1-3 seconds)
| Feature | Local CLI | Lambda API |
|---|---|---|
| Deployment | No deployment needed | Serverless deployment |
| Scaling | Single instance | Auto-scaling |
| File Upload | Direct file path | Base64 in JSON |
| AWS Credentials | Local AWS config | IAM role |
| Blur Detection | Full OpenCV analysis | Textract confidence analysis + API field |
| Timeout | No limit | 5 minutes |
| File Size | AWS service limits | 6 MB request limit |
| Cost | Compute + AWS services | Lambda + AWS services |
| Cold Start | None | 1-3 seconds |
- Fork the repository
- Create a feature branch
- Make your changes
- Test both local and Lambda versions
- Submit a pull request
This project is licensed under the MIT License.
Happy Document Analysis!