Amazon Textract PDF Text Extractor

Improve data extraction and document processing with Amazon Textract

This project provides a mechanism for using Amazon Textract to extract meaningful, actionable data from a wide range of complex, multi-format PDF files. PDF files are challenging because they can contain a variety of data elements, such as headers, footers, tables with data in multiple columns, images, graphs, and sentences and paragraphs in different formats. We explore the data extraction phase of intelligent document processing (IDP), as shown in the following figure, and how it connects to the steps involved in a document process, such as ingestion, extraction, and post-processing.
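To make the extraction phase more concrete, here is a minimal sketch of one post-processing step: grouping the text lines that Textract detects by page. It assumes the documented Textract response shape, a "Blocks" list whose entries carry "BlockType", "Text", and "Page" fields. The function name lines_by_page and the sample data are hypothetical and are not taken from this repository's Lambda code.

```python
from collections import defaultdict
from typing import Dict, List


def lines_by_page(blocks: List[dict]) -> Dict[int, List[str]]:
    """Group the text of LINE blocks by page number, in the order received.

    Hypothetical helper for illustration; not part of this project.
    """
    pages: Dict[int, List[str]] = defaultdict(list)
    for block in blocks:
        # Textract also returns PAGE, WORD, TABLE, and CELL blocks, among
        # others; this sketch keeps only whole lines of text.
        if block.get("BlockType") == "LINE":
            # Default to page 1 if a block carries no "Page" value.
            pages[block.get("Page", 1)].append(block.get("Text", ""))
    return dict(pages)


# Hand-written sample in the shape of a Textract "Blocks" list (not real output).
sample_blocks = [
    {"BlockType": "PAGE", "Page": 1},
    {"BlockType": "LINE", "Page": 1, "Text": "Quarterly Report"},
    {"BlockType": "WORD", "Page": 1, "Text": "Quarterly"},
    {"BlockType": "LINE", "Page": 2, "Text": "Revenue by region"},
]

if __name__ == "__main__":
    for page, lines in sorted(lines_by_page(sample_blocks).items()):
        print(page, lines)
```

In a real workflow the blocks would come from the Textract API responses rather than a literal list, and tables or headers would need their own handling.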

Solution Architecture

Prerequisites

You can deploy this solution from either AWS Cloud9 or your local machine.

Prerequisites for local setup

Python 3.8 or later. Download and install the latest version of Python for your OS.
AWS CLI version 2. If you already have the AWS CLI installed, upgrade to version 2.0.5 or later.
AWS CDK
Docker
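The prerequisites above can be checked before deploying. The following is a minimal sketch, assuming the tools are on the PATH under their usual executable names (aws, cdk, docker). The function missing_prerequisites is a hypothetical helper, not part of this project, and it does not verify the AWS CLI version.

```python
import shutil
import sys
from typing import Callable, List, Optional, Tuple

# Usual executable names for the AWS CLI, AWS CDK, and Docker.
REQUIRED_TOOLS = ["aws", "cdk", "docker"]
MIN_PYTHON = (3, 8)


def missing_prerequisites(
    which: Callable[[str], Optional[str]] = shutil.which,
    python_version: Tuple[int, int] = tuple(sys.version_info[:2]),
) -> List[str]:
    """Return a list of prerequisites that appear to be missing.

    `which` and `python_version` are injectable so the check can be
    exercised without changing the real environment.
    """
    problems = []
    if python_version < MIN_PYTHON:
        problems.append("python >= 3.8")
    # shutil.which returns None when an executable is not found on PATH.
    problems.extend(tool for tool in REQUIRED_TOOLS if which(tool) is None)
    return problems


if __name__ == "__main__":
    missing = missing_prerequisites()
    print("All prerequisites found." if not missing else f"Missing: {missing}")
```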

Deployment Instructions

Clone this repo to your local machine or Cloud9 environment.
Run the following commands from the root of the cloned repo:

pip install -r requirements.txt

cdk bootstrap

cdk deploy SimpleAsyncWorkflow
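If you prefer to run the three deployment commands from a single script, one possible approach is sketched below. The wrapper run_deployment is hypothetical and not part of this repo. It defaults to a dry run that only reports the commands, and it assumes the script is started from the root of the cloned repo with AWS credentials already configured.

```python
import subprocess
from typing import List

# The deployment commands from this README, in order.
DEPLOY_COMMANDS = [
    ["pip", "install", "-r", "requirements.txt"],
    ["cdk", "bootstrap"],
    ["cdk", "deploy", "SimpleAsyncWorkflow"],
]


def run_deployment(dry_run: bool = True) -> List[str]:
    """Run each deployment command in order and return the command lines.

    With dry_run=True nothing is executed. With dry_run=False,
    check=True stops at the first command that exits with an error.
    """
    executed = []
    for command in DEPLOY_COMMANDS:
        executed.append(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)
    return executed


if __name__ == "__main__":
    for line in run_deployment(dry_run=True):
        print(line)
```

Call run_deployment(dry_run=False) to execute the commands for real.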

Execution Instructions

Follow the instructions in the blog post.

License

This library is licensed under the MIT-0 License.

kein903/PDF-Extract
