Improve data extraction and document processing with Amazon Textract
This project provides a mechanism for using Amazon Textract to extract meaningful, actionable data from a wide range of complex, multi-format PDF files. PDF files are challenging because they can contain a variety of data elements, such as headers, footers, tables with data in multiple columns, images, graphs, and sentences and paragraphs in different formats. We explore the data extraction phase of intelligent document processing (IDP), as shown in the following figure, and how it connects to the other steps of a document process, such as ingestion, extraction, and post-processing.
You can deploy this solution from either AWS Cloud9 or your local machine.
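As a rough illustration of the extraction phase, the following Python sketch starts an asynchronous Textract text-detection job for a PDF stored in Amazon S3 and groups the detected lines by page. It is not the code deployed by this project; the function names (`start_text_detection`, `fetch_all_blocks`, `lines_by_page`) are hypothetical, and it assumes that `boto3` is installed, that AWS credentials are configured, and that the job has already finished when the results are fetched.

```python
from collections import defaultdict


def start_text_detection(bucket, key, region_name=None):
    """Start an asynchronous Textract job for a PDF in S3 and return its JobId."""
    import boto3  # imported here so the parsing helpers work without boto3

    client = boto3.client("textract", region_name=region_name)
    response = client.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return response["JobId"]


def fetch_all_blocks(job_id, client):
    """Collect every block of a finished job, following NextToken pagination."""
    blocks = []
    kwargs = {"JobId": job_id}
    while True:
        page = client.get_document_text_detection(**kwargs)
        if page["JobStatus"] != "SUCCEEDED":
            # Assumption: the caller waits (or polls) until the job is done.
            raise RuntimeError(f"Job not finished: {page['JobStatus']}")
        blocks.extend(page.get("Blocks", []))
        token = page.get("NextToken")
        if not token:
            return blocks
        kwargs["NextToken"] = token


def lines_by_page(blocks):
    """Group the text of LINE blocks by page number, ignoring other block types."""
    pages = defaultdict(list)
    for block in blocks:
        if block.get("BlockType") == "LINE":
            pages[block.get("Page", 1)].append(block.get("Text", ""))
    return dict(pages)
```

For tables and forms, Textract also offers `start_document_analysis`, which returns additional block types; the grouping logic above would need to be extended to handle them.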
- Download and install the latest version of Python for your OS from here. We will be using Python 3.8+.
- You will need to install version 2 of the AWS CLI as well. If you already have AWS CLI, please upgrade to a minimum version of 2.0.5 following the instructions on the link above.
- AWS CDK
- Docker
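The prerequisites above can be checked with a short script before deploying. The following is a minimal sketch, not part of this repository: the helper names are hypothetical, and it assumes that `aws --version` prints a string beginning with `aws-cli/<major>.<minor>.<patch>` and that the AWS CDK and Docker executables are named `cdk` and `docker` on your `PATH`.

```python
import re
import shutil
import subprocess
import sys

MIN_PYTHON = (3, 8)
MIN_AWS_CLI = (2, 0, 5)


def parse_aws_cli_version(output):
    """Return (major, minor, patch) from `aws --version` output, or None."""
    match = re.search(r"aws-cli/(\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def check_prerequisites(aws_version_output, which=shutil.which,
                        python_version=sys.version_info[:2]):
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    if tuple(python_version) < MIN_PYTHON:
        problems.append("Python 3.8 or later is required")
    version = parse_aws_cli_version(aws_version_output)
    if version is None or version < MIN_AWS_CLI:
        problems.append("AWS CLI 2.0.5 or later is required")
    for tool in ("cdk", "docker"):
        if which(tool) is None:
            problems.append(f"{tool} was not found on PATH")
    return problems


if __name__ == "__main__":
    try:
        result = subprocess.run(["aws", "--version"],
                                capture_output=True, text=True)
        # Capture both streams, since the version line may appear on either.
        aws_output = result.stdout + result.stderr
    except FileNotFoundError:
        aws_output = ""
    for problem in check_prerequisites(aws_output):
        print(problem)
```

If the script prints nothing, the tools listed above were found at the expected minimum versions.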
- Clone this repo to your local machine or Cloud9 environment.
- Run the following commands:
  pip install -r requirements.txt
  cdk bootstrap
  cdk deploy SimpleAsyncWorkflow
Follow the instructions in the blog post.
This library is licensed under the MIT-0 License.