Process for Gathering structured data out of a PDF

This git repo describes a process that a pharmaceutical company can use to turn the information on scanned release forms into structured form. In this case, structured form means a CSV file and a document in a MongoDB database.
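As a rough illustration of the two target formats, the sketch below writes a list of already extracted records to a CSV file and hands copies of them to a MongoDB-style collection. The field names come from the Business Problem section; the function names `write_csv` and `store_documents` are hypothetical and are not taken from the notebook or the script. `collection` is assumed to be any object with an `insert_many` method, such as a pymongo collection.

```python
import csv

# Field names as listed in the Business Problem section.
FIELDNAMES = ["ChB", "MatNr", "FS", "GI", "FOL"]


def write_csv(records, csv_path):
    """Write one CSV row per extracted record (hypothetical helper)."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(records)


def store_documents(records, collection):
    """Insert copies of the records into a MongoDB-style collection.

    Assumption: `collection` exposes `insert_many`, as a pymongo
    collection does. Copies are inserted so the caller's dicts are
    not modified by the driver (pymongo adds an "_id" key).
    """
    documents = [dict(record) for record in records]
    if documents:
        collection.insert_many(documents)
    return len(documents)
```

Passing the collection in as a parameter keeps the sketch independent of any particular database connection, which is configured elsewhere.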

This is done using Python. The applied modules are explained in the notebook.

Business Problem

A pharmaceutical company needs to match its batch numbers to some other information. Unfortunately, all of this information is stored only in printed documents. Fortunately, these documents all have the same structure and the same wording. See an example below.

We need to extract the ChB, MatNr, FS, GI and FOL numbers and save them in a structured way, as mentioned above. Keep in mind that this operation has to be carried out on several thousand documents.
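Because every form uses the same wording, the five values can in principle be pulled out of the recognized page text with one regular expression per label. The sketch below is a minimal illustration under a strong assumption: each label is followed on the same line by an optional colon and then the value. The real form layout is not shown here, so the pattern, the `extract_fields` name and the sample text are all hypothetical.

```python
import re

FIELDS = ("ChB", "MatNr", "FS", "GI", "FOL")


def extract_fields(page_text):
    """Return a dict of field label -> value, or None where no value is found.

    Assumed layout (not confirmed by the source): "<label>[:] <value>"
    on a single line, with the value running up to the next whitespace.
    """
    record = {}
    for label in FIELDS:
        pattern = rf"\b{re.escape(label)}\b[ \t]*:?[ \t]*(\S+)"
        match = re.search(pattern, page_text)
        record[label] = match.group(1) if match else None
    return record


# Made-up sample text standing in for the OCR output of one page.
sample = "ChB: 12345A\nMatNr: 987654\nFS 01\nGI: 4711\nFOL: F-0042"
record = extract_fields(sample)
```

Returning `None` for a missing field, rather than raising, makes it easy to flag pages where recognition failed and review them by hand.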

To make this possible, all the paper documents are scanned into a single large document with several thousand pages.

Files

Batch_GI_PDF.ipynb

This notebook describes the whole process.

Batch_GI_PDF.py

This is a production-ready Python script.

You can use the script for many documents (with single or multiple pages) at once.

Please be aware that this only works for image-like PDFs. Pure text PDF documents would be much faster and easier to process, but these are not covered in this repo.
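For image-like PDFs, each page first has to be rendered to an image and then passed through OCR before any text can be searched. The sketch below shows that loop. It assumes the `pdf2image` and `pytesseract` packages as stand-ins; the modules actually used are the ones explained in the notebook and may differ. The function name `ocr_pdf_pages` and its `render` and `recognize` parameters are hypothetical.

```python
def ocr_pdf_pages(pdf_path, render=None, recognize=None):
    """Yield (page_number, text) for every page of a scanned PDF.

    Assumptions: by default pages are rendered with
    pdf2image.convert_from_path (needs Poppler installed) and read with
    pytesseract.image_to_string (needs the Tesseract binary). Both can
    be swapped out through the `render` and `recognize` arguments.
    """
    if render is None:
        from pdf2image import convert_from_path
        render = convert_from_path
    if recognize is None:
        from pytesseract import image_to_string
        recognize = image_to_string

    # Page numbers start at 1 to match the numbering of the scanned document.
    for page_number, image in enumerate(render(pdf_path), start=1):
        yield page_number, recognize(image)
```

Rendering a document of several thousand pages in one call can use a lot of memory. `convert_from_path` accepts `first_page` and `last_page` arguments, so one option is to process the scan in smaller page ranges.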


ChristophGmeiner/BatchGI

Folders and files

Latest commit

History

Repository files navigation

Process for Gathering structured data out of a PDF

Business Problem

Files

Batch_GI_PDF.ipynb

Batch_GI_PDF.py

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages