GitHub - philopatrick/Document_OCR: Handwritten document ocr

Document OCR

This project is for scan the handwritten document, and transform them into text files. By using the latest deep learning models (CRAFT for text detection, TROCR for OCR transformation). This combinations can achieve very accurate results than other methods. Especially, the transformer OCR could both care about the individual word form but also the context information in the visual representation which can result better results than other methods. Related information, please read their papers()

Set up

For GPU use

conda env create -n docOCR --file environment.yml

For CPU use

conda env create -n docOCR --file environment-cpu.yml

The normal steps to do document ocr

The single OCR model cannot automatically convert a whole page of text image into the text forms. That's not the way normal ocr model work. Usually, because of trainning limitation, the model can only convert one word or one line of words into its' text forms. So, before we want to do a whole page of document ocr, we need to detect the location of the word or the location of the whole line words.

FIRST STEP, measure and crop page

So, the first step is to detect word location, here we use CRAFT-pytorch project, which is a word detection project, it can do one single word location detection and with their refiner model to detect a line of words(But, it is not very good perfermance in line text detection).

Here is one example document under processing

Although you can see, when we have some normal page, there still have some unwanted part in a page, so we can use some tools to crop the page. You can use ./utils/image_measure.py to measure the point pixel position, and later use ./utils/image_crop.py to crop the image.

After croped the image, we have the interest part we care as follow.

If all your images in a folder have a similiar layout, you can use the process_folder function to process them all. However this maybe can be done with some other deep learning model with layout guessing ability(TODO).

SECOND STEP, detect the word in document

You can detect the the image with CRAFT-pytorch\test_raw.py use: bash python test_raw.py --trained_model=CRAFT-pytorch\craft_mlt_25k.pth --test_folder=CRAFT-pytorch\testimgraw --text_threshold=0. here is the result:

However you can directly load those cropped detected word into the OCR model. But this usually doesn't promise you a good result, cause it is just individual word lost the context.

So, I reform a little bit to use the word detection to the line detection. I join the whole line of word detection into lines of bounding boxes like below:

THIRD STEP, OCR

You can pass the folder contain the cropped images in CRAFT-pytorch\test.py

## pass the folder path
images_all = glob.glob(r"D:\NLP\语料\语料\**\**\**\CroppedImage.png")

And then run with:

python test.py --trained_model=CRAFT-pytorch\craft_mlt_25k.pth --test_folder=CRAFT-pytorch\testimg --text_threshold=0.7

And then the Recognized text are stored in each image path folder in txt extention.

Remind

This script is write in a mess, there are many parameters can be tuned, if you have any problem contact me.

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
CRAFT-pytorch		CRAFT-pytorch
asset		asset
utils		utils
README.md		README.md
environment-cpu.yml		environment-cpu.yml
environment.yml		environment.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Document OCR

Set up

The normal steps to do document ocr

FIRST STEP, measure and crop page

SECOND STEP, detect the word in document

THIRD STEP, OCR

Remind

About

Releases

Packages

Languages

philopatrick/Document_OCR

Folders and files

Latest commit

History

Repository files navigation

Document OCR

Set up

The normal steps to do document ocr

FIRST STEP, measure and crop page

SECOND STEP, detect the word in document

THIRD STEP, OCR

Remind

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages