Easy way to convert scanned documents into editable text documents, classifying key-value pairs and annotating them. MTX - HackOlympics 2.0 Shaastra - 2022

FrozenWolf-Cyber/OCR

Easy way to convert scanned documents into an editable text document,
classifying key-value pairs and annotating them

Train results »    Download Results and Models »  
Live Demo »    Preview »   

About

        This project combines CRAFT, Faster R-CNN, Tesseract and a Siamese neural network model into Optical Character Recognition (OCR) software, which is hosted in the Azure cloud here (Note: annotation works only in Firefox). The neural network models are trained with PyTorch on the FUNSD dataset (referred to as FUND throughout this project), and the server runs on a virtual machine in the Azure cloud using Flask. The frontend website lets users upload a scanned document in .png, .jpg, .jpeg or .pdf format (for .pdf files, only the first page is considered). The document is converted into editable text, bounding boxes for each word and sentence, a classified label for each sentence ('other', 'question', 'answer' or 'header') and the links between sentences. The website also provides a user-friendly interface for modifying the model predictions with the annotation features; a document can also be annotated from scratch, without feeding it to the model and waiting for its predictions.

        The annotation interface is built with annotorious.js. After the model returns its result, or after annotating the document, the information can be downloaded in a simple .txt format. The model can also be run offline so that multiple images can be fed to it at once, with an option to choose whether the output is in MTX format or FUND dataset format.
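
As an illustration of the output described above, here is a minimal sketch of what one question/answer pair could look like in a FUNSD-style JSON layout. The field names (form, id, text, box, label, words, linking) follow the public FUNSD annotation layout and are an assumption about this project's output; the text, coordinates and the linked_pairs helper are hypothetical.

```python
import json

# Hypothetical sentence-level entries; all values are made up for illustration.
question = {
    "id": 0,
    "text": "Date:",
    "box": [84, 109, 136, 119],   # [x_min, y_min, x_max, y_max] in pixels
    "label": "question",          # one of: other, question, answer, header
    "words": [{"text": "Date:", "box": [84, 109, 136, 119]}],
    "linking": [[0, 1]],          # [question id, answer id] pairs
}
answer = {
    "id": 1,
    "text": "12/03/1998",
    "box": [145, 109, 230, 119],
    "label": "answer",
    "words": [{"text": "12/03/1998", "box": [145, 109, 230, 119]}],
    "linking": [[0, 1]],
}
document = {"form": [question, answer]}


def linked_pairs(doc):
    """Return the sorted (source text, target text) pairs found in the links."""
    by_id = {entry["id"]: entry for entry in doc["form"]}
    pairs = set()
    for entry in doc["form"]:
        for src, dst in entry["linking"]:
            pairs.add((by_id[src]["text"], by_id[dst]["text"]))
    return sorted(pairs)


print(json.dumps(document, indent=2))
print(linked_pairs(document))
```

Each link is stored on both entries in this sketch, so the helper collects the pairs in a set to avoid reporting the same link twice.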

        I am running the models in an Azure VM because Tesseract and Poppler are required. The VM is a Standard B2s (2 vCPUs, 4 GiB memory) running Linux (Ubuntu 18.04). I have added videos and images of the website as hosted on the Azure VM, but currently I am unable to keep the server running all the time because it is interrupted when the SSH connection is closed (I start the server on the Azure VM through an SSH connection using PuTTY). The same result can still be achieved by following the server installation and startup steps given below. I will leave the server running for as long as possible each day, so the link might work some of the time.

     Most of the model training is done with PyTorch. I have explained the training steps and the metrics used to analyze the models during training. You can download all the trained models and public test dataset predictions here.

Website link :

http://frozenwolf-ocr.westeurope.cloudapp.azure.com:5000/home

Submission Link :

https://drive.google.com/drive/folders/1rcIWV1qp_k9rbPBL-IcCa_fp1fHW7auG?usp=sharing

Built Using

Python :

Flask
pickle-mixin
numpy
Pillow
regex
pdf2image
opencv-python
scikit-image
torch
torchvision
pytesseract

Javascript :

bootstrap
annotorious

Installation

Dependencies :

tesseract-ocr
poppler-utils

1.Install server Requirements :

Minimal Installation through command :

Note: The libraries installed through this process target Python 3.6 on Ubuntu. The CPU version of PyTorch is installed in this case to minimize memory usage.

pip install -r requirements.txt

Additional Training Install Requirements (Optional) :

Note : This is required only if you want to run the .ipynb training notebooks in the training folder

matplotlib
seaborn
nltk
torchinfo
albumentations

Finally, after installing the requirements, clone the repo:

git clone https://github.com/FrozenWolf-Cyber/OCR.git

Usage

1.Starting the server :

Project structure :

server:
|   app.py
|   craft.py
|   craft_utils.py
|   imgproc.py
|   ocr_predictor.py
|   refinenet.py
|   word_Detection.py
|   
+---basenet
|   |   vgg16_bn.py
|   |   
|   \---__pycache__
|           vgg16_bn.cpython-39.pyc
|                      
+---img_save
|       requirements.txt
|       
+---saved_models
|       craft_mlt_25k.pth
|       craft_refiner_CTW1500.pth
|       embs_npa.npy
|       faster_rcnn_sgd.pth
|       siamese_multi_head.pth
|       vocab
|       
+---static
|   \---assets
|       +---bootstrap
|       |   +---css
|       |   |       bootstrap.min.css
|       |   |       
|       |   \---js
|       |           bootstrap.min.js
|       |           
|       +---css
|       |       animated-textbox-1.css
|       |       animated-textbox.css
|       |       annotorious.min.css
|       |       Codeblock.css
|       |       custom.css
|       |       custom_annotate.css
|       |       Drag--Drop-Upload-Form.css
|       |       Features-Blue.css
|       |       Footer-Basic.css
|       |       Navigation-Clean.css
|       |       PJansari---Horizontal-Stepper.css
|       |       steps-progressbar.css
|       |       
|       +---fonts
|       |       ionicons.eot
|       |       ionicons.min.css
|       |       ionicons.svg
|       |       ionicons.ttf
|       |       ionicons.woff
|       |       material-icons.min.css
|       |       MaterialIcons-Regular.eot
|       |       MaterialIcons-Regular.svg
|       |       MaterialIcons-Regular.ttf
|       |       MaterialIcons-Regular.woff
|       |       MaterialIcons-Regular.woff2
|       |       
|       +---img
|       |       bg-masthead.jpg
|       |       bg-showcase-2.jpg
|       |       bg-showcase-3.jpg
|       |       
|       \---js
|               annotate.js
|               annotorious.min.js
|               annotorious.umd.js.map
|               bs-init.js
|               navigator.js
|               recogito-polyfills.js
|               result.js
|               upload.js
|               
+---status
|       requirements.txt
|       
+---temp
|       requirements.txt
|       
+---templates
       annotate.html
       home.html
       result.html
       upload.html
       upload_annotate.html

To start the server, run app.py inside the server folder:

python app.py



Local-Server-Demo.mp4



2. Predicting multiple scanned documents offline :

To run this program, the minimal installation is enough.

Project structure

batch_run
|   app.py
|   craft.py
|   craft_utils.py
|   demo_batch_run.png
|   imgproc.py
|   ocr_predictor.py
|   predict.py
|   refinenet.py
|   tree.txt
|   word_Detection.py
|   
+---basenet
|   |   vgg16_bn.py
|   |   
|   \---__pycache__
|           vgg16_bn.cpython-39.pyc
|           
+---img_save
|       
+---result
|       
+---saved_models
|       craft_mlt_25k.pth
|       craft_refiner_CTW1500.pth
|       embs_npa.npy
|       faster_rcnn_sgd.pth
|       siamese_multi_head.pth
|       vocab
|       
+---testing_data
   +---documents
   |       your_pdf1.pdf
   |       your_pdf2.pdf
   |       your_pdf3.pdf
   \---images
        your_image1.png
        your_image2.png

Custom run :

Prediction :

Inside the batch_run folder, run:

python predict.py -path <target folder> -MTX <Y/N> -sr <Y/N> -pdf <Y/N>
usage: predict.py [-h] [-path PATH] [-MTX MTX] [-sr SR] [-pdf PDF]

optional arguments:
  -h, --help            show this help message and exit
  -path PATH, --path PATH
                        Use relative path
  -MTX MTX, --MTX MTX   Should be <Y> or <N>. If <Y> then the output will be in MTX Hacker Olympics format, if <N>
                        then the output will be of FUND dataset format
  -sr SR, --sr SR       Should be <Y> or <N>. If <Y> then the output will be saved in a seperate JSON file whereas the
                        scores for each label classification and linking will be in seperate file, if <N> then the
                        both will be in same file
  -pdf PDF, --pdf PDF   Should be <Y> or <N>. If <Y> then the target folder contains multiple .pdf documents, if <N>
                        then the folder contains multiple .png,.jpg,.jpeg documents
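
The help text above can be reproduced with a small argparse setup. The following is a minimal sketch, not the actual contents of predict.py: the yes_no converter, the build_parser function and the default values are assumptions made for illustration, and the help strings are shortened.

```python
import argparse


def yes_no(value):
    """Convert a 'Y'/'N' string into a bool, rejecting anything else."""
    normalized = value.strip().upper()
    if normalized not in ("Y", "N"):
        raise argparse.ArgumentTypeError("expected Y or N, got %r" % value)
    return normalized == "Y"


def build_parser():
    """Build a parser with the four options listed in the help text."""
    parser = argparse.ArgumentParser(prog="predict.py")
    parser.add_argument("-path", "--path", help="Use relative path")
    # Defaults below are assumptions; the source does not state them.
    parser.add_argument("-MTX", "--MTX", type=yes_no, default=False,
                        help="Y for MTX format, N for FUND dataset format")
    parser.add_argument("-sr", "--sr", type=yes_no, default=False,
                        help="Y to save scores in a separate file")
    parser.add_argument("-pdf", "--pdf", type=yes_no, default=False,
                        help="Y if the folder holds .pdf documents")
    return parser


# Same arguments as the image example in this section.
args = build_parser().parse_args(
    ["-path", "testing_data/images", "-MTX", "Y", "-sr", "N", "-pdf", "N"]
)
print(args.path, args.MTX, args.sr, args.pdf)
```

Converting Y/N to booleans at parse time means that an invalid value such as `-MTX maybe` is rejected with a usage error before any model is loaded.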

Example :

python predict.py -path testing_data/images -MTX Y -sr N -pdf N




python predict.py -path testing_data/documents -MTX Y -sr N -pdf Y




Evaluation :

Inside the batch_run folder, run:

python evaluate.py -img <Image folder> -anno <Annotations folder> -sr <Y/N>

optional arguments:
  -h, --help            show this help message and exit
  -img IMG_PATH, --img_path IMG_PATH
                        Use relative path
  -anno ANNO_PATH, --anno_path ANNO_PATH
                        Use relative path
  -sr SR, --sr SR       Should be <Y> or <N>. If <Y> then the output will be saved in a seperate JSON file whereas the
                        scores for each label classification and linking will be in seperate file, if <N> then the
                        both will be in same file

Example :


python evaluate.py -img testing_data/images -anno testing_data/annotations -sr Y



Each prediction and its scores are saved in the result folder as .json files, either together or separately depending on the configuration you selected. For evaluation, an additional metrics.json file is saved; it contains the label and linking accuracy, F-score, precision and recall values for each image separately.
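
For reference, the metrics named above can be computed as in the sketch below. It uses the standard definitions of accuracy, precision, recall and F-score for a single label; how evaluate.py computes or averages them is not shown in this README, so the precision_recall_f1 function and the sample labels are hypothetical.

```python
def precision_recall_f1(true_labels, predicted_labels, positive):
    """Standard precision, recall and F1 for one label treated as positive."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Made-up labels for the five sentences of one image.
truth = ["question", "answer", "question", "header", "other"]
pred = ["question", "answer", "answer", "header", "question"]

accuracy = sum(t == p for t, p in zip(truth, pred)) / len(truth)
print("label accuracy:", accuracy)
print("question P/R/F1:", precision_recall_f1(truth, pred, "question"))
```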

3. Website :

Note : On the website, the model returns output in the FUND dataset format; for MTX evaluation purposes, go to batch_run, where you can choose the output format. Annotation works only in Firefox.

Hosting in azure VM

Azure-Demo_low.mp4





OCR-VM


Home

You can either annotate after the model predictions or start annotating from scratch.

Home - OCR

Upload

You can either drag and drop the images or select them. The images should be in .png, .jpeg, .jpg or .pdf format. Note: For .pdf files, only the first page is considered.
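
The upload rules above could be enforced with a check like the following. This is a minimal sketch under stated assumptions, not the code in app.py: classify_upload and load_first_page are hypothetical names, and the PDF branch relies on pdf2image's convert_from_path (which needs poppler-utils installed) with first_page and last_page both set to 1.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def classify_upload(filename):
    """Return 'image', 'pdf', or None when the extension is not accepted."""
    suffix = Path(filename).suffix.lower()
    if suffix in IMAGE_SUFFIXES:
        return "image"
    if suffix == ".pdf":
        return "pdf"
    return None


def load_first_page(path):
    """Return the upload as a PIL image, using only page 1 of a PDF."""
    kind = classify_upload(path)
    if kind == "pdf":
        # Imported lazily so the extension check works without poppler.
        from pdf2image import convert_from_path
        return convert_from_path(path, first_page=1, last_page=1)[0]
    if kind == "image":
        from PIL import Image
        return Image.open(path)
    raise ValueError(f"unsupported file type: {path}")


print(classify_upload("scan.PNG"),
      classify_upload("form.pdf"),
      classify_upload("notes.txt"))
```

Restricting the conversion to page 1 avoids rendering the remaining pages of a long PDF, which matters on a small VM.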

Upload - OCR

Progress

After getting the model output, users can either continue to modify the bounding box, label, translation and linking predictions in the annotation view, or finish by downloading the result as a .txt file.

Result - OCR

Annotate

Using annotorious.js, annotation becomes much easier. To modify the words, click any one of the corresponding sentences. After annotating the image, users can download the final result as a .txt file. Instead of waiting for model predictions, users can also choose to annotate from scratch.

Annotate - OCR

RCNN Performance: ocr2

Icons made by Freepik from www.flaticon.com
