Digitization for Documents in Indian Languages

A repository that combines OCR with custom layout detection for Sanskrit and English documents. Upload your images or PDFs to the 'test images' folder of this repository, choose a model for OCR and one for layout detection, and receive the OCR output and the inferred image in your output folder.

To run lp_ocr.py, the wrapper for Layout Parser OCR, for the first time:

Create a virtual environment and activate it:

virtualenv lp_ocr
source lp_ocr/bin/activate

Install the Python packages in the environment, then the system packages for Tesseract and Poppler:

pip3 install -r requirements.txt
apt install tesseract-ocr
apt install libtesseract-dev
apt-get install poppler-utils
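
After installation, a quick check that the system tools are reachable can save a failed first run. The sketch below is not part of this repository; it assumes that `tesseract-ocr` provides the `tesseract` executable and that `poppler-utils` provides `pdftoppm`, and the helper name `missing_tools` is made up for illustration.

```python
import shutil

# Executables assumed to come from the system packages above:
# tesseract-ocr -> "tesseract", poppler-utils -> "pdftoppm".
REQUIRED_TOOLS = ("tesseract", "pdftoppm")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of the tools that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing system tools:", ", ".join(missing))
    else:
        print("All system tools found.")
```

If the script reports a missing tool, rerun the matching install command before starting lp_ocr.py.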

For layout detection and OCR:
- Run lp_ocr.py and select 'yes' when asked whether layout detection should be applied.
- Choose a custom layout model, e.g. a Sanskrit model if your image or PDF is in an Indic language.
- Choose an OCR model.
- Enter a name for your output folder. It will contain the OCR'd text and a Label Studio formatted output built from the bounding boxes inferred by the layout detection model.
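
To illustrate what a Label Studio formatted output built from bounding boxes can look like, here is a minimal sketch. It is not the code used by lp_ocr.py, and the actual output of this repository may be structured differently. It assumes boxes arrive as pixel coordinates `(x1, y1, x2, y2, label)`; the function name and the `from_name` and `to_name` defaults are hypothetical and would have to match your Label Studio labeling configuration. Label Studio expresses rectangle coordinates as percentages of the image size.

```python
def to_label_studio_result(boxes, image_width, image_height,
                           from_name="label", to_name="image"):
    """Convert pixel boxes to Label Studio rectangle results.

    boxes: iterable of (x1, y1, x2, y2, label) tuples in pixels.
    from_name / to_name: assumed names from the labeling config.
    """
    results = []
    for x1, y1, x2, y2, label in boxes:
        results.append({
            "from_name": from_name,
            "to_name": to_name,
            "type": "rectanglelabels",
            "original_width": image_width,
            "original_height": image_height,
            "value": {
                # Label Studio stores positions as percentages.
                "x": 100.0 * x1 / image_width,
                "y": 100.0 * y1 / image_height,
                "width": 100.0 * (x2 - x1) / image_width,
                "height": 100.0 * (y2 - y1) / image_height,
                "rotation": 0,
                "rectanglelabels": [label],
            },
        })
    return results
```

For example, a box from (100, 50) to (300, 150) on a 1000 x 500 pixel page becomes a rectangle at x = 10%, y = 10% that is 20% wide and 20% high.
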

For document layout analysis of an image:
- Run layout_inference.py. This returns an inferred image with masks and a JSON file with the layout data (bounding boxes).

For OCR of a directory of images:
- Run lp_ocr.py, select 'no' when asked whether layout detection should be applied, and supply your input image directory.
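
The directory mode can be pictured as a loop over image files that writes one text file per image. The sketch below is an illustration under assumptions, not the implementation in lp_ocr.py: `ocr_directory`, `IMAGE_SUFFIXES`, and the `ocr_fn` callable are hypothetical names, and the OCR engine is passed in so the loop itself has no external dependencies.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to your data.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def ocr_directory(input_dir, output_dir, ocr_fn):
    """Run ocr_fn on every image in input_dir and write one .txt per image.

    ocr_fn: hypothetical callable that takes an image path and returns text.
    Returns the list of text files written, in sorted input order.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for image_path in sorted(Path(input_dir).iterdir()):
        # Skip anything that does not look like an image file.
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        text_path = out / (image_path.stem + ".txt")
        text_path.write_text(ocr_fn(image_path), encoding="utf-8")
        written.append(text_path)
    return written
```

With pytesseract and Pillow installed, `ocr_fn` could be something like `lambda p: pytesseract.image_to_string(Image.open(p), lang="san")`, where `san` is Tesseract's language code for Sanskrit and requires the matching language data to be installed.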

Repository: Saurabhbaghel/indic-parser

Folders and files

Latest commit

History

Repository files navigation

Digitization for Documents in Indian Languages

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages