tesseract-demo

This is a demo of a small app that tries to parallelise OCR operations on TIFF files.

I was asked to investigate whether the Tesseract OCR engine could be sped up. Tesseract is great and free from Google, but it can dramatically slow down a complex workflow where tons of multi-page documents have to be converted to plain text. OCR operations seem well suited to parallelisation, and even to implementation on heterogeneous architectures (CPU + GPU): for example, see http://www.cs.uky.edu/~raphael/grad/keepingCurrent/reed-ocr.pdf

The Tesseract team itself was putting effort into running in a heterogeneous environment, involving people from ATI to port the engine to OpenCL; I got in touch with them after some trials, and realised there was still much to do.

So I profiled Tesseract to see whether there was room for a significant speed-up by rewriting only the OCR functions in CUDA. The results suggested not proceeding in that direction, since the execution time was spread across many functions (if you're interested, I can share a small analysis carried out with Valgrind's Callgrind tool and its graphical front end, KCachegrind).

So I decided to proceed this way:

one parallelisation runs in a multi-process fashion, with one instance of Tesseract per page of a document;
the other - more fun! - tries to use the Tesseract API for multi-threading on a single page, doing OCR with one thread per line (or per word);
the two are then coupled together.
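
As a rough illustration of the "one thread per line" idea, here is a minimal sketch in plain C++. It is not the code in tessapi-quality.cpp: `recognize_line` is a hypothetical placeholder for the real per-line OCR call, and lines are represented as strings only to keep the example self-contained.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for the per-line OCR call (an assumption, not a
// Tesseract function). A real version would recognise one line region.
std::string recognize_line(const std::string& line_image) {
    return "text(" + line_image + ")";
}

// Recognise every line of a page with at most `num_threads` workers.
// Worker t handles lines t, t + n, t + 2n, ... so each output slot is
// written by exactly one thread and no locking is needed.
std::vector<std::string> ocr_page(const std::vector<std::string>& lines,
                                  unsigned num_threads) {
    std::vector<std::string> out(lines.size());
    if (lines.empty()) return out;

    // Never start more workers than lines, and always at least one.
    const std::size_t n = std::max<std::size_t>(
        1, std::min<std::size_t>(num_threads, lines.size()));

    std::vector<std::thread> workers;
    workers.reserve(n);
    for (std::size_t t = 0; t < n; ++t) {
        workers.emplace_back([&lines, &out, t, n] {
            for (std::size_t i = t; i < lines.size(); i += n)
                out[i] = recognize_line(lines[i]);
        });
    }
    for (auto& w : workers) w.join();  // wait for the whole page
    return out;
}
```

Calling `ocr_page(lines, 4)` returns the recognised text in the original line order, whichever thread finished first. Build with thread support enabled (for example `-pthread` with GCC or Clang).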

I also tried vectorisation, with no significant results.

Results were quite nice: the multi-threaded version showed good scaling efficiency up to 4 threads and degraded beyond that; the multi-process parallelisation behaved as expected, with a speed-up almost linear in the number of pages (and hence of instances) for documents of up to about 10 pages.
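
For reference, speed-up and scaling efficiency are usually computed from wall-clock times as in the small sketch below. The `Scaling` struct and the `scaling` function are names chosen here for illustration; any timings passed in are up to the caller, and no measured results are implied.

```cpp
// Speed-up and efficiency of a parallel run relative to a serial run.
struct Scaling {
    double speedup;     // t_serial / t_parallel
    double efficiency;  // speedup / workers; 1.0 means perfect scaling
};

// `workers` is the number of threads or processes used in the parallel run.
Scaling scaling(double t_serial, double t_parallel, unsigned workers) {
    const double s = t_serial / t_parallel;
    return {s, s / workers};
}
```

An efficiency close to 1.0 means that adding workers still pays off; a falling value is what "degrading" looks like in numbers.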

The small code here is organised very simply: the .sh file is meant to be run with a TIFF file as its argument; it reads the number of pages in the TIFF file, then launches as many instances of tessapi-quality as there are pages. tessapi-quality is the executable generated via the makefile (just type make) by compiling and linking tessapi-quality.cpp.
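
The launcher in this repository is a shell script; the same "one process per page" pattern can be sketched in C++ on a POSIX system as follows. `run_per_page` is a hypothetical helper, and passing the page index as a second command-line argument is an assumption, not necessarily how tessapi-quality is invoked.

```cpp
#include <string>
#include <vector>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Start one child process per page, wait for all of them, and return
// how many exited with status 0.
int run_per_page(const std::string& exe, const std::string& tiff, int pages) {
    std::vector<pid_t> children;
    for (int p = 0; p < pages; ++p) {
        // Build the argument before fork() so the child does no allocation.
        const std::string page = std::to_string(p);
        const pid_t pid = fork();
        if (pid == 0) {
            // Child: replace this process with the OCR executable.
            // The (tiff, page index) argument order is an assumption.
            execlp(exe.c_str(), exe.c_str(), tiff.c_str(), page.c_str(),
                   static_cast<char*>(nullptr));
            _exit(127);  // reached only if exec failed
        }
        if (pid > 0) children.push_back(pid);  // parent: remember the child
    }

    int ok = 0;
    for (const pid_t pid : children) {
        int status = 0;
        if (waitpid(pid, &status, 0) == pid && WIFEXITED(status) &&
            WEXITSTATUS(status) == 0)
            ++ok;
    }
    return ok;
}
```

For example, `run_per_page("./tessapi-quality", "scan.tif", 10)` would start ten instances at once and report how many succeeded; with many more pages than cores, a cap on the number of simultaneous children would be a sensible addition.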

I don't know if I'm going to work on this code again soon - I know it needs a lot more work! - maybe if I find anyone interested.

