Shreeshrii edited this page Jun 15, 2015 · 17 revisions

Introduction

Tesseract is an Open Source OCR engine, available under the Apache 2.0 license. It can be used directly, or (for programmers) using an API. It supports a wide variety of languages.

Tesseract doesn't have a built-in GUI, but there are several available from the 3rdParty page.

Installation

There are two parts to install: the engine itself and the training data for a language.

Linux

Tesseract is available directly from many Linux distributions. The package is generally called 'tesseract' or 'tesseract-ocr'; search your distribution's repositories to find it. Packages are also generally available for language training data (again, search the repositories). If there is no package for your language, you will need to download the appropriate training data, unpack it, and copy the .traineddata file into the 'tessdata' directory, probably /usr/share/tesseract-ocr/tessdata or /usr/share/tessdata.
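As a minimal sketch of the manual copy step, the following shell functions look for the first existing 'tessdata' directory among the two locations mentioned above and copy a .traineddata file into it. The function names (find_tessdata_dir, install_traineddata) are illustrative helpers, not part of Tesseract, and the candidate paths may differ on your distribution.

```shell
#!/bin/sh
# Hypothetical helper: print the first argument that is an existing directory.
# Returns non-zero if none of the candidates exist.
find_tessdata_dir() {
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# Hypothetical helper: copy one .traineddata file into the detected directory.
# The two candidate paths are the likely locations named in the text; adjust
# them if your distribution uses a different layout.
install_traineddata() {
    file=$1
    dest=$(find_tessdata_dir /usr/share/tesseract-ocr/tessdata /usr/share/tessdata) || {
        echo "no tessdata directory found" >&2
        return 1
    }
    cp "$file" "$dest/"
}
```

Writing to these directories normally requires root privileges, so a call such as install_traineddata deu.traineddata would typically be run from a root shell.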

If Tesseract isn't available for your distribution, or you want to use a newer version than it offers, you can compile your own. Note that older versions of Tesseract only supported processing .tiff files.

Mac OS X

The easiest way to install Tesseract is with MacPorts. Once MacPorts is installed, you can install Tesseract by running the command sudo port install tesseract, and any language with sudo port install tesseract-<langcode>. The list of available langcodes can be found on the MacPorts tesseract page. Another option is to install Tesseract using Homebrew with the command:

brew install tesseract

Windows

An installer is available for Windows from our download page. This includes the English training data.

If you want to use another language, download the appropriate training data, unpack it using 7-zip, and copy the .traineddata file into the 'tessdata' directory, probably C:\Program Files\Tesseract OCR\tessdata.

MSYS2

(Copied from post by rubenvb at http://stackoverflow.com/questions/29960825/error-during-making-xz-5-2-1-with-mingw-msys )

Install and update MSYS2. ( http://sourceforge.net/p/msys2/wiki/MSYS2%20installation/ )

Open an MSYS2 command prompt (or the 32-bit or 64-bit command prompts if you plan on building 32-bit or 64-bit things) from the start menu entries.

Install {32-bit,64-bit} MinGW-w64 GCC:

pacman -S mingw-w64-{i686,x86_64}-gcc

Install tesseract-OCR:

pacman -S mingw-w64-{i686,x86_64}-tesseract-ocr

and optionally the data files:

pacman -S mingw-w64-tesseract-ocr-osd mingw-w64-{i686,x86_64}-tesseract-ocr-eng

And you're done. Of course, you can still compile the various dependencies yourself, but why bother? If you really want to, you can start from the build scripts for the packages you can install in MSYS2, which are located here:

https://github.com/Alexpux/MINGW-packages

Just open the PKGBUILD files and you can see the build steps required. Note that all these scripts assume the dependencies have been installed within MSYS2.

Also note that the installed packages and compilers are all independent of MSYS2 as you'd expect: you can use it only as a tool to keep your development tree up to date, and build from any other Windows environment.

Other Platforms

Tesseract may work on more exotic platforms too. You can either try compiling it yourself, or take a look at the list of other projects using Tesseract.

Running Tesseract

Tesseract is a command-line program, so first open a terminal or command prompt. The command is used like this:

  tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]

So basic usage to do OCR on an image called 'myscan.png' and save the result to 'out.txt' (Tesseract appends the .txt extension to the output base) would be:

  tesseract myscan.png out

Or to do the same with German:

  tesseract myscan.png out -l deu

Tesseract also includes a hOCR mode, which produces a special HTML file with the coordinates of each word. This can be used to create a searchable PDF, using a tool such as Hocr2PDF. To enable it, pass the 'hocr' config option, like this:

  tesseract myscan.png out hocr

You can also create a searchable PDF directly with Tesseract (version 3.03 or later):

  tesseract myscan.png out pdf

More information about the various options is available in the Tesseract manpage.
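To process several images in one go, the command can be wrapped in a loop. The sketch below assumes each image should produce a text file next to it with the same base name; output_base and ocr_all are illustrative names, not part of Tesseract.

```shell
#!/bin/sh
# Hypothetical helper: derive the output base by dropping the last extension,
# e.g. myscan.png gives myscan. Assumes the file name has an extension.
output_base() {
    printf '%s\n' "${1%.*}"
}

# Hypothetical helper: run tesseract on every existing file given after the
# language code, using the invocation form shown above.
ocr_all() {
    lang=$1
    shift
    for image in "$@"; do
        [ -f "$image" ] || continue   # skip arguments that are not files
        tesseract "$image" "$(output_base "$image")" -l "$lang"
    done
}
```

For example, ocr_all deu page1.png page2.png would run Tesseract with the German training data on both images and write page1.txt and page2.txt.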

Other Languages

Tesseract has been trained for many languages; check for your language on the Downloads page. It can also be trained to support other languages and scripts; for more details see TrainingTesseract.

Development

Tesseract can also be used in your own project, under the terms of the Apache License 2.0. It has a fully featured API, and can be compiled for a variety of targets including Android and the iPhone. See the 3rdParty page for a sample of what has been done with it.

Also, it's free software, so if you want to pitch in and help, please do! If you find a bug and fix it yourself, the best thing to do is to attach the patch to your bug report in the Issues List.

Support

First read the Wiki, particularly the FAQ, to see if your problem is addressed there. If not, search the Tesseract user forum or the Tesseract developer forum, and if you still can't find what you need, please ask us there.
