zdenop edited this page May 22, 2015 · 1 revision

A list of useful control parameters and config files.

Introduction

Tesseract is extremely flexible if you know how to control it. A large number of control parameters modify its behaviour. While these change from time to time, most of them are fairly stable. A list of all parameters, each with its default value and a short description, can be retrieved with:

tesseract --print-parameters
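
The listing is long, so it can help to load it into a lookup table. The following is a minimal sketch, assuming each parameter line has the layout name, tab, value, tab, description; that layout is an assumption to check against your build. The sample text is made up for illustration: the names and defaults come from this page, but the descriptions are placeholders.

```python
def parse_parameters(text):
    """Parse name<TAB>value<TAB>description lines into a dict."""
    params = {}
    for line in text.splitlines():
        fields = line.split("\t", 2)
        if len(fields) != 3:
            continue  # skip headers, blank lines and anything unexpected
        name, value, description = fields
        params[name.strip()] = (value.strip(), description.strip())
    return params


# Made-up sample in the assumed layout (descriptions are placeholders).
SAMPLE = (
    "Tesseract parameters:\n"
    "load_system_dawg\t1\tLoad the main dictionary.\n"
    "user_words_suffix\t\tExtension of the user word list.\n"
    "language_model_penalty_non_dict_word\t0.15\tPenalty for non-dictionary words.\n"
)

params = parse_parameters(SAMPLE)
print(params["language_model_penalty_non_dict_word"][0])
```

To run it against a real installation, the sample could be replaced by the captured output of the command above, for example via `subprocess.run(["tesseract", "--print-parameters"], capture_output=True, text=True).stdout`.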

There are 3 different types:

Init only

Characterized by INIT in their initialization macro. These parameters can only be set in the call to TessBaseAPI::Init, which takes a list of config files.

The rest can be set through TessBaseAPI::SetVariable and fall into 2 further groups:

General parameters

Control many different aspects of Tesseract's functionality.

Debug parameters

Contain debug in their name and control huge amounts of optional debug text and graphical output as Tesseract works.
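
The three types can be sketched as a small routing helper: init-only settings become lines of a config file (one `name value` pair per line) to be handed to TessBaseAPI::Init, while everything else becomes name/value pairs suitable for TessBaseAPI::SetVariable. This is an illustration only. The helper names `classify` and `route` are hypothetical, and the set of init-only names is an assumption the caller must supply; here it holds the two init-only parameters listed on this page.

```python
# Assumption: the caller supplies the init-only names; these two are the
# parameters this page marks as init only.
INIT_ONLY = {"load_system_dawg", "user_words_suffix"}


def classify(name, init_only=INIT_ONLY):
    """Return 'init', 'debug' or 'general' for a parameter name."""
    if name in init_only:
        return "init"
    if "debug" in name:
        return "debug"
    return "general"


def route(settings, init_only=INIT_ONLY):
    """Split settings into config-file lines and runtime (name, value) pairs."""
    config_lines, runtime = [], []
    for name, value in settings.items():
        if classify(name, init_only) == "init":
            # Config files hold one "name value" pair per line.
            config_lines.append(f"{name} {value}")
        else:
            # SetVariable takes both the name and the value as strings.
            runtime.append((name, str(value)))
    return config_lines, runtime


config_lines, runtime = route({
    "load_system_dawg": 0,
    "language_model_penalty_non_dict_word": 0.2,
})
print(config_lines)
print(runtime)
```

The values 0 and 0.2 are arbitrary examples, not recommendations.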

Useful parameters

Note that the default value may change; check the source code if you need to be sure of it.

| Name | Type | Default value | Init only | Description |
| --- | --- | --- | --- | --- |
| load_system_dawg | boolean (0/1) | 1 | Yes | Controls whether or not to load the main dictionary for the selected language. |
| user_words_suffix | string | "" | Yes | The extension of the user-words word list file. If non-empty, Tesseract will attempt to load the relevant list of words to add to the dictionary for the selected language. E.g. if set to user-words, Tesseract will attempt to load eng.user-words from the tessdata directory at initialization time. |
| language_model_penalty_non_dict_word | double (0-1) | 0.15 | No | The penalty to apply to words not in the word_dawg / user_words wordlists. |
| language_model_penalty_non_freq_dict_word | double (0-1) | 0.1 | No | The penalty to apply to words not in the freq_dawg wordlist. |

Useful parameters for Japanese and Chinese

Some Japanese Tesseract users found these parameters helpful for increasing tesseract-ocr (3.02) accuracy for Japanese:

| Name | Suggested value | Description |
| --- | --- | --- |
| chop_enable | T | Chop enable. |
| use_new_state_cost | F | Use new state cost heuristics for segmentation state evaluation. |
| segment_segcost_rating | F | Incorporate segmentation cost in word rating? |
| enable_new_segsearch | 0 | Enable new segmentation search path. This setting may solve the problem of one character being split into two. |
| language_model_ngram_on | 0 | Turn on/off the use of character ngram model. |
| textord_force_make_prop_words | F | Force proportional word segmentation on all rows. |
| edges_max_children_per_outline | 40 | Max number of children inside a character outline. Increase this value if some kanji characters are not recognized (rejected). |
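
To try these suggestions together, they can be collected into a single config file. The sketch below renders the suggested values as config text, one `name value` pair per line. The helper name `render_config` is hypothetical, and since the values were reported for 3.02, some of these parameters may not exist in other versions.

```python
# Suggested values copied from the table above (reported for Tesseract 3.02).
JAPANESE_PARAMS = {
    "chop_enable": "T",
    "use_new_state_cost": "F",
    "segment_segcost_rating": "F",
    "enable_new_segsearch": "0",
    "language_model_ngram_on": "0",
    "textord_force_make_prop_words": "F",
    "edges_max_children_per_outline": "40",
}


def render_config(params):
    """Render parameters as config-file text, one 'name value' pair per line."""
    return "".join(f"{name} {value}\n" for name, value in params.items())


config_text = render_config(JAPANESE_PARAMS)
print(config_text, end="")
```

Saved under a name of your choice (say `jpn_tuning`, a made-up file name), the file could then be passed as a trailing config argument, along the lines of `tesseract image.png out -l jpn jpn_tuning`.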

If you find a parameter that you consider useful, please add a comment to this page, and we will copy it into the page as we find it.