
Introduction

Akkadian is a language that was spoken and written in ancient Mesopotamia from around 2600 to 600 BCE and remained in academic and liturgical use until about 100 CE. It was written in the cuneiform script. The two most important variants of Akkadian are Babylonian and Assyrian; cuneiform signs have different forms in these two variants, which becomes important in the training section below.

Data set

The data set was scraped from the website of ORACC (Open Richly Annotated Cuneiform Corpus). Only those ORACC projects that contain texts written in cuneiform were scraped: RIBo (Royal Inscriptions of Babylonia online), RINAP (Royal Inscriptions of the Neo-Assyrian Period) and SAAo (State Archives of Assyria Online).

The scraper was written in Python 3.5 and is uploaded here. For each scraped document it opens the popup with the cuneiform text, extracts its contents, cleans them up and saves them into a text file. All saved text files were then simply concatenated into a single file, which is the Tesseract training text corpus.
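
The concatenation step itself is a single command; a minimal sketch, assuming the scraper saved its output as per-document files in a directory named scraped/ and that the corpus file follows the usual langdata naming convention:

# Assumption: per-document text files live in scraped/; adjust paths to your setup
cat scraped/*.txt > akk.training_text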

Text corpus was then processed with the create_dictdata tool from the pytesstrain package, which created additional files needed for training a Tesseract model. These training files were added both to the LSTM-based language data repository (here) and to the legacy language data repository (here).

Training

The data set was used for training Tesseract models for the Akkadian language. Below you can read about training the LSTM-based Tesseract model and then, for the sake of completeness, about training the legacy Tesseract model.

Training LSTM-based Tesseract (4 and above) model

This setup was first created by Shreeshrii here and updated by the author here.

The training runs on Ubuntu 22.04 or any other Linux distribution with Tesseract 4 or above.

First, update the package database and install the required packages (replace vim with your favourite editor):

apt update
apt install -y tesseract-ocr git vim bc python3-pip python3-venv

Then check out the training repository and the language data:

git clone https://github.com/wincentbalin/tesstrain-akk
cd tesstrain-akk
# Get a cup of tea after executing the next line
git clone https://github.com/tesseract-ocr/langdata_lstm

Create a Python virtual environment, activate it and install the required packages:

python3 -m venv venv
. ./venv/bin/activate
pip3 install kraken pytesstrain

Add the ISRI Analytic Tools for OCR Evaluation with UTF-8 support:

apt install -y build-essential libutf8proc-dev
git clone https://github.com/eddieantonio/ocreval
cd ocreval
make
make install
cd ..
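
The build installs the classic ISRI command-line tools such as accuracy and wordacc; a minimal sketch of their typical invocation, where the file names are placeholders:

# Placeholders: replace with your ground truth and OCR output files
accuracy ground_truth.txt ocr_output.txt accuracy_report.txt
wordacc ground_truth.txt ocr_output.txt wordacc_report.txt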

Then install the ocrevalUAtion tool:

apt install -y wget default-jre
mkdir ~/ocrevaluation
wget -O ~/ocrevaluation/ocrevaluation.jar https://github.com/impactcentre/ocrevalUAtion/releases/download/v1.3.4/ocrevalUAtion-1.3.4-jar-with-dependencies.jar
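
ocrevalUAtion is an ordinary runnable jar; a hedged example of starting it (consult the project's documentation for the command-line options it accepts):

java -jar ~/ocrevaluation/ocrevaluation.jar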

Now install the fonts: copy the *.[ot]tf files from the following list to /usr/share/fonts:

List the real font names and, if needed, update the font list (you can use any installed editor instead of vim):

text2image --fonts_dir /usr/share/fonts --list_available_fonts
vim akk.fontslist.txt
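
If newly copied fonts do not appear in the list, refreshing the fontconfig cache may help, since text2image renders text via Pango, which resolves fonts through fontconfig (a hedged suggestion):

fc-cache -f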

Copy the language data for Akkadian:

cp -a langdata_lstm/akk langdata

If needed, adjust Tesseract configuration by adding the file akk.config to the language data:

cp akk.config langdata

To adjust the iteration count, go to the last line in the script trainlayer.sh, which starts with MAX_ITERATIONS=, and change the value.
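
The same change can be made non-interactively; a minimal sketch with sed, where the value 10000 is only an example:

# Replace only the numeric value after MAX_ITERATIONS=; 10000 is an example
sed -i 's/^MAX_ITERATIONS=[0-9]*/MAX_ITERATIONS=10000/' trainlayer.sh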

Then execute Shreeshrii's scripts:

# Prepare training data
bash txt2img.sh | tee txt2img.log
bash img2lstmf.sh | tee img2lstmf.log
# Perform training
bash trainlayer.sh | tee trainlayer.log
# Evaluate results
bash checkpointeval.sh | tee reports/checkpointeval.txt

During training, models are created from the checkpoint files and copied to the directories data/akk/tessdata_best and data/akk/tessdata_fast. Study the evaluation results in the file reports/checkpointeval-summary.txt (more details are in the file reports/checkpointeval.txt) and choose the model that best suits your needs.
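
To try out a chosen model, point Tesseract at the corresponding tessdata directory; a minimal sketch, where page.png is a placeholder image and the selected model is assumed to be named akk.traineddata (rename the chosen file if necessary):

# Placeholder image; -l akk assumes the model file is named akk.traineddata
tesseract page.png page --tessdata-dir data/akk/tessdata_best -l akk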

Training for legacy Tesseract (3.05)

Training for Tesseract 3 was done using this Makefile. It performs a conventional Tesseract 3 training workflow. The fonts used are CuneiformNA, CuneiformOB and CuneiformComposite (all three downloaded from ORACC), as well as Segoe UI Historic (shipped with current versions of Windows). The exposures used ranged from -3 to 3.

The training runs on Ubuntu 16.04 or any other Linux distribution with Tesseract 3.04 or 3.05.

First, update the package database and install the required packages (replace vim with your favourite editor):

apt update
apt install -y tesseract-ocr git vim wget python3-pip python3-venv

Create a Python virtual environment, activate it and install the required packages (the package versions are pinned for Ubuntu 16.04, which ships Python 3.5):

python3 -m venv venv
. ./venv/bin/activate
pip3 install wheel
pip3 install Pillow==7.2 pytesseract==0.3.6 pytesstrain

To be able to calculate additional font metrics, install Nick White's tools:

apt install -y libpango1.0-dev
git clone http://ancientgreekocr.org/grctraining.git
cd grctraining
make tools/addmetrics tools/xheight
cp tools/addmetrics tools/xheight /usr/local/bin
cd ..

Add the fonts as described in the previous section and, additionally, list their names:

text2image --fonts_dir /usr/share/fonts --list_available_fonts

Get the Makefile:

wget https://gist.githubusercontent.com/wincentbalin/9329a6e994852ed477ba30ef4c29e71c/raw/0d2a8a76eea42698ede299406accca8b361c00fe/Makefile

Clone language data:

git clone https://github.com/tesseract-ocr/langdata.git ../langdata

To change the fonts to train with, edit the variables FONTS and FONTSJOINED (the latter is needed only because make does not handle values containing spaces correctly). Add the corresponding font rules at the bottom of the Makefile.

To change the training corpus, edit the variable CORPUS or set it on the command line (e.g. make CORPUS=another_textfile.txt).

Run make or, to utilise 4 processors, make -j 4.
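
The parallel build and the variable override can be combined; a small example, where the corpus file name is a placeholder:

make -j 4 CORPUS=another_textfile.txt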

The training takes about 1.5 days with the supplied training text and 4 fonts. The resulting file is akk.traineddata in the current directory.
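
To use the resulting model with Tesseract 3, copy it into the tessdata directory; a minimal sketch, assuming the default path of the Ubuntu tesseract-ocr package and a placeholder image name (adjust both for your installation):

# Path assumption: Ubuntu 16.04 tesseract-ocr package; image.png is a placeholder
cp akk.traineddata /usr/share/tesseract-ocr/tessdata/
tesseract image.png output -l akk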

Results

The .zip archive with models for legacy Tesseract and for LSTM-based Tesseract (in both the best and the fast variant) is available here. All models within this archive were trained with 9 fonts.

The models for LSTM-based Tesseract with 4 fonts are available here.
