OPUS - a collection of parallel corpora and tools

Structure of the repository

Python packages: opustools, polyglot, fast-mosestokenizer
Perl modules: OpusTools, Uplug and dependencies
subalign (for subtitle conversion and alignment)
pdftotext, recode, tidy, pigz, GNU parallel and other common GNU/Unix tools
Moses and eflomal (optional, for word alignment and phrase table extraction)
the IMS Open Corpus Workbench (CWB) and the CWB Perl modules (optional, for CWB index generation)
yasa (optional; our fork from https://github.com/Helsinki-NLP/yasa)

git clone git@github.com:Helsinki-NLP/OPUS-ingest.git
cd OPUS-ingest
git submodule update --init --recursive --remote
make install

The last step (make install) will most likely fail. Check the error messages and the Makefile for details.
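
Before digging into the Makefile, it may help to see which of the external command-line tools listed above are missing from your PATH. The following is only a minimal sketch: the helper name missing_tools is hypothetical and not part of this repository, and the command names are assumed from the tool list above (your system may install them under different names).

```shell
#!/bin/sh
# Hypothetical helper (not part of OPUS-ingest):
# print the name of every given command that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Command names are assumptions based on the tool list in this README;
# adjust them if your system uses other names.
missing_tools pdftotext recode tidy pigz parallel
```

If the script prints nothing, all five commands were found. Any name it prints is a tool to install before running make install again. It does not check the Python packages, the Perl modules or the optional components.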

NOTE: The documentation below requires serious updates!

make build scripts more readable
consistent language codes
get rid of hard-coded paths to tools and make the repo more general and less dependent on specific environments (like the one on puhti/CSC)
better documentation (as always)
more efficient pre-processing
consistent pre-processing (UD-based?)
more frequent corpus updates (Tatoeba, wikimedia and other frequently changing corpora)
streamline corpus creation, processing and maintenance procedures
improve integration/updates of OPUS-API and website updates
…