This is a neural network-based machine translation system developed at the University of Helsinki.
It is currently rather experimental, but the user interface and setup procedure should be simple enough for people to try out.
- biLSTM encoder which can be either character-based or hybrid word/character (Luong & Manning 2016)
- LSTM decoder which can be either character-based or word-based
- Beam search with coverage penalty based on Wu et al. (2016)
- Partial support for byte pair encoding (Sennrich et al., 2015)
- Variational dropout (Gal 2015) and Layer Normalization (Ba et al. 2016)
- A GPU if you plan to train your own models
- Python 3.4 or higher
- Theano (use the development version)
- BNAS
- NLTK for tokenization, but note that HNMT also supports pre-tokenized data from external tokenizers
- efmaral if you want to try the experimental supervised attention feature (not recommended, but see below)
If Theano and BNAS are installed, you should be able to simply run hnmt.py. Run with the --help argument to see the available command-line options.
Training a model on the Europarl corpus can be done like this:
```
python3 hnmt.py --source europarl-v7.sv-en.en \
                --target europarl-v7.sv-en.sv \
                --source-tokenizer word \
                --target-tokenizer char \
                --source-vocabulary 50000 \
                --max-source-length 30 \
                --max-target-length 180 \
                --batch-size 32 \
                --training-time 24 \
                --log en-sv.log \
                --save-model en-sv.model
```
This will create a model with a hybrid encoder (with 50k vocabulary size and
character-level encoding for the rest) and character-based
decoder, filtering out sentences longer than 30 words (source) or 180
characters (target) and training for 24 hours. Development set cross-entropy
and some other statistics are appended to the log file, which is usually the best way
of monitoring training. Training loss and development set translations will be
written to stdout, so redirecting this or using tee is recommended.
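The hybrid encoder mentioned above follows Luong & Manning (2016): words inside the vocabulary are looked up in a word embedding table, while words outside the 50k vocabulary are composed from their characters. The following is a minimal, illustrative sketch of that idea, not HNMT's code (HNMT is Theano-based and composes out-of-vocabulary words with a character LSTM; the sketch simply averages character vectors to stay short):

```python
# Illustrative sketch of hybrid word/character encoding (Luong & Manning 2016).
# Not HNMT's implementation: the character LSTM is replaced by a simple mean.
import numpy as np

rng = np.random.default_rng(0)
DIM = 8
vocab = {"the": 0, "cat": 1, "sat": 2}          # stands in for the 50k word vocabulary
word_emb = rng.normal(size=(len(vocab), DIM))
char_emb = {c: rng.normal(size=DIM) for c in "abcdefghijklmnopqrstuvwxyz"}

def embed_token(token):
    if token in vocab:                          # in-vocabulary: word embedding
        return word_emb[vocab[token]]
    chars = [char_emb[c] for c in token if c in char_emb]
    return np.mean(chars, axis=0)               # OOV: compose from characters

sentence = "the zebra sat".split()
encoded = np.stack([embed_token(t) for t in sentence])
print(encoded.shape)                            # (3, 8): one vector per token
```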
The resulting model can be used like this:
```
python3 hnmt.py --load-model en-sv.model \
                --translate test.en --output test.sv \
                --beam-size 10
```
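The --beam-size option controls the width of the beam search, which scores hypotheses with a length normalization and coverage penalty based on Wu et al. (2016). The sketch below shows the general form of that scoring; the function and parameter names (and the alpha/beta values) are illustrative and may not match HNMT's exact implementation:

```python
# Rough sketch of beam-search rescoring with length normalization and coverage
# penalty in the style of Wu et al. (2016). Illustrative only.
import numpy as np

def rescore(log_prob, attention, alpha=0.2, beta=0.2):
    """log_prob: total log-probability of a finished hypothesis.
    attention: (target_len, source_len) attention weights for that hypothesis."""
    target_len, source_len = attention.shape
    lp = ((5.0 + target_len) ** alpha) / ((5.0 + 1.0) ** alpha)  # length penalty
    coverage = attention.sum(axis=0)                             # total attention per source word
    cp = beta * np.log(np.minimum(coverage, 1.0)).sum()          # penalize under-attended source words
    return log_prob / lp + cp

attn = np.full((4, 3), 1.0 / 3)   # toy attention matrix: 4 target steps, 3 source words
print(rescore(-6.0, attn))
```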
Note that when training a model from scratch, parameters can be set on the
command line; otherwise the hard-coded defaults are used. When continuing
training or doing translation (i.e. whenever the --load-model argument is
used), the parameters stored in the given model file are used instead, although some of
them (those that do not change the network structure) can still be overridden by
command-line arguments.
For instance, the model above will assume that input files need to be tokenized, but a pre-tokenized (space-separated) input can be passed as follows:
```
python3 hnmt.py --load-model en-sv.model \
                --translate test.en --output test.sv \
                --source-tokenizer space \
                --beam-size 10
```
You can resume training by using the --load-model argument without --translate (which disables training). For instance, if you want to keep training the model above for another 48 hours on the same data:
```
python3 hnmt.py --load-model en-sv.model \
                --training-time 48 \
                --save-model en-sv-72h.model
```
Select a tokenizer from the following options:
- space: pre-segmented with spaces as separators
- char: split into character sequences
- word: use wordpunct from nltk
- bpe: pre-segmented with BPE (remove '@@ ' from final output)
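To make the options concrete, here is roughly what each one does to a small example sentence (illustrative only; the BPE segmentation shown is made up, and NLTK is assumed to be installed as listed in the requirements):

```python
# Illustrative examples of the tokenizer options described above.
from nltk.tokenize import wordpunct_tokenize

text = "Don't panic!"
print(text.split(" "))                # space: input is already segmented on spaces
print(list(text))                     # char: one symbol per character
print(wordpunct_tokenize(text))       # word: ['Don', "'", 't', 'panic', '!']
bpe_output = "Do@@ n't pan@@ ic !"    # bpe: input/output carry '@@ ' continuation markers
print(bpe_output.replace("@@ ", ""))  # remove '@@ ' from the final translation output
```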
TODO: support BPE as internal segmentation (apply_bpe to training data)
Install the Python bindings for efmaral (i.e. run python3 setup.py install in the efmaral directory).
Then you can simply add --alignment-loss 1.0 when training to activate this
feature (the number specifies the contribution of the alignment/attention
cross-entropy to the loss function). By default this contribution decays
exponentially (per batch); the decay rate can be specified with
--alignment-decay 0.9999 or similar.
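To make the decay concrete, the following small sketch shows how the weight of the alignment term evolves over batches under such a scheme (the variable names are illustrative, not HNMT internals):

```python
# Illustrative only: how an exponentially decaying alignment-loss weight behaves.
alignment_loss = 1.0      # value given to --alignment-loss
alignment_decay = 0.9999  # value given to --alignment-decay

def alignment_weight(batch_nr):
    # weight multiplying the alignment/attention cross-entropy after batch_nr batches
    return alignment_loss * alignment_decay ** batch_nr

for n in (0, 10000, 100000):
    print(n, round(alignment_weight(n), 4))
# prints: 0 1.0, 10000 0.3679, 100000 0.0
```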