This is a neural network-based machine translation system developed at the University of Helsinki.
It is currently rather experimental, but the user interface and setup procedure should be simple enough for people to try out.
- biLSTM encoder which can be either character-based or hybrid word/character (Luong & Manning 2016)
- LSTM decoder which can be either character-based or word-based
- Beam search with coverage penalty based on Wu et al. (2016)
- Partial support for byte pair encoding (Sennrich et al., 2015)
- Variational dropout (Gal 2015) and Layer Normalization (Ba et al. 2016)
- A GPU if you plan to train your own models
- Python 3.4 or higher
- Theano (use the development version)
- BNAS
- NLTK for tokenization, but note that HNMT also supports pre-tokenized data from external tokenizers
- efmaral if you want to try the experimental supervised attention feature (not recommended, but see below)
If Theano and BNAS are installed, you should be able to simply run hnmt.py. Run with the --help argument to see the available command-line options.
Training a model on the Europarl corpus can be done like this:
```
python3 hnmt.py --source europarl-v7.sv-en.en \
                --target europarl-v7.sv-en.sv \
                --source-tokenizer word \
                --target-tokenizer char \
                --source-vocabulary 50000 \
                --max-source-length 30 \
                --max-target-length 180 \
                --batch-size 32 \
                --training-time 24 \
                --log en-sv.log \
                --save-model en-sv.model
```
This will create a model with a hybrid encoder (with 50k vocabulary size and
character-level encoding for the rest) and character-based
decoder, filtering out sentences longer than 30 words (source) or 180
characters (target) and training for 24 hours. Development set cross-entropy
and some other statistics are appended to the log file, which is usually the best way
of monitoring training. Training loss and development set translations will be
written to stdout, so redirecting this or using tee is recommended.
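The hybrid encoder mentioned above follows Luong & Manning (2016): words inside the vocabulary are looked up in a word embedding table, while words outside the 50k vocabulary are composed from their characters. The following is a minimal, illustrative sketch of that idea, not HNMT's code (HNMT is Theano-based and composes out-of-vocabulary words with a character LSTM; the sketch simply averages character vectors to stay short):

```python
# Illustrative sketch of hybrid word/character encoding (Luong & Manning 2016).
# Not HNMT's implementation: the character LSTM is replaced by a simple mean.
import numpy as np

rng = np.random.default_rng(0)
DIM = 8
vocab = {"the": 0, "cat": 1, "sat": 2}          # stands in for the 50k word vocabulary
word_emb = rng.normal(size=(len(vocab), DIM))
char_emb = {c: rng.normal(size=DIM) for c in "abcdefghijklmnopqrstuvwxyz"}

def embed_token(token):
    if token in vocab:                          # in-vocabulary: word embedding
        return word_emb[vocab[token]]
    chars = [char_emb[c] for c in token if c in char_emb]
    return np.mean(chars, axis=0)               # OOV: compose from characters

sentence = "the zebra sat".split()
encoded = np.stack([embed_token(t) for t in sentence])
print(encoded.shape)                            # (3, 8): one vector per token
```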
The resulting model can be used like this:
```
python3 hnmt.py --load-model en-sv.model \
                --translate test.en --output test.sv \
                --beam-size 10
```
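The --beam-size option controls the width of the beam search, which scores hypotheses with a length normalization and coverage penalty based on Wu et al. (2016). The sketch below shows the general form of that scoring; the function and parameter names (and the alpha/beta values) are illustrative and may not match HNMT's exact implementation:

```python
# Rough sketch of beam-search rescoring with length normalization and coverage
# penalty in the style of Wu et al. (2016). Illustrative only.
import numpy as np

def rescore(log_prob, attention, alpha=0.2, beta=0.2):
    """log_prob: total log-probability of a finished hypothesis.
    attention: (target_len, source_len) attention weights for that hypothesis."""
    target_len, source_len = attention.shape
    lp = ((5.0 + target_len) ** alpha) / ((5.0 + 1.0) ** alpha)  # length penalty
    coverage = attention.sum(axis=0)                             # total attention per source word
    cp = beta * np.log(np.minimum(coverage, 1.0)).sum()          # penalize under-attended source words
    return log_prob / lp + cp

attn = np.full((4, 3), 1.0 / 3)   # toy attention matrix: 4 target steps, 3 source words
print(rescore(-6.0, attn))
```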
Note that when training a model from scratch, parameters can be set on the
command line; otherwise the hard-coded defaults are used. When continuing
training or doing translation (i.e. whenever the --load-model argument is
used), the parameters stored in the given model file are used instead, although some of
them (those that do not change the network structure) can still be overridden by
command-line arguments.
For instance, the model above will assume that input files need to be tokenized, but a pre-tokenized (space-separated) input can be passed as follows:
```
python3 hnmt.py --load-model en-sv.model \
                --translate test.en --output test.sv \
                --source-tokenizer space \
                --beam-size 10
```
You can resume training by using the --load-model argument without --translate (which disables training). For instance, if you want to keep training the model above for another 48 hours on the same data:
```
python3 hnmt.py --load-model en-sv.model \
                --training-time 48 \
                --save-model en-sv-72h.model
```
Select a tokenizer from the following options:
- space: pre-segmented with spaces as separators
- char: split into character sequences
- word: use wordpunct from nltk
- bpe: pre-segmented with BPE (remove '@@ ' from final output)
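To make the options concrete, here is roughly what each one does to a small example sentence (illustrative only; the BPE segmentation shown is made up, and NLTK is assumed to be installed as listed in the requirements):

```python
# Illustrative examples of the tokenizer options described above.
from nltk.tokenize import wordpunct_tokenize

text = "Don't panic!"
print(text.split(" "))                # space: input is already segmented on spaces
print(list(text))                     # char: one symbol per character
print(wordpunct_tokenize(text))       # word: ['Don', "'", 't', 'panic', '!']
bpe_output = "Do@@ n't pan@@ ic !"    # bpe: input/output carry '@@ ' continuation markers
print(bpe_output.replace("@@ ", ""))  # remove '@@ ' from the final translation output
```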
TODO: support BPE as internal segmentation (apply_bpe to training data)
Install the Python bindings for efmaral (i.e. run python3 setup.py install in the efmaral directory).
Then you can simply add --alignment-loss 1.0 when training to activate this
feature (the number specifies the contribution of the alignment/attention
cross-entropy to the loss function). By default this contribution decays
exponentially (per batch); the decay rate can be specified with
--alignment-decay 0.9999 or similar.
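To make the decay concrete, the following small sketch shows how the weight of the alignment term evolves over batches under such a scheme (the variable names are illustrative, not HNMT internals):

```python
# Illustrative only: how an exponentially decaying alignment-loss weight behaves.
alignment_loss = 1.0      # value given to --alignment-loss
alignment_decay = 0.9999  # value given to --alignment-decay

def alignment_weight(batch_nr):
    # weight multiplying the alignment/attention cross-entropy after batch_nr batches
    return alignment_loss * alignment_decay ** batch_nr

for n in (0, 10000, 100000):
    print(n, round(alignment_weight(n), 4))
# prints: 0 1.0, 10000 0.3679, 100000 0.0
```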