
Unsupervised Machine Translation (Transformer Based UNMT)

This repository provides a TensorFlow implementation of the Transformer-based unsupervised NMT model presented in Phrase-Based & Neural Unsupervised Machine Translation (EMNLP 2018).

Requirements

  • Python 3
  • TensorFlow 1.12
  • Moses (clean and tokenize text)
  • fastBPE (generate and apply BPE codes)
  • fastText (generate embeddings)
  • (optional) MUSE (generate cross-lingual embeddings)

The data preprocessing script get_enfr_data.sh (copied from UnsupervisedMT-Pytorch, with the torch dataset-binarization command removed and the special tokens added to the vocabulary files) will take care of installing everything (except Python and TensorFlow).

Download / preprocess data

The first thing to do is to download and preprocess the data. To do so, just run:

cd UnsupervisedMT-TensorFlow
./get_enfr_data.sh

Note that there are several ways to train cross-lingual embeddings:

  • Train monolingual embeddings separately for each language, and align them with MUSE (please refer to the original paper for more details).
  • Concatenate the source and target monolingual corpora in a single file, and train embeddings with fastText on that generated file (this is what is implemented in the get_enfr_data.sh script).

The second method works better when the source and target languages are similar and share a lot of common words (such as French and English in get_enfr_data.sh). However, when the overlap between the source and target vocabulary is too small, the alignment will be very poor and you should opt for the first method using MUSE to generate your cross-lingual embeddings.
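As a rough illustration of the second option, here is a minimal sketch using the fastText Python bindings. Note that get_enfr_data.sh itself calls the fastText command-line tool, and the file names and hyperparameters below are only illustrative, not the script's exact settings:

import fasttext

# Train skip-gram embeddings on the concatenated (BPE-ized) English+French corpus,
# e.g. a file produced beforehand with: cat all.en.bpe all.fr.bpe > all.en-fr.bpe
model = fasttext.train_unsupervised(
    "all.en-fr.bpe",   # joint corpus of both languages (illustrative path)
    model="skipgram",
    dim=512,           # embedding dimension (illustrative)
    minCount=0,
    epoch=10,
    thread=8,
)
model.save_model("all.en-fr.bin")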

You can skip the preprocessing script since it takes a long time (downloading, learning/applying BPE, and fastText training). Instead, just download the datasets already prepared by get_enfr_data.sh:

cd UnsupervisedMT-TensorFlow
./download_enfr_data.sh

Train the NMT model

./run.sh

The hyperparameters in run.sh are almost identical to those in UnsupervisedMT-Pytorch, except for batch_size=2048, which is a token-level batch size.
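As a rough illustration of what a token-level batch size means (an illustration of the general technique, not the repo's actual data pipeline): sentences are packed into a batch until the total number of tokens reaches the limit, so the number of sentences per batch varies with sentence length.

def token_level_batches(sentences, max_tokens=2048):
    """Group token-id sequences so each batch holds at most max_tokens tokens."""
    batch, batch_tokens = [], 0
    for sent in sentences:
        if batch and batch_tokens + len(sent) > max_tokens:
            yield batch
            batch, batch_tokens = [], 0
        batch.append(sent)
        batch_tokens += len(sent)
    if batch:
        yield batch

# Example: short sentences pack densely, long sentences produce smaller batches.
corpus = [[1] * 30, [2] * 1000, [3] * 1500, [4] * 100]
for b in token_level_batches(corpus):
    print(len(b), "sentences,", sum(len(s) for s in b), "tokens")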

On newstest2014 en-fr, the above command should give more than 22 BLEU after 100K training steps on a P100 (similar to the Pytorch code).

Main Implementation Difference

In our code, each update computes the gradient of the summed loss from both directions (lang1 <-> lang2), while the Pytorch code performs two updates, one with the loss of each direction.
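A minimal TensorFlow 1.x sketch of this difference (the variable and loss names are illustrative, not the repo's actual graph):

import tensorflow as tf  # TF 1.x API, matching the TensorFlow 1.12 requirement

# Toy shared parameter standing in for the shared encoder/decoder weights.
w = tf.get_variable("w", shape=[4], initializer=tf.zeros_initializer())

# Illustrative per-direction losses; in the real model they come from the
# denoising auto-encoding / back-translation graphs for each direction.
loss_l1_to_l2 = tf.reduce_sum(tf.square(w - 1.0))
loss_l2_to_l1 = tf.reduce_sum(tf.square(w + 1.0))

optimizer = tf.train.AdamOptimizer(1e-4)

# This repo: a single update on the summed loss of both directions.
train_op_summed = optimizer.minimize(loss_l1_to_l2 + loss_l2_to_l1)

# Pytorch reference behaviour: two separate updates, one per direction.
train_op_dir1 = optimizer.minimize(loss_l1_to_l2)
train_op_dir2 = optimizer.minimize(loss_l2_to_l1)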

TODO

  • Mixed Data Loader (for training on monolingual and parallel datasets together)
  • Multi-GPUs Training
  • Beam Search

References

  • Phrase-Based & Neural Unsupervised Machine Translation (EMNLP 2018)
  • UnsupervisedMT-Pytorch (GitHub)
