- `data.py`: dataset preprocessing, including loading, embedding, padding, and batching
- `model.py`: implementation of the Transformer
- `train.py`: several training methods
- `loadData.ipynb`: uses torchtext to load a dataset from disk, numericalize words, generate the vocabulary, and batch it (see the sketch below)
- `characterCopy.py`: a simple copying experiment to sanity-check the model
- `IWSLTGeEnTranslation.py`: IWSLT German-English translation experiment
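For reference, the loading pipeline in `loadData.ipynb` follows the usual torchtext pattern. Below is a minimal sketch, assuming the legacy torchtext API (`Field`/`BucketIterator`, moved under `torchtext.legacy` in v0.9+) and a hypothetical tab-separated file with source and target columns:

```python
from torchtext import data  # use `from torchtext.legacy import data` on torchtext >= 0.9

# Fields define how raw text is tokenized and numericalized.
SRC = data.Field(tokenize=str.split, init_token="<sos>", eos_token="<eos>", lower=True)
TRG = data.Field(tokenize=str.split, init_token="<sos>", eos_token="<eos>", lower=True)

# "data/train.tsv" is a hypothetical path: one tab-separated (src, trg) pair per line.
dataset = data.TabularDataset(
    path="data/train.tsv", format="tsv",
    fields=[("src", SRC), ("trg", TRG)])

# Generate the vocabularies from the training data.
SRC.build_vocab(dataset, min_freq=2)
TRG.build_vocab(dataset, min_freq=2)

# BucketIterator groups sentences of similar length to minimize padding.
train_iter = data.BucketIterator(dataset, batch_size=32, sort_key=lambda ex: len(ex.src))

for batch in train_iter:
    print(batch.src.shape, batch.trg.shape)  # (seq_len, batch_size) tensors of token ids
    break
```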
Although several pretrained Chinese word embeddings are available, the choice of word segmentation method can substantially affect both the quality of the embeddings and the performance of downstream tasks; a toy illustration follows.
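The same sentence yields entirely different token inventories under word-level and character-level segmentation, so a pretrained embedding is only a good match if its segmentation scheme is reproduced. A minimal sketch, using the jieba segmenter purely for illustration (it is not one of the toolkits discussed here):

```python
import jieba  # popular Chinese segmenter, used here only for illustration

sentence = "机器翻译很有趣"  # "machine translation is fun"

# Word-level segmentation; the exact split depends on jieba's dictionary,
# e.g. something like ['机器', '翻译', '很', '有趣'].
print(list(jieba.cut(sentence)))

# Character-level segmentation: ['机', '器', '翻', '译', '很', '有', '趣'].
print(list(sentence))
```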
Name | Format | Algorithm | Dimension |
---|---|---|---|
Tencent AI Lab Embedding Corpus for Chinese Words and Phrases | text (`.txt`) | DSG (directional skip-gram) | 200 |
fastText | text (`.txt`) & binary (`.bin`) | CBOW (n=5, window=5, negative=10) | 300 |
Wikipedia2Vec | text (`.txt`) & binary (`.bin`) | skip-gram, word-based (window=5, iteration=10, negative=15) | 100 & 300 |
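The text releases above follow the standard word2vec text layout (a header line, then one word and its vector per line), so they can be loaded with gensim. A minimal sketch, assuming the fastText Chinese vectors file `cc.zh.300.vec` has already been downloaded:

```python
from gensim.models import KeyedVectors

# binary=False because this is the text release, not the .bin model.
vectors = KeyedVectors.load_word2vec_format("cc.zh.300.vec", binary=False)

print(vectors["机器"].shape)                 # (300,)
print(vectors.most_similar("机器", topn=3))  # nearest neighbours in embedding space
```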
fastText uses the Stanford Word Segmenter for Chinese, the same toolkit I used to tokenize the infoq corpus. fastText also provides an easy-to-use tool (both skip-gram and CBOW are available) for training your own embeddings, as sketched below.
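A minimal sketch of that tool via the official `fasttext` Python bindings (a `fasttext` command-line binary with equivalent options also exists). The corpus path is hypothetical and is assumed to be pre-segmented, space-separated text; the hyperparameters mirror the CBOW row in the table above:

```python
import fasttext

model = fasttext.train_unsupervised(
    "infoq_tokenized.txt",  # hypothetical path: one segmented sentence per line
    model="cbow",           # "skipgram" is also available
    dim=300, ws=5, neg=10)  # dimension, window size, negative samples

model.save_model("infoq_cbow.bin")
print(model.get_word_vector("机器").shape)  # (300,)
```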
Machine Translation Web Interface for OpenNMT
First, start the server:

```bash
cd website
/bin/bash start_server.sh path/to/OpenNMT-tf
```

Then start this website: