
# translator-structure

## File Introduction

- `data.py`: dataset preprocessing, including loading, embedding, padding, and batching
- `model.py`: implementation of the Transformer
- `train.py`: several training methods
- `loadData.ipynb`: uses torchtext to load a dataset from disk, numericalize words, build the vocab, and batch the examples (see the sketch below)
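A minimal sketch of that loading pipeline, assuming the classic torchtext `Field`/`BucketIterator` API (torchtext < 0.9); the file paths, field names, and TSV layout are illustrative, not taken from this repo:

```python
import torch
from torchtext.data import Field, TabularDataset, BucketIterator

# Fields define how raw text is tokenized, numericalized, and padded.
SRC = Field(tokenize=str.split, init_token="<sos>", eos_token="<eos>", lower=True)
TGT = Field(tokenize=str.split, init_token="<sos>", eos_token="<eos>", lower=True)

# Hypothetical TSV files with one (source, target) sentence pair per line.
train, valid = TabularDataset.splits(
    path="data", train="train.tsv", validation="valid.tsv",
    format="tsv", fields=[("src", SRC), ("tgt", TGT)])

# Build vocabularies from the training split only.
SRC.build_vocab(train, min_freq=2)
TGT.build_vocab(train, min_freq=2)

# BucketIterator groups examples of similar length to minimize padding.
train_iter, valid_iter = BucketIterator.splits(
    (train, valid), batch_size=64,
    sort_key=lambda ex: len(ex.src),
    device=torch.device("cpu"))

for batch in train_iter:
    src = batch.src  # (src_len, batch) LongTensor of token indices
    tgt = batch.tgt  # (tgt_len, batch)
    break
```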

## Training Example

- `characterCopy.py`: a simple copy task to sanity-check the model (a sketch of the setup follows this list)
- `IWSLTGeEnTranslation.py`: IWSLT German-English translation experiment
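The copy task typically looks like the following sketch (assumed, in the style of the Annotated Transformer; not copied from `characterCopy.py`): random token sequences serve as both source and target, so a correctly wired model should learn the identity mapping almost immediately.

```python
import torch

def copy_task_batches(vocab_size=11, batch_size=30, seq_len=10, n_batches=20):
    """Yield (src, tgt) pairs where tgt is an exact copy of src.

    Token 0 is reserved for padding; token 1 acts as the start symbol,
    mirroring the usual copy-task setup for sanity-checking a Transformer.
    """
    for _ in range(n_batches):
        data = torch.randint(1, vocab_size, (batch_size, seq_len))
        data[:, 0] = 1  # fixed start-of-sequence token
        yield data.clone(), data.clone()

# A model that cannot overfit this task almost certainly has a bug in
# masking, embedding, or the decoder's shifted-target handling.
for src, tgt in copy_task_batches():
    pass  # feed src/tgt into the Transformer and train it to reproduce src
```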

## Pretrained Chinese Word Embedding

Although several pretrained Chinese word embeddings are available, the word segmentation method can strongly affect the quality of the embedding and of downstream tasks.

| Name | Format | Algorithm | Dimension |
| --- | --- | --- | --- |
| Tencent AI Lab Embedding Corpus for Chinese Words and Phrases | text (`.txt`) | DSG (directional skip-gram) | 200 |
| fastText | text (`.txt`) & binary (`.bin`) | CBOW (n=5, window=5, negative=10) | 300 |
| Wikipedia2Vec | text (`.txt`) & binary (`.bin`) | skip-gram, word-based (window=5, iteration=10, negative=15) | 100 & 300 |

fastText uses the Stanford Word Segmenter for Chinese, the same toolkit I used to tokenize the infoq corpus. fastText also provides an easy-to-use tool (both skip-gram and CBOW are available) for training your own embeddings.
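Since all three resources ship a word2vec-style text format, they can be loaded the same way. A hedged sketch using gensim to copy pretrained vectors into a PyTorch embedding layer; the file path, dimension, and `vocab_itos` interface are hypothetical:

```python
import torch
import torch.nn as nn
from gensim.models import KeyedVectors

# Text-format files (Tencent, fastText .txt, Wikipedia2Vec .txt) can all be
# read with gensim's word2vec loader.
kv = KeyedVectors.load_word2vec_format("embeddings/tencent_200d.txt", binary=False)

def build_embedding(vocab_itos, dim=200):
    """Copy pretrained vectors into an nn.Embedding aligned with our vocab.

    vocab_itos: list mapping index -> token (hypothetical vocab interface).
    Words missing from the pretrained file keep a small random initialization.
    """
    weight = torch.randn(len(vocab_itos), dim) * 0.1
    for idx, word in enumerate(vocab_itos):
        if word in kv:
            weight[idx] = torch.tensor(kv[word])
    return nn.Embedding.from_pretrained(weight, freeze=False)
```

For training your own vectors, the fastText Python bindings expose `fasttext.train_unsupervised(corpus_path, model="skipgram")` (or `model="cbow"`), which mirrors the command-line tool mentioned above.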

## Web Development

### Machine Translation Web Interface for OpenNMT

First start the server:

```bash
cd website
/bin/bash start_server.sh path/to/OpenNMT-tf
```

Then start the website.