This is the code for our LoResMT 2021 paper, Morphologically-Guided Segmentation For Translation of Agglutinative Low-Resource Languages. The repository contains implementations of the subword segmentation algorithms we used, the cleaned Quechua dataset, and an Indonesian dataset, along with the pipeline for running our models.
```shell
pip install -r requirements.txt
```
- PRPE is originally from Semi-automatic Quasi-morphological Word Segmentation for Neural Machine Translation.
- The base code for PRPE was taken from https://github.com/zuters/prpe.
- Samples of the heuristics, separated out from the main algorithm for convenience, can be accessed below:
  - Quechua Heuristic
  - Indonesian Heuristic
- The generic heuristic (the general parameters of PRPE) can be found here.
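To give a feel for what a suffix-oriented heuristic does, here is a toy sketch in the spirit of PRPE-style segmentation. The suffix list is a tiny illustrative sample (wasi-kuna-pi = "house" + plural + locative); it is not the repo's algorithm, which induces prefix/root/postfix candidates from the corpus and applies the language-specific heuristics linked above.

```python
# Illustrative only: a few Quechua-like suffixes (the real heuristics use
# corpus-induced candidate lists, not a hand-written table).
SUFFIXES = ["kuna", "pi", "ta", "manta"]

def segment(word, suffixes=SUFFIXES, marker="@@"):
    """Greedily strip known suffixes off the end of a word, longest first."""
    parts = [word]
    stripped = True
    while stripped:
        stripped = False
        for suf in sorted(suffixes, key=len, reverse=True):
            stem = parts[0]
            # Only strip when a plausible stem remains.
            if stem.endswith(suf) and len(stem) > len(suf) + 2:
                parts[0] = stem[: -len(suf)]
                parts.insert(1, suf)
                stripped = True
                break
    # Mark non-final pieces so the segmentation is reversible, BPE-style.
    pieces = [p + marker for p in parts[:-1]] + [parts[-1]]
    return " ".join(pieces)
```

For example, `segment("wasikunapi")` yields `wasi@@ kuna@@ pi`.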
- The data we cleaned is found in `data/cleaned_source`. `test.es.txt` and `test.qz.txt` were created by randomly shuffling all of the parallel lines. The source data from Annette Rios can be found here.
- The Religious, News, and General Indonesian-English datasets from Benchmarking Multidomain English-Indonesian Machine Translation can be found at their repository here.
- The Religious and Magazine data from Neural machine translation with a polysynthetic low resource language can be found here.
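The shuffled test split described above can be sketched as follows. This is a hypothetical helper, not the repo's actual script; the key point is that both sides are shuffled with the same permutation so sentence pairs stay aligned.

```python
import random

def shuffled_split(src_lines, tgt_lines, n_test, seed=0):
    """Shuffle a parallel corpus jointly and slice off a test set."""
    assert len(src_lines) == len(tgt_lines), "parallel files must align"
    pairs = list(zip(src_lines, tgt_lines))
    random.Random(seed).shuffle(pairs)  # one RNG, one permutation for both sides
    test, train = pairs[:n_test], pairs[n_test:]
    return train, test
```

Writing out `test.qz.txt` and `test.es.txt` is then just a matter of unzipping the test pairs into two files, one line per sentence.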
Our entire pipeline can be run with:

```shell
python pipeline.py
```
The pipeline can take in several flags:

- `--src_segment_type` and `--tgt_segment_type` can be `none`, `bpe`, `unigram`, `prpe`, `prpe_bpe`, or `prpe_multiN` (where N is the number of iterations).
- `--model_type` can be `rnn` (aka LSTM) or `transformer`. Defaults to LSTM.
- `--in_lang` specifies the input language to be translated. We used `qz` for Quechua and `id` for Indonesian. Defaults to Quechua.
- `--out_lang` specifies the output language to be translated to. We used `es` for Spanish and `en` for English. Defaults to Spanish.
- `--domain` specifies the name of the dataset to be used, which should be located in `data/` under the same name. Defaults to religious.
- A dataset folder should include:
  - `train.{in_lang}.txt`, `validate.{in_lang}.txt`, `test.{in_lang}.txt`
  - `train.{out_lang}.txt`, `validate.{out_lang}.txt`, `test.{out_lang}.txt`
  - Example: `train.qz.txt` and `train.es.txt` for Quechua-Spanish translation.
- `--train_steps` specifies how many steps the model should be trained. Default is 100,000.
- `--save_steps` specifies how often the trained model is saved. Default is every 10,000 steps.
- `--validate_steps` specifies how often the model should be evaluated against the validation set. Default is every 2,000 steps.
- `--batch_size` is the batch size for training. Default is 64.
- `--filter_too_long` specifies the maximum token length of a line in the training set; any line exceeding this value is filtered out. Default is no filtering.
- `--src_token_lang` and `--tgt_token_lang` specify the tokenization language Moses uses. We use `es` for both languages in QZ-ES, and `en` for ID-EN.
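Put together, the flag interface described above could be sketched with `argparse` roughly as follows. The actual parser lives in `pipeline.py` and may differ; defaults not stated in this README (e.g. the segment types and token languages) are illustrative assumptions here.

```python
import argparse

def build_parser():
    """Sketch of the pipeline's command-line interface (illustrative only)."""
    p = argparse.ArgumentParser(description="segmentation + NMT pipeline")
    # Segment types: none/bpe/unigram/prpe/prpe_bpe/prpe_multiN, so no
    # `choices=` here; the default of "none" is an assumption.
    p.add_argument("--src_segment_type", default="none")
    p.add_argument("--tgt_segment_type", default="none")
    p.add_argument("--model_type", choices=["rnn", "transformer"], default="rnn")
    p.add_argument("--in_lang", default="qz")
    p.add_argument("--out_lang", default="es")
    p.add_argument("--domain", default="religious")
    p.add_argument("--train_steps", type=int, default=100_000)
    p.add_argument("--save_steps", type=int, default=10_000)
    p.add_argument("--validate_steps", type=int, default=2_000)
    p.add_argument("--batch_size", type=int, default=64)
    p.add_argument("--filter_too_long", type=int, default=None)  # no filtering
    p.add_argument("--src_token_lang", default="es")  # Moses tokenizer language
    p.add_argument("--tgt_token_lang", default="es")
    return p
```

A typical invocation would then look like `python pipeline.py --src_segment_type prpe --tgt_segment_type bpe --model_type transformer --domain religious`.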
The pipeline will automatically test the model after training is finished and output BLEU and chrF scores.
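For reference, chrF is a character n-gram F-score (F-beta with beta = 2, averaged over n-gram orders). The toy implementation below illustrates the idea; actual evaluation should use a standard tool such as sacreBLEU rather than this sketch.

```python
from collections import Counter

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Toy chrF: average character n-gram F-beta over n = 1..max_n, as a percentage."""
    hyp = hypothesis.replace(" ", "")
    ref = reference.replace(" ", "")
    scores = []
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(hyp[i:i + n] for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(ref[i:i + n] for i in range(len(ref) - n + 1))
        if not hyp_ngrams or not ref_ngrams:
            continue  # string too short for this n-gram order
        overlap = sum((hyp_ngrams & ref_ngrams).values())  # clipped matches
        prec = overlap / sum(hyp_ngrams.values())
        rec = overlap / sum(ref_ngrams.values())
        if prec + rec == 0:
            scores.append(0.0)
            continue
        scores.append((1 + beta**2) * prec * rec / (beta**2 * prec + rec))
    return 100 * sum(scores) / len(scores) if scores else 0.0
```

An identical hypothesis and reference score 100; disjoint strings score 0.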