This repository contains the data and the models described in the paper "Towards Machine Translation for the Kurdish Language". Please note that `ku` and `en` refer to Sorani Kurdish (`ckb` in ISO 639-3) and English, respectively.
We share the technical details of our models here so that future systems can be compared with this project:
- Datasets: there are two sets of datasets, each one containing training, testing and validation datasets. These sets are called Model 1 and Model 2.
- Tokenization models: you can use the tokenization models which are trained using the HuggingFace tokenizers and SentencePiece. Our models are trained using the following corpora:
- Word Embeddings: we used the fastText word vectors for Kurdish and GloVe for English.
- System outputs: the translation outputs of our best model for the two sets of data are provided in the output directory.
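
As a rough illustration of how one split of the datasets might be read for comparison with another system, the sketch below loads two line-aligned text files into sentence pairs. It assumes that each split is stored as a pair of plain-text files with one sentence per line; that layout, the function name `load_parallel`, and the file names in the usage comment are hypothetical and should be adapted to the actual files in this repository.

```python
from typing import List, Tuple


def read_lines(path: str) -> List[str]:
    """Read a UTF-8 text file and return its lines without trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def load_parallel(src_path: str, tgt_path: str) -> List[Tuple[str, str]]:
    """Pair up two line-aligned files as (source, target) sentences.

    Assumption: line i of the source file is the translation of line i
    of the target file. A length mismatch is reported as an error rather
    than silently truncated.
    """
    src_lines = read_lines(src_path)
    tgt_lines = read_lines(tgt_path)
    if len(src_lines) != len(tgt_lines):
        raise ValueError(
            f"Line count mismatch: {len(src_lines)} source vs {len(tgt_lines)} target"
        )
    return [(s.strip(), t.strip()) for s, t in zip(src_lines, tgt_lines)]


# Hypothetical usage; the paths below are placeholders, not actual repository paths:
# pairs = load_parallel("path/to/train.ku", "path/to/train.en")
# print(len(pairs), pairs[0])
```

Raising an error on mismatched line counts is a deliberate choice here: a misaligned parallel corpus would otherwise produce wrong sentence pairs without any warning.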
Shortly after this project, a set of parallel corpora containing Sorani-Kurmanji, Sorani-English and Kurmanji-English sentences was published. Check it out at https://github.com/KurdishBLARK/InterdialectCorpus.
If you use any part of the data, please consider citing this paper as follows:
@inproceedings{ahmadi-masoud-2020-towards,
title = "Towards Machine Translation for the {K}urdish Language",
author = "Ahmadi, Sina and
Masoud, Maraim",
booktitle = "Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages",
month = dec,
year = "2020",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.loresmt-1.12",
pages = "87--98"
}