[📑 ACL 2023 Paper]
[📝 Slides]
[📽️ Presentation]
[📀 Datasets]
[⚙️ Demo]
This repository contains the data and the models described in the ACL 2023 paper "Script Normalization for Unconventional Writing of Under-Resourced Languages in Bilingual Communities". The models are deployed on HuggingFace: Demo 🔥
- "mar7aba!"
- "هاو ئار یوو؟"
- "Μπιάνβενου α σε προζέ!"
What do all these sentences have in common? You are greeted in Arabic with "mar7aba" written in the Latin script, then asked how you are ("هاو ئار یوو؟") in English using the Perso-Arabic script of Kurdish, and finally welcomed to this demo in French ("Μπιάνβενου α σε προζέ!") written in the Greek script. All of these sentences are written in an unconventional script.
Although you may find these sentences amusing, unconventional writing is a common practice among millions of speakers in bilingual communities. In our paper entitled "Script Normalization for Unconventional Writing of Under-Resourced Languages in Bilingual Communities", we shed light on this problem and propose an approach to normalize noisy text written in an unconventional script.
This repository provides the code and datasets that can be used to reproduce our paper or extend it to other languages. The focus of the current project is on some of the main languages that use a Perso-Arabic script, namely the following:
- Azeri Turkish (aka South Azerbaijani, `azb`)
- Kashmiri (`kas`)
- Gilaki (`glk`)
- Gorani (aka Hawrami, `hac`)
- Northern Kurdish (aka Kurmanji, `kmr`)
- Central Kurdish (aka Sorani, `ckb`)
- Mazanderani (`mzn`)
- Sindhi (`snd`)
- Persian (`fas`)
- Arabic (`arb`)
- Urdu (`urd`)
Please note that this project is not a spell-checker and cannot correct errors that go beyond character normalization.
The data presented in the `corpus` folder have been extracted from Wikipedia dumps and cleaned using wikiextractor, unless the file name does not include `wiki` followed by the date of the dump. Here are the sources of the material for the other languages:
- Kurmanji (kmr-arb) contains crawled text collected from the following websites: rojnameyaevro.com, www.gavtv.net and https://badinan.org/
- Gorani (hac-arb) is based on the Zaza-Gorani corpus.
- Sorani (ckb-arb) is based on the Pewan and AsoSoft corpora.
All the corpora have been cleaned to a reasonable extent.
Wordlists in the `wordlist` folder contain words extracted from the corpora based on their frequency. Depending on the size and quality of the data, the frequency cut-off is in the range of 3 to 10, i.e. words that appear with a frequency of 3 to 10 are extracted as the vocabulary of the language.
To extract words from the corpora, run the following:
```
cat <file> | tr -d '[:punct:]' | tr " " "\n" | sort | uniq -c | sort -n > <file_wordlist>
```
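The pipeline above only counts occurrences; the frequency cut-off described earlier still has to be applied to its output. Here is a minimal Python sketch of the same idea (the file name and threshold value are illustrative, not part of the repository):

```python
import re
from collections import Counter

def build_wordlist(corpus_path, min_freq=3):
    """Count words in a corpus file and keep those meeting a frequency cut-off."""
    counts = Counter()
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            # drop punctuation, roughly mirroring `tr -d '[:punct:]'` in the shell pipeline
            line = re.sub(r"[^\w\s]", " ", line)
            counts.update(line.split())
    return {word: freq for word, freq in counts.items() if freq >= min_freq}

# Example usage (hypothetical file name):
# vocab = build_wordlist("data/corpus/glkwiki-20220101.txt", min_freq=5)
```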
Following this, common words in the source and target languages are identified and stored in the `common` folder. For the target languages, dictionaries are used. This folder contains two sets of common words, written either with the same spelling or slightly differently, and is organized in two sub-folders:

- `corpus-based` contains files of common words in two languages based on a corpus
- `dictionary-based` contains files of common words in two languages extracted from dictionaries

If the source language has a dictionary, the common words are provided in the `dictionary-based` folder. Otherwise, check the `corpus-based` folder. The merged vocabularies (based on corpus and dictionary) are provided in the `common` folder. If a dictionary is not available for the source language, the files in `common` are the same as those in `corpus-based`.
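As a rough illustration of the corpus-based variant, the shared vocabulary of a language pair can be approximated as the intersection of their wordlists. This is only a sketch, not the repository's `extract_loanwords.py`; the loader and the file names are assumptions:

```python
def load_words(path):
    """Load one word per line from a wordlist file."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def common_words(source_path, target_path):
    """Words spelled identically in the source- and target-language wordlists."""
    return sorted(load_words(source_path) & load_words(target_path))

# e.g. words shared between hypothetical Gilaki and Persian wordlists
# shared = common_words("data/wordlist/glk.txt", "data/wordlist/fas.txt")
```

Note that this only captures identically spelled words; the slightly different spellings mentioned above require a fuzzier comparison, e.g. a small edit-distance threshold.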
Information about the target scripts can be found in `data/scripts` as follows:
| Language | Target Script | Mapping |
|---|---|---|
| Kashmiri | Urdu | `data/scripts/Kashmiri-Urdu.tsv` |
| Sindhi | Urdu | `data/scripts/Sindhi-Urdu.tsv` |
| Mazanderani | Persian | `data/scripts/Mazanderani-Persian.tsv` |
| Gilaki | Persian | `data/scripts/Gilaki-Persian.tsv` |
| Azeri Turkish | Persian | `data/scripts/AzeriTurkish-Persian.tsv` |
| Gorani | Kurdish | `data/scripts/Gorani-Kurdish.tsv` |
| Gorani | Arabic | `data/scripts/Gorani-Arabic.tsv` |
| Gorani | Persian | `data/scripts/Gorani-Persian.tsv` |
| Kurdish | Arabic | `data/scripts/Kurdish-Arabic.tsv` |
| Kurdish | Persian | `data/scripts/Kurdish-Persian.tsv` |
Also, more meta-data about the usage of diacritics and the zero-width non-joiner (ZWNJ) in each language can be found at `data/script/info.json`. A mapping of all the scripts is also provided at `data/scripts/scripts_all.tsv`.
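If you want to use these mappings programmatically, a simple reader might look like the following; this is a minimal sketch assuming a two-column, tab-separated layout (source character, target character), which may differ from the actual files:

```python
import csv

def load_mapping(path):
    """Read a source-to-target character mapping from a two-column TSV file."""
    mapping = {}
    with open(path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) >= 2:
                mapping[row[0]] = row[1]
    return mapping

# mapping = load_mapping("data/scripts/Gilaki-Persian.tsv")
```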
Calculating the edit distance based on the wordlists, a character alignment matrix (CAT) is created for each source-target language pair. This matrix contains the normalized probability that a character in one language appears as the equivalent of a character in the other language, e.g. compare the letter 'ج' in بۆرج in Azeri Turkish with برج in Farsi. In addition to the edit distance, if there are rule-based mappings in the `data/scripts` folder, the CAT is updated accordingly (by adding 1 for each mapping). Finally, any replacement with a score < 0.1 is removed from the matrix.
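The actual procedure is implemented in `create_CAT.py`; the snippet below is only a simplified sketch of the idea under stated assumptions (character alignment is approximated with Python's `difflib.SequenceMatcher`, and the normalization details may differ from the paper's implementation):

```python
from collections import defaultdict
from difflib import SequenceMatcher

def build_cat(word_pairs, rule_mappings=(), threshold=0.1):
    """Build a character alignment matrix from (source_word, target_word) pairs."""
    counts = defaultdict(lambda: defaultdict(float))
    for src, tgt in word_pairs:
        # align the two spellings and count matched/substituted characters
        for op, i1, i2, j1, j2 in SequenceMatcher(None, src, tgt).get_opcodes():
            if op in ("equal", "replace"):
                for a, b in zip(src[i1:i2], tgt[j1:j2]):
                    counts[a][b] += 1
    # rule-based mappings (e.g. from data/scripts/*.tsv) each add 1 to their cell
    for a, b in rule_mappings:
        counts[a][b] += 1
    # normalize each row into probabilities and drop replacements below the threshold
    cat = {}
    for a, row in counts.items():
        total = sum(row.values())
        cat[a] = {b: c / total for b, c in row.items() if c / total >= threshold}
    return cat

# e.g. cat = build_cat([("بۆرج", "برج")])
```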
Due to their large size (2.39 GB as `.tar.gz`), the synthetic datasets are available on Google Drive: link. The real data in Central Kurdish written unconventionally in the Arabic and Persian scripts can be found at `data/real`.
If you are interested in this project and want to extend it, here are the steps to consider:

- Add your corpus to the `data/corpus` folder.
- Update the `code/config.json` file and specify the directories of your data and other required files.
- Run `extract_loanwords.py` to extract common words (the script should be optimized!).
- Add a script mapping in TSV format to the `data/scripts` folder.
- Run `create_CAT.py` to create the character alignment matrix.
- Run `synthesize.py` to generate synthetic data.
You can use any NMT training platform of your choice to train your models. In the paper, we use joeynmt, for which the configuration files are provided in the `training` folder. If using SLURM, you can also use the scripts in `training/SLURMs`.
Check out the following related projects too:
- Language identification for Perso-Arabic Scripts
- Finite-state script normalization and processing utilities
If you use any part of the data, please consider citing this paper as follows:
@inproceedings{ahmadi2023acl,
    title = "Script Normalization for Unconventional Writing of Under-Resourced Languages in Bilingual Communities",
    author = "Ahmadi, Sina and Anastasopoulos, Antonios",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics"
}