This project compares and evaluates monolingual paraphrasing of English, German, Czech, and Slovene sentences, as well as multilingual paraphrasing across these four languages. We generate monolingual datasets and assess them through human evaluation; comparing their scores with those of existing monolingual datasets gives an estimate of their quality. The models, in turn, are evaluated with the Parascore metric, which lets us analyse how effective each model is for each language. Together, these comparisons reveal the advantages and disadvantages of using a mono- or multilingual dataset and of training for multilingual sentence paraphrasing.
This repository does not collect any new data. Instead, we leverage existing resources: we use the ParaCrawl dataset, a large collection of parallel sentences in many language pairs, and apply machine translation models from huggingface to create paraphrase data from this translation data. While other multilingual parallel datasets do include sentence pairs within a language (i.e. paraphrases), they contain few, if any, such pairs for medium-resource languages like Slovene. With our approach we create similarly sized paraphrase datasets for several languages, including medium-resource ones, by leveraging translation data, which is much more widely available than paraphrase data.
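The core idea can be illustrated with a short sketch: given an aligned translation pair from ParaCrawl, translating the foreign side back into the source language produces a sentence that shares the original's meaning but usually differs in wording. The Helsinki-NLP model below is an illustrative assumption, not necessarily the exact translation model behind the released datasets.

```python
from transformers import pipeline

# Translate the German half of an aligned en-de pair back into English.
# The model choice is an assumption for this sketch.
translate_de_en = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")

# One aligned sentence pair as it would appear in ParaCrawl en-de.
en_sentence = "The committee approved the proposal without further changes."
de_sentence = "Der Ausschuss genehmigte den Vorschlag ohne weitere Änderungen."

# The back-translation and the original English sentence form an
# (en, en') paraphrase pair.
paraphrase = translate_de_en(de_sentence)[0]["translation_text"]
print({"sentence": en_sentence, "paraphrase": paraphrase})
```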
Our generated data can be accessed on huggingface:
- ParaCrawl-enen
- ParaCrawl-dede
- ParaCrawl-slsl
- ParaCrawl-cscs
- ParaCrawl-multi_all
- ParaCrawl-multi_small
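Each dataset can be loaded with the `datasets` library; the hub path below is a placeholder, so substitute the account the datasets are actually published under:

```python
from datasets import load_dataset

# "<hf-user>" is a placeholder for the actual huggingface account.
dataset = load_dataset("<hf-user>/ParaCrawl-enen")
print(dataset["train"][0])
```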
We evaluate the quality of our monolingual datasets via human evaluation of a dataset sample, in direct comparison to other popular paraphrase datasets. Annotators rate semantic similarity and lexical divergence, and we calculate a score based on their combination (one plausible combination rule is sketched after the table). The human evaluation results of the 4 generated monolingual datasets are shown in the following table:
Language | Our dataset | Tatoeba |
---|---|---|
en-en | 0.256 | 0.307 |
de-de | 0.291 | 0.588 |
sl-sl | 0.271 | 0.015 |
cs-cs | 0.189 | 0.210 |
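As a rough illustration of how the two ratings can be combined, assuming both are normalized to [0, 1], their product rewards pairs that both preserve meaning and reword it; the actual aggregation behind the table may differ:

```python
def combined_score(semantic_similarity: float, lexical_divergence: float) -> float:
    # High only when the pair keeps the meaning AND changes the wording.
    # The product rule is an assumption of this sketch.
    return semantic_similarity * lexical_divergence

print(combined_score(0.9, 0.3))  # meaning kept, but nearly a copy -> 0.27
print(combined_score(0.9, 0.8))  # meaning kept and well reworded -> 0.72
```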
We train 6 different mT5 models, one for each of the datasets we created. We refer to these as mono- and multilingual models, even though the base mT5 model is itself multilingual, because we fine-tune them on the generated mono- and multilingual datasets respectively.
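A minimal fine-tuning sketch with `google/mt5-small` is shown below; the column names and hyperparameters are assumptions, not the exact training setup of the released models:

```python
from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

# Placeholder hub path; the "sentence1"/"sentence2" column names are assumed.
dataset = load_dataset("<hf-user>/ParaCrawl-enen")

def preprocess(batch):
    model_inputs = tokenizer(batch["sentence1"], truncation=True, max_length=128)
    labels = tokenizer(text_target=batch["sentence2"], truncation=True, max_length=128)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="mt5-paraphrase",
        per_device_train_batch_size=8,
        num_train_epochs=1,
    ),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```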
Our trained models can be accessed on huggingface:
- MT5_small-enen
- MT5_small-dede
- MT5_small-slsl
- MT5_small-cscs
- MT5_small-multi_all
- MT5_small-multi_small
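Generating a paraphrase with one of the checkpoints looks roughly as follows (the hub path is again a placeholder; some mT5 fine-tunes also expect a task prefix, which is omitted here):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("<hf-user>/MT5_small-enen")
model = AutoModelForSeq2SeqLM.from_pretrained("<hf-user>/MT5_small-enen")

inputs = tokenizer("The weather was surprisingly good today.", return_tensors="pt")
outputs = model.generate(**inputs, num_beams=4, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```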
We use the Parascore metric to evaluate all models (a sketch of the metric's general shape follows the table). The Parascore evaluation results of the 4 monolingually trained models are shown in the following table:
Language | Parascore |
---|---|
en-en | 0.961 |
de-de | 0.925 |
sl-sl | 0.890 |
cs-cs | 0.922 |
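Parascore combines semantic similarity with a reward for lexical divergence. The sketch below follows that shape, using BERTScore for similarity and a normalized edit distance for divergence; the 0.05 weight and the exact diversity term are assumptions of this illustration, not the configuration of the official Parascore implementation used for the tables:

```python
import Levenshtein                           # pip install python-Levenshtein
from bert_score import score as bert_score   # pip install bert-score

def parascore_like(source: str, candidate: str, lang: str = "en") -> float:
    # Semantic similarity via BERTScore F1.
    _, _, f1 = bert_score([candidate], [source], lang=lang)
    sim = f1.item()
    # Lexical divergence as a normalized edit distance.
    div = Levenshtein.distance(source, candidate) / max(len(source), len(candidate))
    return sim + 0.05 * div  # the weight is an assumption of this sketch

print(parascore_like("The cat sat on the mat.", "A cat was sitting on the rug."))
```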
The Parascore evaluation results of the 2 multilingually trained models are shown in the following table, together with the average scores on each of the 4 language-specific parts of the test split of the multilingual dataset:
Test subset | Score multi-small | Score multi-all |
---|---|---|
whole test set | 0.925 | 0.925 |
English part | 0.938 | 0.939 |
German part | 0.926 | 0.925 |
Slovene part | 0.915 | 0.914 |
Czech part | 0.922 | 0.922 |
- Nikolay Vasilev
- Jannik Weiß (YAWNICK)
- Jan Jenicek (hjeni)