RuCoLA

RuCoLA logo

The Russian Corpus of Linguistic Acceptability (RuCoLA) is a dataset consisting of Russian language sentences with their binary acceptability judgements. It includes expert-written sentences from linguistic publications and machine-generated examples.

The corpus covers a variety of language phenomena, ranging from syntax and semantics to generative model hallucinations. We release RuCoLA to facilitate the development of methods for identifying errors in natural language and create a public leaderboard to track the progress on this problem.

The dataset is available on HuggingFace and in the data/ folder of the repository.
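For example, here is a minimal sketch of loading the corpus with the 🤗 Datasets library; the dataset identifier and column names below are assumptions, so check the dataset card on the Hub for the exact ones:

```python
from datasets import load_dataset

# Assumed Hub identifier for RuCoLA; see the dataset card for the exact name.
rucola = load_dataset("RussianNLP/rucola")

# Each example pairs a sentence with a binary acceptability label
# (assumed column names: "sentence" and "acceptable").
example = rucola["train"][0]
print(example["sentence"], example["acceptable"])
```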

Read more about RuCoLA in the corresponding blog post (in Russian).

Sources

Linguistic publications and resources
| Original source | Transliterated source | Source id |
|---|---|---|
| Проект корпусного описания русской грамматики | Proekt korpusnogo opisaniya russkoj grammatiki | Rusgram |
| Тестелец, Я.Г., 2001. Введение в общий синтаксис. Федеральное государственное бюджетное образовательное учреждение высшего образования Российский государственный гуманитарный университет. | Yakov Testelets. 2001. Vvedeniye v obschiy sintaksis. Russian State University for the Humanities. | Testelets |
| Лютикова, Е.А., 2010. К вопросу о категориальном статусе именных групп в русском языке. Вестник Московского университета. Серия 9. Филология, (6), pp. 36–76. | Ekaterina Lutikova. 2010. K voprosu o kategorial’nom statuse imennykh grup v russkom yazyke. Moscow University Philology Bulletin. | Lutikova |
| Митренина, О.В., Романова, Е.Е. and Слюсарь, Н.А., 2017. Введение в генеративную грамматику. Общество с ограниченной ответственностью "Книжный дом ЛИБРОКОМ". | Olga Mitrenina et al. 2017. Vvedeniye v generativnuyu grammatiku. Limited Liability Company “LIBROCOM”. | Mitrenina |
| Падучева, Е.В., 2004. Динамические модели в семантике лексики. М.: Языки славянской культуры. | Elena Paducheva. 2004. Dinamicheskiye modeli v semantike leksiki. Languages of Slavonic culture. | Paducheva2004 |
| Падучева, Е.В., 2010. Семантические исследования: Семантика времени и вида в русском языке; Семантика нарратива. М.: Языки славянской культуры. | Elena Paducheva. 2010. Semanticheskiye issledovaniya: Semantika vremeni i vida v russkom yazyke; Semantika narrativa. Languages of Slavonic culture. | Paducheva2010 |
| Падучева, Е.В., 2013. Русское отрицательное предложение. М.: Языки славянской культуры. | Elena Paducheva. 2013. Russkoye otritsatel’noye predlozheniye. Languages of Slavonic culture. | Paducheva2013 |
| Селиверстова, О.Н., 2004. Труды по семантике. М.: Языки славянской культуры. | Olga Seliverstova. 2004. Trudy po semantike. Languages of Slavonic culture. | Seliverstova |
| Набор данных ЕГЭ по русскому языку (the Unified State Exam in Russian dataset) | Shavrina et al. 2020. Humans Keep It One Hundred: an Overview of AI Journey | USE5, USE7, USE8 |
Machine-generated sentences

Datasets

| Original source | Source id |
|---|---|
| Mikel Artetxe and Holger Schwenk. 2019. Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond | Tatoeba |
| Holger Schwenk et al. 2021. WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia | WikiMatrix |
| Ye Qi et al. 2018. When and Why Are Pre-Trained Word Embeddings Useful for Neural Machine Translation? | TED |
| Alexandra Antonova and Alexey Misyurev. 2011. Building a Web-Based Parallel Corpus and Filtering Out Machine-Translated Text | YandexCorpus |

Models

EasyNMT models:

  1. OPUS-MT. Jörg Tiedemann and Santhosh Thottingal. 2020. OPUS-MT – Building open translation services for the World
  2. M-BART50. Yuqing Tang et al. 2020. Multilingual Translation with Extensible Multilingual Pretraining and Finetuning
  3. M2M-100. Angela Fan et al. 2021. Beyond English-Centric Multilingual Machine Translation

Paraphrase generation models:

  1. ruGPT2-Large
  2. ruT5
  3. mT5. Linting Xue et al. 2021. mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer
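As an illustration of how such models can yield unacceptable sentences, here is a hedged sketch of round-trip translation with the EasyNMT library; this is a hypothetical usage, not the project's actual generation pipeline:

```python
from easynmt import EasyNMT

# Load one of the EasyNMT models listed above.
model = EasyNMT("opus-mt")

sentence = "Мама мыла раму."

# Round-trip translation (Russian -> English -> Russian) can introduce
# the kinds of errors found in machine-generated examples.
english = model.translate(sentence, source_lang="ru", target_lang="en")
back = model.translate(english, source_lang="en", target_lang="ru")
print(back)
```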

Baselines

Requirements

To set up the environment, you will need to install recent versions of PyTorch, scikit-learn, and Transformers. The requirements.txt file contains the list of top-level packages used to train all models below.
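For example, after installing a PyTorch build suitable for your platform, the remaining dependencies can typically be installed with `pip install -r requirements.txt`.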

Models

Our code supports acceptability classification evaluation for scikit-learn models (a majority-vote baseline and logistic regression) and for Hugging Face 🤗 Transformers models, including masked language models such as ruBERT and XLM-R as well as ruT5.

Running experiments

To train acceptability classifiers on RuCoLA, run the following commands.

  • scikit-learn models: `python baselines/train_sklearn_baselines.py -m {majority, logreg}`
  • Masked language model (e.g., ruBERT or XLM-R) finetuning: `python baselines/finetune_mlm.py -m [MODEL_NAME]`
  • ruT5 finetuning: `python baselines/finetune_t5.py -m [MODEL_NAME]`

Afterwards, you can get test set predictions in the format required by the leaderboard for all trained models by running `python baselines/get_csv_predictions.py -m MODEL1 MODEL2 ...`.
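For illustration, here is a minimal sketch in the spirit of the logreg baseline above; the file path, column names, and featurization are assumptions rather than the repository's actual code:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Assumed file path and column names; see the data/ folder for the actual layout.
df = pd.read_csv("data/in_domain_train.csv")
texts, labels = df["sentence"], df["acceptable"]

train_x, dev_x, train_y, dev_y = train_test_split(
    texts, labels, test_size=0.2, random_state=0
)

# Character n-grams are a reasonable choice for a morphologically
# rich language such as Russian.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 4)),
    LogisticRegression(max_iter=1000),
)
clf.fit(train_x, train_y)

# MCC is a standard metric for acceptability classification (as in CoLA).
print("dev MCC:", matthews_corrcoef(dev_y, clf.predict(dev_x)))
```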

Get finetuned models

If you want to use an acceptability classifier trained on RuCoLA in your applications without training it yourself, we provide trained checkpoints on the Hugging Face Hub.
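For example, here is a hedged sketch of running such a checkpoint with the Transformers pipeline API; the model identifier below is a placeholder, so substitute an actual RuCoLA checkpoint from the Hub:

```python
from transformers import pipeline

# Placeholder checkpoint identifier; replace it with one of the
# RuCoLA checkpoints published on the Hugging Face Hub.
classifier = pipeline("text-classification", model="RussianNLP/rucola-checkpoint")

# Returns a label and a confidence score for the input sentence.
print(classifier("Мама мыла раму."))
```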

Contact us

For any questions about RuCoLA or this website, write to [email protected] or create an issue in this repository.

You can also join the official Telegram chat to discuss your ideas with others and to follow the latest updates of the project.

Cite

Our paper was accepted to the EMNLP 2022 main conference.

If you use RuCoLA in your research, please cite the dataset using the following entry:

@inproceedings{mikhailov-etal-2022-rucola,
    title = "{R}u{C}o{LA}: {R}ussian Corpus of Linguistic Acceptability",
    author = "Mikhailov, Vladislav  and
      Shamardina, Tatiana  and
      Ryabinin, Max  and
      Pestova, Alena  and
      Smurov, Ivan  and
      Artemova, Ekaterina",
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.emnlp-main.348",
    pages = "5207--5227",
    abstract = "Linguistic acceptability (LA) attracts the attention of the research community due to its many uses, such as testing the grammatical knowledge of language models and filtering implausible texts with acceptability classifiers. However, the application scope of LA in languages other than English is limited due to the lack of high-quality resources. To this end, we introduce the Russian Corpus of Linguistic Acceptability (RuCoLA), built from the ground up under the well-established binary LA approach. RuCoLA consists of 9.8k in-domain sentences from linguistic publications and 3.6k out-of-domain sentences produced by generative models. The out-of-domain set is created to facilitate the practical use of acceptability for improving language generation. Our paper describes the data collection protocol and presents a fine-grained analysis of acceptability classification experiments with a range of baseline approaches. In particular, we demonstrate that the most widely used language models still fall behind humans by a large margin, especially when detecting morphological and semantic errors. We release RuCoLA, the code of experiments, and a public leaderboard to assess the linguistic competence of language models for Russian.",
}

License

Our baseline code and acceptability labels are available under the Apache 2.0 license. The copyright (where applicable) of texts from the linguistic publications and resources remains with the original authors or publishers.