Dataset:
RussianNLP/rucola
Russian Corpus of Linguistic Acceptability (RuCoLA) is a novel benchmark of 13.4k sentences labeled as acceptable or not. RuCoLA combines in-domain sentences manually collected from linguistic literature and out-of-domain sentences produced by nine machine translation and paraphrase generation models. The motivation behind the out-of-domain set is to facilitate the practical use of acceptability judgments for improving language generation. Each unacceptable sentence is additionally labeled with four standard and machine-specific coarse-grained categories: morphology, syntax, semantics, and hallucinations.
Russian.
{ "id": 19, "sentence": "Люк останавливает удачу от этого.", "label": 0, "error_type": "Hallucination", "detailed_source": "WikiMatrix" }
The example in English for illustration purposes:
{ "id": 19, "sentence": "Luck stops luck from doing this.", "label": 0, "error_type": "Hallucination", "detailed_source": "WikiMatrix" }
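A record of this shape can be read with the Python standard library. The sketch below assumes that `label` 0 marks an unacceptable sentence and 1 an acceptable one (inferred from the example, which pairs `label` 0 with an error type). `RucolaRecord` and `parse_record` are hypothetical names used only for illustration.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class RucolaRecord:
    id: int
    sentence: str
    label: int                 # assumed: 0 = unacceptable, 1 = acceptable
    error_type: Optional[str]  # violation category, if present
    detailed_source: str

    @property
    def is_acceptable(self) -> bool:
        return self.label == 1


def parse_record(line: str) -> RucolaRecord:
    """Parse one JSON-encoded record into a RucolaRecord."""
    raw = json.loads(line)
    return RucolaRecord(
        id=raw["id"],
        sentence=raw["sentence"],
        label=raw["label"],
        error_type=raw.get("error_type"),
        detailed_source=raw["detailed_source"],
    )


example = (
    '{"id": 19, "sentence": "Люк останавливает удачу от этого.", '
    '"label": 0, "error_type": "Hallucination", "detailed_source": "WikiMatrix"}'
)
record = parse_record(example)
print(record.is_acceptable, record.error_type)
```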
RuCoLA consists of the training, development, and private test sets organised under two subsets: in-domain (linguistic publications) and out-of-domain (texts produced by natural language generation models).
Original source | Transliterated source | Source id |
---|---|---|
Проект корпусного описания русской грамматики | Proekt korpusnogo opisaniya russkoj grammatiki | Rusgram |
Тестелец, Я.Г., 2001. Введение в общий синтаксис. Федеральное государственное бюджетное образовательное учреждение высшего образования Российский государственный гуманитарный университет. | Yakov Testelets. 2001. Vvedeniye v obschiy sintaksis. Russian State University for the Humanities. | Testelets |
Лютикова, Е.А., 2010. К вопросу о категориальном статусе именных групп в русском языке. Вестник Московского университета. Серия 9. Филология, (6), pp.36-76. | Ekaterina Lutikova. 2010. K voprosu o kategorial’nom statuse imennykh grupp v russkom yazyke. Moscow University Philology Bulletin. | Lutikova |
Митренина, О.В., Романова, Е.Е. and Слюсарь, Н.А., 2017. Введение в генеративную грамматику. Общество с ограниченной ответственностью "Книжный дом ЛИБРОКОМ". | Olga Mitrenina et al. 2017. Vvedeniye v generativnuyu grammatiku. Limited Liability Company “LIBROCOM”. | Mitrenina |
Падучева, Е.В., 2004. Динамические модели в семантике лексики. М.: Языки славянской культуры. | Elena Paducheva. 2004. Dinamicheskiye modeli v semantike leksiki. Languages of Slavonic culture. | Paducheva2004 |
Падучева, Е.В., 2010. Семантические исследования: Семантика времени и вида в русском языке; Семантика нарратива. М.: Языки славянской культуры. | Elena Paducheva. 2010. Semanticheskiye issledovaniya: Semantika vremeni i vida v russkom yazyke; Semantika narrativa. Languages of Slavonic culture. | Paducheva2010 |
Падучева, Е.В., 2013. Русское отрицательное предложение. М.: Языки славянской культуры. | Elena Paducheva. 2013. Russkoye otritsatel’noye predlozheniye. Languages of Slavonic culture. | Paducheva2013 |
Селиверстова, О.Н., 2004. Труды по семантике. М.: Языки славянской культуры. | Olga Seliverstova. 2004. Trudy po semantike. Languages of Slavonic culture. | Seliverstova |
Набор данных ЕГЭ по русскому языку | Shavrina et al. 2020. Humans Keep It One Hundred: an Overview of AI Journey | USE5, USE7, USE8 |
Datasets
Original source | Source id |
---|---|
Mikel Artetxe and Holger Schwenk. 2019. Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond | Tatoeba |
Holger Schwenk et al. 2021. WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia | WikiMatrix |
Ye Qi et al. 2018. When and Why Are Pre-Trained Word Embeddings Useful for Neural Machine Translation? | TED |
Alexandra Antonova and Alexey Misyurev. 2011. Building a Web-Based Parallel Corpus and Filtering Out Machine-Translated Text | YandexCorpus |
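The source ids in the two tables can be used to tell in-domain from out-of-domain records. The sketch below assumes that the `detailed_source` field holds exactly the source ids listed above and that the first table covers the in-domain set while the second covers the out-of-domain set. Values in the released files may differ, so the function falls back to `"unknown"`. The name `domain_of` is hypothetical.

```python
# Source ids copied from the tables above; the mapping to subsets is an assumption.
IN_DOMAIN_SOURCES = {
    "Rusgram", "Testelets", "Lutikova", "Mitrenina",
    "Paducheva2004", "Paducheva2010", "Paducheva2013",
    "Seliverstova", "USE5", "USE7", "USE8",
}
OUT_OF_DOMAIN_SOURCES = {"Tatoeba", "WikiMatrix", "TED", "YandexCorpus"}


def domain_of(detailed_source: str) -> str:
    """Map a source id to 'in-domain', 'out-of-domain', or 'unknown'."""
    if detailed_source in IN_DOMAIN_SOURCES:
        return "in-domain"
    if detailed_source in OUT_OF_DOMAIN_SOURCES:
        return "out-of-domain"
    return "unknown"


print(domain_of("WikiMatrix"))
print(domain_of("Testelets"))
```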
Models
Paraphrase generation models:
The out-of-domain sentences undergo a two-stage annotation procedure on Toloka, a crowd-sourcing platform for data labeling. Each stage includes an unpaid training phase with explanations, control tasks for tracking annotation quality, and the main annotation task. Before starting, the worker is given detailed instructions describing the task, explaining the labels, and showing plenty of examples. The instructions are available at any time during both the training and main annotation phases. To get access to the main phase, the worker must first complete the training phase by labeling more than 70% of its examples correctly. Each trained worker receives a page with five sentences, one of which is a control sentence. We collect the majority vote labels via a dynamic overlap of three to five workers after filtering them by response time and performance on control tasks.
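Majority voting with a dynamic overlap of three to five workers could look roughly like the sketch below. The stopping rule is an assumption, since the text does not specify one: here, voting stops once at least three votes are in and they are unanimous, and otherwise continues up to five votes. The function name `aggregate_with_dynamic_overlap` is hypothetical, and the votes are assumed to come from workers who already passed the response-time and control-task filters.

```python
from collections import Counter
from typing import Iterable, Optional


def aggregate_with_dynamic_overlap(
    votes: Iterable[int], min_overlap: int = 3, max_overlap: int = 5
) -> Optional[int]:
    """Return the majority label, consuming extra votes only on disagreement.

    Returns None when fewer than `min_overlap` votes exist or the vote is tied.
    """
    collected = []
    for vote in votes:
        collected.append(vote)
        enough = len(collected) >= min_overlap
        unanimous = len(set(collected)) == 1
        # Assumed rule: stop early on agreement, or when the overlap cap is hit.
        if (enough and unanimous) or len(collected) == max_overlap:
            break
    if len(collected) < min_overlap:
        return None
    ranked = Counter(collected).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie, no majority label
    return ranked[0][0]


print(aggregate_with_dynamic_overlap([1, 1, 1, 0, 0]))
print(aggregate_with_dynamic_overlap([1, 0, 1, 0, 0]))
```

In the first call the three initial votes agree, so the last two are never consulted. In the second call all five votes are needed to settle the label.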
Stage 1: Acceptability Judgments. The first annotation stage determines whether a given sentence is acceptable or not. Access to the project is granted to workers certified by Toloka as native speakers of Russian and ranked among the top 60% of workers according to the Toloka rating system. Each worker answers 30 examples in the training phase. Each training example is accompanied by an explanation that appears when the worker answers incorrectly. The main annotation phase comprises 3.6k machine-generated sentences. The pay rate is on average $2.55/hr, which is twice the hourly minimum wage in Russia. Each of the 1.3k trained workers gets paid, but we keep votes only from the 960 workers whose annotation quality rate on the control sentences is more than 50%.
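The two quality thresholds described above (more than 70% correct in training, more than 50% correct on control sentences) can be expressed as a simple filter. This is a minimal sketch: the `WorkerStats` fields and the helper names are hypothetical and do not reflect Toloka's actual data model. Integer arithmetic is used so that "more than" stays strict without floating-point edge cases.

```python
from dataclasses import dataclass

TRAINING_PASS_PERCENT = 70  # must label MORE than 70% of training examples correctly
CONTROL_PASS_PERCENT = 50   # votes kept only if control accuracy is MORE than 50%


@dataclass
class WorkerStats:
    worker_id: str
    training_correct: int
    training_total: int
    control_correct: int
    control_total: int


def passed_training(w: WorkerStats) -> bool:
    """True if the worker may enter the main annotation phase."""
    return w.training_correct * 100 > w.training_total * TRAINING_PASS_PERCENT


def votes_kept(w: WorkerStats) -> bool:
    """True if the worker's votes would be kept for aggregation."""
    return passed_training(w) and (
        w.control_correct * 100 > w.control_total * CONTROL_PASS_PERCENT
    )


workers = [
    WorkerStats("a", 22, 30, 6, 10),   # passes both thresholds
    WorkerStats("b", 21, 30, 10, 10),  # exactly 70% in training: not enough
    WorkerStats("c", 25, 30, 5, 10),   # exactly 50% on controls: votes dropped
]
kept_ids = [w.worker_id for w in workers if votes_kept(w)]
print(kept_ids)
```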
Stage 2: Violation Categories. The second stage includes validation and annotation of sentences labeled unacceptable in Stage 1 according to five answer options: “Morphology”, “Syntax”, “Semantics”, “Hallucinations” and “Other”. The task is framed as multi-label classification, i.e., in some rare cases the sentence may contain more than one violation or be re-labeled as acceptable. We create a team of 30 annotators who are BA and MA students in philology and linguistics from several Russian universities. The students are asked to study the works on CoLA, TGEA, and hallucinations. We also hold an online seminar to discuss the works and clarify the task specifics. Each student undergoes platform-based training on 15 examples before moving on to the main phase of 1.3k sentences. The students are paid on average $5.42/hr and are eligible to get credits for an academic course or an internship. This stage provides direct interaction between authors and students in a group chat. We keep submissions with more than 30 seconds of response time per page and collect the majority vote labels for each answer independently. Sentences having more than one violation category or labeled as “Other” by the majority are filtered out.
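The Stage 2 aggregation can be sketched as follows: drop fast submissions, take a majority vote for each answer option independently, then discard sentences that end up with several categories or with “Other”. The function name `aggregate_violation` and the input format (pairs of response time and selected options) are hypothetical. How sentences with no majority category, or those re-labeled as acceptable, are handled is not stated in the text, so this sketch simply returns `None` for them.

```python
from typing import Iterable, Optional, Set, Tuple

CATEGORIES = ("Morphology", "Syntax", "Semantics", "Hallucinations", "Other")
MIN_RESPONSE_SECONDS = 30  # submissions must take MORE than 30 seconds per page


def aggregate_violation(
    submissions: Iterable[Tuple[float, Set[str]]]
) -> Optional[str]:
    """Return the single majority category, or None if the sentence is dropped.

    Each submission is (response time in seconds, set of selected options).
    """
    kept = [labels for seconds, labels in submissions if seconds > MIN_RESPONSE_SECONDS]
    if not kept:
        return None
    # Majority vote per option, independently of the other options.
    majority = [
        c for c in CATEGORIES
        if 2 * sum(c in labels for labels in kept) > len(kept)
    ]
    # Filter out multi-category sentences, "Other", and (assumed) no-majority cases.
    if len(majority) != 1 or majority[0] == "Other":
        return None
    return majority[0]


print(aggregate_violation([(45, {"Syntax"}), (60, {"Syntax", "Semantics"}), (50, {"Syntax"})]))
```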
The annotators are warned about potentially sensitive topics in data (e.g., politics, culture, and religion).
RuCoLA may serve as training data for acceptability classifiers, which may benefit the quality of generated texts. We recognize that such improvements in text generation may lead to misuse of LMs for malicious purposes. However, our corpus can be used to train adversarial defense and artificial text detection models. We introduce a novel dataset for research and development needs, and the potential negative uses are not lost on us.
Although we aim to control the number of high-frequency tokens in RuCoLA’s sentences, we assume that a potential shift in word frequency distribution between LMs’ pretraining corpora and our corpus can introduce bias in the evaluation. Furthermore, linguistic publications represent a specific domain as the primary source of acceptability judgments. On the one hand, this can lead to a domain shift when using RuCoLA for practical purposes. On the other hand, we observe moderate acceptability classification performance on the out-of-domain test set, which spans multiple domains, ranging from subtitles to Wikipedia.
Data Collection. Acceptability judgment datasets require a source of unacceptable sentences. Collecting judgments from linguistic literature has become a standard practice replicated in multiple languages. However, this approach has several limitations. First, many studies raise concerns about the reliability and reproducibility of acceptability judgments. Second, the linguists’ judgments may limit data representativeness, as they may not reflect the errors that speakers tend to produce. Third, enriching acceptability judgment datasets is time-consuming, while creating new ones can be challenging due to limited resources, e.g., in low-resource languages.
Expert vs. Non-expert. One of the open methodological questions on acceptability judgments is whether they should be collected from expert or non-expert speakers. On the one hand, prior linguistic knowledge can introduce bias in reporting judgments. On the other hand, expertise may raise the quality of linguists’ judgments above that of non-linguists’ judgments, which tend to be influenced by an individual’s exposure to ungrammatical language use. The objective of involving students with a linguistic background is to maximize the annotation quality.
Fine-grained Annotation. The coarse-grained annotation scheme for RuCoLA’s unacceptable sentences relies on four major categories. While this annotation can be helpful for model error analysis, its coarse granularity limits the scope of LMs’ diagnostic evaluation concerning linguistic and machine-specific phenomena.
Correspondence: vmikhailovhse@gmail.com
Our baseline code and acceptability labels are available under the Apache 2.0 license. The copyright (where applicable) of texts from the linguistic publications and resources remains with the original authors or publishers.
@inproceedings{mikhailov-etal-2022-rucola,
    title = "{R}u{C}o{LA}: {R}ussian Corpus of Linguistic Acceptability",
    author = "Mikhailov, Vladislav and Shamardina, Tatiana and Ryabinin, Max and Pestova, Alena and Smurov, Ivan and Artemova, Ekaterina",
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.emnlp-main.348",
    pages = "5207--5227",
    abstract = "Linguistic acceptability (LA) attracts the attention of the research community due to its many uses, such as testing the grammatical knowledge of language models and filtering implausible texts with acceptability classifiers. However, the application scope of LA in languages other than English is limited due to the lack of high-quality resources. To this end, we introduce the Russian Corpus of Linguistic Acceptability (RuCoLA), built from the ground up under the well-established binary LA approach. RuCoLA consists of 9.8k in-domain sentences from linguistic publications and 3.6k out-of-domain sentences produced by generative models. The out-of-domain set is created to facilitate the practical use of acceptability for improving language generation. Our paper describes the data collection protocol and presents a fine-grained analysis of acceptability classification experiments with a range of baseline approaches. In particular, we demonstrate that the most widely used language models still fall behind humans by a large margin, especially when detecting morphological and semantic errors. We release RuCoLA, the code of experiments, and a public leaderboard to assess the linguistic competence of language models for Russian.",
}
Please refer to our paper for more details.