Dataset:

RussianNLP/rucola

Dataset Card for RuCoLA

Dataset Summary

Russian Corpus of Linguistic Acceptability (RuCoLA) is a novel benchmark of 13.4k sentences labeled as acceptable or not. RuCoLA combines in-domain sentences manually collected from linguistic literature and out-of-domain sentences produced by nine machine translation and paraphrase generation models. The motivation behind the out-of-domain set is to facilitate the practical use of acceptability judgments for improving language generation. Each unacceptable sentence is additionally labeled with one of four coarse-grained categories, both standard and machine-specific: morphology, syntax, semantics, and hallucinations.

Dataset Structure

Supported Tasks and Leaderboards

Languages

Russian.

Data Instances

{
  "id": 19,
  "sentence": "Люк останавливает удачу от этого.",
  "label": 0,
  "error_type": "Hallucination",
  "detailed_source": "WikiMatrix"
}

The example in English for illustration purposes:

{
  "id": 19,
  "sentence": "Luck stops luck from doing this.",
  "label": 0,
  "error_type": "Hallucination",
  "detailed_source": "WikiMatrix"
}

Data Fields

  • id (int64) : the sentence's id.
  • sentence (str) : the sentence.
  • label (str) : the target class. "1" refers to "acceptable", while "0" corresponds to "unacceptable".
  • error_type (str) : the coarse-grained violation category (Morphology, Syntax, Semantics, or Hallucination); "0" if the sentence is acceptable.
  • detailed_source : the data source.
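
The field list above can be mirrored in a small typed record. The following is a minimal sketch and not part of the RuCoLA code base: `RucolaRecord` and `parse_record` are hypothetical names, and the consistency checks are assumptions drawn from the field descriptions. Because the card lists `label` as a string while the sample instance shows an integer, the sketch accepts both and normalizes to `int`.

```python
from dataclasses import dataclass
from typing import Any, Mapping

# Category names as they appear in the sample instance and field list;
# that these are the exact strings stored in the files is an assumption.
ERROR_TYPES = {"Morphology", "Syntax", "Semantics", "Hallucination"}


@dataclass(frozen=True)
class RucolaRecord:  # hypothetical helper type
    id: int
    sentence: str
    label: int            # 1 = acceptable, 0 = unacceptable
    error_type: str       # "0" when the sentence is acceptable
    detailed_source: str


def parse_record(raw: Mapping[str, Any]) -> RucolaRecord:
    """Convert one raw row into a RucolaRecord, checking label/error_type agreement."""
    label = int(raw["label"])  # accepts 0, 1, "0" and "1"
    if label not in (0, 1):
        raise ValueError(f"unexpected label: {raw['label']!r}")
    error_type = str(raw["error_type"])
    if label == 1 and error_type != "0":
        raise ValueError("acceptable sentences should carry error_type '0'")
    if label == 0 and error_type not in ERROR_TYPES:
        raise ValueError(f"unknown error_type: {error_type!r}")
    return RucolaRecord(
        id=int(raw["id"]),
        sentence=str(raw["sentence"]),
        label=label,
        error_type=error_type,
        detailed_source=str(raw["detailed_source"]),
    )


# The sample instance from the card.
sample = {
    "id": 19,
    "sentence": "Люк останавливает удачу от этого.",
    "label": 0,
    "error_type": "Hallucination",
    "detailed_source": "WikiMatrix",
}
record = parse_record(sample)
```

Normalizing `label` at the boundary keeps downstream code independent of whether a given loader yields strings or integers.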

Data Splits

RuCoLA consists of the training, development, and private test sets organised under two subsets: in-domain (linguistic publications) and out-of-domain (texts produced by natural language generation models).

  • train : 7869 in-domain samples ( "data/in_domain_train.csv" ).
  • validation : 2787 in-domain and out-of-domain samples. The in-domain ( "data/in_domain_dev.csv" ) and out-of-domain ( "data/out_of_domain_dev.csv" ) validation sets are merged into "data/dev.csv" for convenience.
  • test : 2789 in-domain and out-of-domain samples ( "data/test.csv" ).
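
As a rough illustration of working with the split files, the sketch below reads a split with Python's standard `csv` module and tallies labels. The file paths and split sizes are taken from the list above; the assumption that each CSV starts with a header row naming the fields from "Data Fields" is ours and is not confirmed by the card. The helper names `read_split` and `label_counts` are hypothetical.

```python
import csv
import io
from collections import Counter
from typing import Dict, List, TextIO

# Split sizes as stated in the card.
SPLIT_SIZES = {"train": 7869, "validation": 2787, "test": 2789}

# File paths as listed in the card.
SPLIT_FILES = {
    "train": "data/in_domain_train.csv",
    "validation": "data/dev.csv",
    "test": "data/test.csv",
}


def read_split(handle: TextIO) -> List[Dict[str, str]]:
    """Read one split; assumes a header row naming the card's fields."""
    return list(csv.DictReader(handle))


def label_counts(rows: List[Dict[str, str]]) -> Counter:
    """Count rows per label value (values are strings as read from CSV)."""
    return Counter(row["label"] for row in rows)


# The three splits add up to the 13.4k sentences quoted in the summary.
total = sum(SPLIT_SIZES.values())

# In-memory stand-in for a real file: assumed header plus the card's sample row.
demo = io.StringIO(
    "id,sentence,label,error_type,detailed_source\n"
    "19,Люк останавливает удачу от этого.,0,Hallucination,WikiMatrix\n"
)
rows = read_split(demo)
```

With a local copy of the repository, passing `open(SPLIT_FILES["train"], encoding="utf-8", newline="")` to `read_split` would return the training rows, provided the header assumption holds.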

Dataset Creation

Curation Rationale

  • In-domain Subset: The in-domain sentences and the corresponding authors’ acceptability judgments are manually drawn from fundamental linguistic textbooks, academic publications, and methodological materials.
  • Out-of-domain Subset: The out-of-domain sentences are produced by nine open-source MT and paraphrase generation models.

Source Data

Linguistic publications and resources

Original source | Transliterated source | Source id
Проект корпусного описания русской грамматики | Proekt korpusnogo opisaniya russkoj grammatiki | Rusgram
Тестелец, Я.Г., 2001. Введение в общий синтаксис. Федеральное государственное бюджетное образовательное учреждение высшего образования Российский государственный гуманитарный университет. | Yakov Testelets. 2001. Vvedeniye v obschiy sintaksis. Russian State University for the Humanities. | Testelets
Лютикова, Е.А., 2010. К вопросу о категориальном статусе именных групп в русском языке. Вестник Московского университета. Серия 9. Филология, (6), pp.36-76. | Ekaterina Lutikova. 2010. K voprosu o kategorial’nom statuse imennykh grup v russkom yazyke. Moscow University Philology Bulletin. | Lutikova
Митренина, О.В., Романова, Е.Е. and Слюсарь, Н.А., 2017. Введение в генеративную грамматику. Общество с ограниченной ответственностью "Книжный дом ЛИБРОКОМ". | Olga Mitrenina et al. 2017. Vvedeniye v generativnuyu grammatiku. Limited Liability Company “LIBROCOM”. | Mitrenina
Падучева, Е.В., 2004. Динамические модели в семантике лексики. М.: Языки славянской культуры. | Elena Paducheva. 2004. Dinamicheskiye modeli v semantike leksiki. Languages of Slavonic culture. | Paducheva2004
Падучева, Е.В., 2010. Семантические исследования: Семантика времени и вида в русском языке; Семантика нарратива. М.: Языки славянской культуры. | Elena Paducheva. 2010. Semanticheskiye issledovaniya: Semantika vremeni i vida v russkom yazyke; Semantika narrativa. Languages of Slavonic culture. | Paducheva2010
Падучева, Е.В., 2013. Русское отрицательное предложение. М.: Языки славянской культуры. | Elena Paducheva. 2013. Russkoye otritsatel’noye predlozheniye. Languages of Slavonic culture. | Paducheva2013
Селиверстова, О.Н., 2004. Труды по семантике. М.: Языки славянской культуры. | Olga Seliverstova. 2004. Trudy po semantike. Languages of Slavonic culture. | Seliverstova
Набор данных ЕГЭ по русскому языку | Shavrina et al. 2020. Humans Keep It One Hundred: an Overview of AI Journey | USE5, USE7, USE8

Machine-generated sentences

Datasets

Original source | Source id
Mikel Artetxe and Holger Schwenk. 2019. Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond | Tatoeba
Holger Schwenk et al. 2021. WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia | WikiMatrix
Ye Qi et al. 2018. When and Why Are Pre-Trained Word Embeddings Useful for Neural Machine Translation? | TED
Alexandra Antonova and Alexey Misyurev. 2011. Building a Web-Based Parallel Corpus and Filtering Out Machine-Translated Text | YandexCorpus

Models

EasyNMT models:

  • OPUS-MT. Jörg Tiedemann and Santhosh Thottingal. 2020. OPUS-MT – Building open translation services for the World
  • M-BART50. Yuqing Tang et al. 2020. Multilingual Translation with Extensible Multilingual Pretraining and Finetuning
  • M2M-100. Angela Fan et al. 2021. Beyond English-Centric Multilingual Machine Translation

Paraphrase generation models:

  • ruGPT2-Large
  • ruT5
  • mT5. Linting Xue et al. 2021. mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer

Annotations

    Annotation process

    The out-of-domain sentences undergo a two-stage annotation procedure on Toloka, a crowd-sourcing platform for data labeling. Each stage includes an unpaid training phase with explanations, control tasks for tracking annotation quality, and the main annotation task. Before starting, the worker is given detailed instructions describing the task, explaining the labels, and showing plenty of examples. The instructions are available at any time during both the training and main annotation phases. To get access to the main phase, the worker should first complete the training phase by labeling more than 70% of its examples correctly. Each trained worker receives a page with five sentences, one of which is a control one. We collect the majority vote labels via a dynamic overlap from three to five workers after filtering them by response time and performance on control tasks.
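
The vote collection described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the card gives the overlap range (three to five workers) and the training threshold (more than 70%), but not the rule that decides when to stop requesting votes. The `AGREEMENT` threshold and both function names are hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence

MIN_OVERLAP, MAX_OVERLAP = 3, 5   # overlap range stated in the card
TRAINING_PASS_PERCENT = 70        # "more than 70%" of training examples
AGREEMENT = 0.75                  # hypothetical stopping threshold, not from the card


def passed_training(correct: int, total: int) -> bool:
    """True when more than 70% of the training examples were labeled correctly."""
    return total > 0 and correct * 100 > total * TRAINING_PASS_PERCENT


def aggregate_dynamic(votes: Sequence[int]) -> Optional[int]:
    """Majority label from three to five binary votes taken in arrival order.

    Returns None while more votes are needed. Votes are assumed to be
    already filtered by response time and control-task performance.
    """
    for n in range(MIN_OVERLAP, MAX_OVERLAP + 1):
        if len(votes) < n:
            return None  # another worker has to be asked
        label, top = Counter(votes[:n]).most_common(1)[0]
        if top / n >= AGREEMENT:
            return label  # enough agreement, stop early
    # Overlap exhausted: five binary votes always have a strict majority.
    label, _ = Counter(votes[:MAX_OVERLAP]).most_common(1)[0]
    return label
```

For instance, three identical votes settle a sentence immediately, while a 2:1 split asks for a fourth worker, and a sentence that is still contested after five votes falls back to a plain majority.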

    • Stage 1: Acceptability Judgments. The first annotation stage determines whether a given sentence is acceptable. Access to the project is granted to workers who are certified by Toloka as native speakers of Russian and ranked in the top 60% of workers by the Toloka rating system. Each worker answers 30 examples in the training phase. Each training example is accompanied by an explanation that appears when the worker answers incorrectly. The main annotation phase covers 3.6k machine-generated sentences. The pay rate is on average $2.55/hr, which is twice the hourly minimum wage in Russia. Each of the 1.3k trained workers gets paid, but we keep votes only from the 960 workers whose annotation quality rate on the control sentences is more than 50%.

    • Stage 2: Violation Categories. The second stage includes validation and annotation of the sentences labeled unacceptable at Stage 1 according to five answer options: “Morphology”, “Syntax”, “Semantics”, “Hallucinations”, and “Other”. The task is framed as multi-label classification, i.e., in some rare cases a sentence may contain more than one violation or be re-labeled as acceptable. We create a team of 30 annotators who are BA and MA students in philology and linguistics from several Russian universities. The students are asked to study the works on CoLA, TGEA, and hallucinations. We also hold an online seminar to discuss the works and clarify the task specifics. Each student undergoes platform-based training on 15 examples before moving on to the main phase of 1.3k sentences. The students are paid on average $5.42/hr and are eligible to get credits for an academic course or an internship. This stage provides direct interaction between authors and students in a group chat. We keep submissions with more than 30 seconds of response time per page and collect the majority vote labels for each answer independently. Sentences having more than one violation category or labeled as “Other” by the majority are filtered out.
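
The Stage 2 aggregation (a response-time filter, an independent majority vote per answer option, then removal of multi-category and “Other” sentences) can be sketched as below. The `Submission` container and `aggregate_categories` are hypothetical names. How sentences re-labeled as acceptable are recorded is not described in the card, so the sketch simply returns `None` whenever no single category survives.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Sequence

# The five answer options named in the card.
CATEGORIES = ("Morphology", "Syntax", "Semantics", "Hallucinations", "Other")
MIN_SECONDS_PER_PAGE = 30  # submissions at or below this time are dropped


@dataclass(frozen=True)
class Submission:  # hypothetical container for one annotator's answer
    selected: FrozenSet[str]   # ticked options; may hold several (multi-label)
    seconds_on_page: float


def aggregate_categories(submissions: Sequence[Submission]) -> Optional[str]:
    """Return the single majority category, or None if the sentence is filtered out."""
    kept = [s for s in submissions if s.seconds_on_page > MIN_SECONDS_PER_PAGE]
    if not kept:
        return None
    # Majority vote for each answer option independently.
    winners = [
        c for c in CATEGORIES
        if sum(c in s.selected for s in kept) * 2 > len(kept)
    ]
    # Drop sentences with several majority categories, none, or "Other".
    if len(winners) != 1 or winners[0] == "Other":
        return None
    return winners[0]


demo = [
    Submission(frozenset({"Syntax"}), 45),
    Submission(frozenset({"Syntax", "Semantics"}), 60),
    Submission(frozenset({"Syntax"}), 40),
]
result = aggregate_categories(demo)
```

Voting on each option separately means a minority annotator who ticks an extra category does not block the majority category from being assigned.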

    Personal and Sensitive Information

    The annotators are warned about potentially sensitive topics in data (e.g., politics, culture, and religion).

    Considerations for Using the Data

    Social Impact of Dataset

    RuCoLA may serve as training data for acceptability classifiers, which may benefit the quality of generated texts. We recognize that such improvements in text generation may lead to misuse of LMs for malicious purposes. However, our corpus can be used to train adversarial defense and artificial text detection models. We introduce a novel dataset for research and development needs, and the potential negative uses are not lost on us.

    Discussion of Biases

    Although we aim to control the number of high-frequency tokens in RuCoLA’s sentences, we assume that a potential shift in word frequency distribution between LMs’ pretraining corpora and our corpus can introduce bias in the evaluation. Furthermore, linguistic publications represent a specific domain as the primary source of acceptability judgments. On the one hand, it can lead to a domain shift when using RuCoLA for practical purposes. On the other hand, we observe moderate acceptability classification performance on the out-of-domain test, which spans multiple domains, ranging from subtitles to Wikipedia.

    Other Known Limitations

    • Data Collection: Acceptability judgment datasets require a source of unacceptable sentences. Collecting judgments from linguistic literature has become a standard practice replicated in multiple languages. However, this approach has several limitations. First, many studies raise concerns about the reliability and reproducibility of acceptability judgments. Second, the linguists’ judgments may limit data representativeness, as they may not reflect the errors that speakers tend to produce. Third, enriching acceptability judgment datasets is time-consuming, while creating new ones can be challenging due to limited resources, e.g., in low-resource languages.

    • Expert vs. Non-expert: One of the open methodological questions on acceptability judgments is whether they should be collected from expert or non-expert speakers. On the one hand, prior linguistic knowledge can introduce bias in reporting judgments. On the other hand, expertise may increase the quality of the linguists’ judgments over those of non-linguists. At the same time, the latter tend to be influenced by an individual’s exposure to ungrammatical language use. The objective of involving students with a linguistic background is to maximize the annotation quality.

    • Fine-grained Annotation: The coarse-grained annotation scheme of RuCoLA’s unacceptable sentences relies on four major categories. While the annotation can be helpful for model error analysis, it limits the scope of LMs’ diagnostic evaluation concerning linguistic and machine-specific phenomena.

    Additional Information

    Dataset Curators

    Correspondence: vmikhailovhse@gmail.com

    Licensing Information

    Our baseline code and acceptability labels are available under the Apache 2.0 license. The copyright (where applicable) of texts from the linguistic publications and resources remains with the original authors or publishers.

    Citation Information

    @inproceedings{mikhailov-etal-2022-rucola,
        title = "{R}u{C}o{LA}: {R}ussian Corpus of Linguistic Acceptability",
        author = "Mikhailov, Vladislav  and
          Shamardina, Tatiana  and
          Ryabinin, Max  and
          Pestova, Alena  and
          Smurov, Ivan  and
          Artemova, Ekaterina",
        booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
        month = dec,
        year = "2022",
        address = "Abu Dhabi, United Arab Emirates",
        publisher = "Association for Computational Linguistics",
        url = "https://aclanthology.org/2022.emnlp-main.348",
        pages = "5207--5227",
        abstract = "Linguistic acceptability (LA) attracts the attention of the research community due to its many uses, such as testing the grammatical knowledge of language models and filtering implausible texts with acceptability classifiers. However, the application scope of LA in languages other than English is limited due to the lack of high-quality resources. To this end, we introduce the Russian Corpus of Linguistic Acceptability (RuCoLA), built from the ground up under the well-established binary LA approach. RuCoLA consists of 9.8k in-domain sentences from linguistic publications and 3.6k out-of-domain sentences produced by generative models. The out-of-domain set is created to facilitate the practical use of acceptability for improving language generation. Our paper describes the data collection protocol and presents a fine-grained analysis of acceptability classification experiments with a range of baseline approaches. In particular, we demonstrate that the most widely used language models still fall behind humans by a large margin, especially when detecting morphological and semantic errors. We release RuCoLA, the code of experiments, and a public leaderboard to assess the linguistic competence of language models for Russian.",
    }
    

    Other

    Please refer to our paper for more details.