Dataset Card for "Wikiomnia"

Dataset Summary

We present the WikiOmnia dataset, a new publicly available set of QA-pairs and corresponding Russian Wikipedia article summary sections, composed with a fully automated generative pipeline. The dataset includes every available article from Wikipedia for the Russian language. The WikiOmnia pipeline is available open-source and is also tested for creating SQuAD-formatted QA on other domains, like news texts, fiction, and social media. The resulting dataset includes two parts: raw data on the whole Russian Wikipedia (7,930,873 QA pairs with paragraphs for ruGPT-3 XL and 7,991,040 QA pairs with paragraphs for ruT5-large) and cleaned data with strict automatic verification (over 160,000 QA pairs with paragraphs for ruGPT-3 XL and over 3,400,000 QA pairs with paragraphs for ruT5-large).

WikiOmnia consists of 2 parts:

the voluminous, automatically generated part: 15,9 million triplets consisting of the original article summary, a corresponding generated question and a generated answer;

the filtered part: the subsample of 3,5 million triplets, fully verified with automatic means

Wikiomnia adheres to a standard SQuAD format problem, resulting in triplets "text paragraph - question based on paragraph - answer from the paragraph", see the following example:

Original Wikipedia paragraph : Коити Масимо (яп. Масимо Ко:ити) — известный режиссёр аниме и основатель японской анимационной студии Bee Train. С момента основания студии он руководит производством почти всех её картин, а также время от времени принимает участие в работе над анимацией и музыкой.

English translation : Koichi Mashimo is a famous anime director and the founder of the Japanese animation studio Bee Train. Since the creation of the studio, he directed almost all studio’s works, and he also sometimes participates in art and sound tasks.

Generated question (ruT5) : Кто является основателем японской анимационной студии Bee Train?

Generated answer (ruT5) : Коити Масимо

English QA translation : Who is the founder of the Japanese animation studio Bee Train? Koichi Mashimo

Dataset Creation

Models used for dataset generation:

ruT5 large fine-tuned on SberQuaD
ruGPT-3 XL fine-tuned on SberQuaD
ruBERT DeepPavlov tuned for QA tasks

Source: Wikipedia version March 2021

Special tokens: <[TEXT]>, <[QUESTION]>, <[ANSWER]>

The resulting dataset includes two parts: raw data on the whole Russian Wikipedia (7,930,873 QA pairs with paragraphs for ruGPT-3 XL and 7,991,040 QA pairs with paragraphs for ruT5- large) and cleaned data with strict automatic verification (over 160,000 QA pairs with paragraphs for ruGPT-3 XL and over 3,400,000 QA pairs with paragraphs for ruT5-large).

Additional Information

Licensing Information

Apache 2.0 license

Citation Information

@inproceedings{pisarevskaya-shavrina-2022-wikiomnia,
    title = "{W}iki{O}mnia: filtration and evaluation of the generated {QA} corpus on the whole {R}ussian {W}ikipedia",
    author = "Pisarevskaya, Dina  and
      Shavrina, Tatiana",
    booktitle = "Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates (Hybrid)",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.gem-1.10",
    pages = "125--135",
    abstract = "The General QA field has been developing the methodology referencing the Stanford Question answering dataset (SQuAD) as the significant benchmark. Compiling factual questions datasets requires manual annotations, limiting the training data{'}s potential size. We present the WikiOmnia dataset, a new publicly available set of QA pairs and corresponding Russian Wikipedia article summary sections, composed with a fully automated generation and filtration pipeline. To ensure high quality of generated QA pairs, diverse manual and automated evaluation techniques were applied. The WikiOmnia pipeline is available open-source and is also tested for creating SQuAD-formatted QA on other domains, like news texts, fiction, and social media. The resulting dataset includes two parts: raw data on the whole Russian Wikipedia (7,930,873 QA pairs with paragraphs for ruGPT-3 XL and 7,991,040 QA pairs with paragraphs for ruT5-large) and cleaned data with strict automatic verification (over 160,000 QA pairs with paragraphs for ruGPT-3 XL and over 3,400,000 QA pairs with paragraphs for ruT5-large).",
}

Contributions

Thanks to @Deenochka , @TatianaShavrina

作者:

RussianNLP

数据集大小:

32.97 GB