Dataset:

BSC-LT/viquiquad

Language:

Preprint:

arxiv:1606.05250

ViquiQuAD, an extractive QA dataset for Catalan, from Wikipedia

BibTeX citation

If you use any of these resources (datasets or models) in your work, please cite our latest paper:

@inproceedings{armengol-estape-etal-2021-multilingual,
    title = "Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? {A} Comprehensive Assessment for {C}atalan",
    author = "Armengol-Estap{\'e}, Jordi  and
      Carrino, Casimiro Pio  and
      Rodriguez-Penagos, Carlos  and
      de Gibert Bonet, Ona  and
      Armentano-Oller, Carme  and
      Gonzalez-Agirre, Aitor  and
      Melero, Maite  and
      Villegas, Marta",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.437",
    doi = "10.18653/v1/2021.findings-acl.437",
    pages = "4933--4946",
}

Digital Object Identifier (DOI) and access to dataset files

https://doi.org/10.5281/zenodo.4562345

Introduction

This dataset contains 3111 contexts extracted from a set of 597 high-quality original (not translated) articles in the Catalan Wikipedia "Viquipèdia" (ca.wikipedia.org), with 1 to 5 questions and their answers for each context.

Viquipèdia articles are used under the CC BY-SA 3.0 licence ( https://creativecommons.org/licenses/by-sa/3.0/legalcode ).

This dataset can be used to fine-tune and evaluate extractive QA models and language models. It is part of the Catalan Language Understanding Benchmark (CLUB) as presented in:

Armengol-Estapé J., Carrino CP., Rodriguez-Penagos C., de Gibert Bonet O., Armentano-Oller C., Gonzalez-Agirre A., Melero M. and Villegas M., "Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? A Comprehensive Assessment for Catalan". Findings of ACL 2021 (ACL-IJCNLP 2021).

Supported Tasks and Leaderboards

Extractive-QA, Language Model

Languages

CA - Catalan

Directory structure

README
dev.json
test.json
train.json
viquiquad.py

Dataset Structure

Data Instances

JSON files

Data Fields

The fields follow the SQuAD v1 format (Rajpurkar, Pranav et al., 2016); see below for the full reference.

Example:

{
  "data": [
    {
      "title": "Frederick W. Mote",
      "paragraphs": [
        {
          "context": "L'historiador Frederick W. Mote va escriure que l'ús del terme \"classes socials\" per a aquest sistema era enganyós i que la posició de les persones dins del sistema de quatre classes no era una indicació del seu poder social i riquesa reals, sinó que només implicava \"graus de privilegi\" als quals tenien dret institucionalment i legalment, de manera que la posició d'una persona dins de les classes no era una garantia de la seva posició, ja que hi havia xinesos rics i amb bona reputació social, però alhora hi havia menys mongols i semu rics que mongols i semu que vivien en la pobresa i eren maltractats.",
          "qas": [
            {
              "answers": [
                {
                  "text": "Frederick W. Mote",
                  "answer_start": 14
                }
              ],
              "id": "5728848cff5b5019007da298",
              "question": "Qui creia que el sistema de classes socials de Yuan no s’hauria d’anomenar classes socials?"
            },
            ...
          ]
        }
      ]
    },
    ...
  ]
}
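
Because the files follow the SQuAD v1 layout shown above, a nested loop over "data", "paragraphs" and "qas" is enough to flatten them into one record per question. The following is a minimal sketch under that assumption, not the loader shipped in viquiquad.py: the function names iter_examples and answer_is_aligned are illustrative, and the inline sample is a shortened copy of the example above so that the code runs without the dataset files. It also checks that each answer_start offset really points at the answer text.

```python
import json


def iter_examples(squad_dict):
    """Yield one flat record per question from a SQuAD v1 style dictionary."""
    for article in squad_dict["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                yield {
                    "id": qa["id"],
                    "title": article["title"],
                    "context": context,
                    "question": qa["question"],
                    "answers": qa["answers"],
                }


def answer_is_aligned(context, answer):
    """Return True when answer_start indexes the answer text inside context."""
    start = answer["answer_start"]
    return context[start:start + len(answer["text"])] == answer["text"]


# Shortened inline sample (an assumption for illustration only).
# For real use, replace it with json.load() on train.json, dev.json or test.json.
sample = json.loads("""
{"data": [{"title": "Frederick W. Mote", "paragraphs": [{
  "context": "L'historiador Frederick W. Mote va escriure que ...",
  "qas": [{"answers": [{"text": "Frederick W. Mote", "answer_start": 14}],
           "id": "5728848cff5b5019007da298",
           "question": "Qui creia que ...?"}]}]}]}
""")

examples = list(iter_examples(sample))
aligned = all(
    answer_is_aligned(ex["context"], ans)
    for ex in examples
    for ans in ex["answers"]
)
print(len(examples), aligned)
```

Running the alignment check over a whole split is a cheap way to catch offset drift introduced by re-encoding or whitespace normalization of the contexts.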

Data Splits

train, development, test

Content analysis

Number of articles, paragraphs and questions

Number of articles: 597
Number of contexts: 3111
Number of questions: 15153
Questions/context: 4.87
Number of sentences in contexts: 15100
Sentences/context: 4.85
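
The article, context and question counts above can be recomputed from any SQuAD v1 style file by counting list lengths. The sketch below is illustrative only: count_stats is a hypothetical helper name, the toy dictionary stands in for the real files, and sentence counts are left out because the sentence splitter used by the authors is not stated. The last two lines simply redo the ratio arithmetic from the reported totals.

```python
def count_stats(squad_dict):
    """Count articles, contexts and questions in a SQuAD v1 style dictionary."""
    articles = squad_dict["data"]
    n_contexts = sum(len(a["paragraphs"]) for a in articles)
    n_questions = sum(
        len(p["qas"]) for a in articles for p in a["paragraphs"]
    )
    ratio = round(n_questions / n_contexts, 2) if n_contexts else 0.0
    return {
        "articles": len(articles),
        "contexts": n_contexts,
        "questions": n_questions,
        "questions_per_context": ratio,
    }


# Toy input (an assumption for illustration): 2 articles, 3 contexts, 5 questions.
toy = {"data": [
    {"title": "A", "paragraphs": [{"context": "c1", "qas": [{}, {}]},
                                   {"context": "c2", "qas": [{}]}]},
    {"title": "B", "paragraphs": [{"context": "c3", "qas": [{}, {}]}]},
]}
stats = count_stats(toy)
print(stats)

# Ratios recomputed from the totals reported above.
reported_questions_per_context = round(15153 / 3111, 2)
reported_sentences_per_context = round(15100 / 3111, 2)
print(reported_questions_per_context, reported_sentences_per_context)
```

To reproduce the dataset-wide figures, the three split files would need to be merged (or their counts summed) before computing the ratio.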

Number of tokens

Tokens in contexts: 469335
Tokens/context: 150.86
Tokens in questions: 145249
Tokens/question: 9.58
Tokens in answers: 63246
Tokens/answer: 4.17

Lexical variation

After filtering (tokenization, stopword removal, punctuation removal, lowercasing), 83.88% of the words in the questions can also be found in the corresponding context.
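
One way to compute this kind of question/context overlap for a single pair is sketched below. It is a rough approximation under stated assumptions: the regex tokenizer, the short STOPWORDS set and the function names are illustrative choices, since the tokenizer and stopword list actually used are not given here, and it is not specified whether the reported figure is averaged per question or over all question words.

```python
import re

# Illustrative Catalan stopword list only; the list used by the authors is not given.
STOPWORDS = {
    "el", "la", "els", "les", "l", "d", "s", "de", "del", "que", "qui",
    "i", "a", "en", "un", "una", "va", "es", "no",
}


def content_words(text):
    """Lowercase, split on non-word characters and drop stopwords."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def question_overlap(question, context):
    """Fraction of the question's content words that also occur in the context."""
    q_words = content_words(question)
    if not q_words:
        return 0.0
    c_words = set(content_words(context))
    return sum(w in c_words for w in q_words) / len(q_words)


# Made-up sentence pair for illustration.
overlap = question_overlap(
    "Qui va escriure el poema?",
    "L'historiador va escriure el llibre.",
)
print(overlap)
```

In this pair the question keeps two content words ("escriure", "poema") and only the first occurs in the context, so the overlap is one half.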

Question type

Question	Count	%
què	4220	27.85 %
qui	2239	14.78 %
com	1964	12.96 %
quan	1133	7.48 %
on	1580	10.43 %
quant	925	6.1 %
quin	3399	22.43 %
no question mark	21	0.14 %
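
A breakdown like the one above can be approximated by looking for each interrogative word among the tokens of a question. The sketch below is an assumption about how such a count could be made, not the authors' procedure: the mapping of inflected forms (for example "quina" to "quin") and the names QUESTION_WORDS, question_types and count_types are illustrative. A question may fall under several types, which is consistent with the counts in the table adding up to more than the total number of questions.

```python
import re
import unicodedata
from collections import Counter

# Interrogative words from the table; grouping inflected forms is an assumption.
QUESTION_WORDS = {
    "què": {"què"},
    "qui": {"qui"},
    "com": {"com"},
    "quan": {"quan"},
    "on": {"on"},
    "quant": {"quant", "quanta", "quants", "quantes"},
    "quin": {"quin", "quina", "quins", "quines"},
}


def question_types(question):
    """Return every question type whose interrogative word occurs as a token."""
    text = unicodedata.normalize("NFC", question).lower()
    tokens = set(re.findall(r"\w+", text))
    types = [label for label, forms in QUESTION_WORDS.items() if tokens & forms]
    if "?" not in question:
        types.append("no question mark")
    return types


def count_types(questions):
    """Aggregate question types over a list of question strings."""
    counts = Counter()
    for question in questions:
        counts.update(question_types(question))
    return counts


# The first question is the one from the example instance; the others are made up.
questions = [
    "Qui creia que el sistema de classes socials de Yuan no s’hauria d’anomenar classes socials?",
    "Quina ciutat és la capital",
    "On i quan va néixer?",
]
counts = count_types(questions)
print(counts)
```

Matching whole tokens rather than substrings keeps the unaccented conjunction "que" from being counted as "què", and "quan" from being counted inside "quant".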

Question-answer relationships

From 100 randomly selected samples:

Lexical variation: 33.0%
World knowledge: 16.0%
Syntactic variation: 35.0%
Multiple sentence: 17.0%

Dataset Creation

Methodology

From a set of high-quality, non-translated articles in the Catalan Wikipedia (ca.wikipedia.org), 597 were randomly chosen, and 3111 contexts of 5 to 8 sentences each were extracted from them. We commissioned the creation of between 1 and 5 questions for each context, following an adaptation of the SQuAD 1.0 guidelines (Rajpurkar, Pranav et al. “SQuAD: 100,000+ Questions for Machine Comprehension of Text.” EMNLP (2016), http://arxiv.org/abs/1606.05250 ). In total, 15153 pairs of a question and an extracted fragment containing the answer were created.

Curation Rationale

For compatibility with similar datasets in other languages, we followed existing curation guidelines as closely as possible.

Source Data

https://ca.wikipedia.org

Initial Data Collection and Normalization

The source data are articles scraped from the Catalan Wikipedia site ( https://ca.wikipedia.org ).

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

We commissioned the creation of 1 to 5 questions for each context, following an adaptation of the SQuAD 1.0 guidelines (Rajpurkar, Pranav et al. “SQuAD: 100,000+ Questions for Machine Comprehension of Text.” EMNLP (2016), http://arxiv.org/abs/1606.05250 ).

Who are the annotators?

Native language speakers.

Dataset Curators

Carlos Rodríguez and Carme Armentano, from BSC-CNS

Personal and Sensitive Information

No personal or sensitive information included.

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Contact

Carlos Rodríguez-Penagos or Carme Armentano-Oller ( bsc-temu@bsc.es )

License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Author:

BSC-LT

Dataset size:

4.96 MB