Dataset:

PlanTL-GOB-ES/SQAC

Tasks:

question-answering

Sub-tasks:

extractive-qa

Languages:

es

Multilinguality:

monolingual

Language creators:

found

Annotation creators:

expert-generated

Source datasets:

original

arXiv:

arxiv:1606.05250

License:

cc-by-sa-4.0


SQAC (Spanish Question-Answering Corpus)

Dataset Description

SQAC is an extractive QA dataset for the Spanish language.

Paper: MarIA: Spanish Language Models
Point of Contact: carlos.rodriguez1@bsc.es
Leaderboard: [EvalEs](https://plantl-gob-es.github.io/spanish-benchmark/)

Dataset Summary

SQAC contains 6,247 contexts and 18,817 questions with their respective answers, with 1 to 5 questions for each context.

The sources of the contexts are:

Encyclopedic articles from the Spanish Wikipedia, used under CC-by-sa licence.
News articles from Wikinews, used under CC-by licence.
Newswire and literature text from the AnCora corpus, used under CC-by licence.

Supported Tasks

Extractive-QA

Languages

Spanish (es)

Directory Structure

README.md
SQAC.py
dev.json
test.json
train.json

Dataset Structure

Data Instances

{
    'id': '6cf3dcd6-b5a3-4516-8f9e-c5c1c6b66628', 
    'title': 'Historia de Japón', 
    'context': 'La historia de Japón (日本の歴史 o 日本史, Nihon no rekishi / Nihonshi?) es la sucesión de hechos acontecidos dentro del archipiélago japonés. Algunos de estos hechos aparecen aislados e influenciados por la naturaleza geográfica de Japón como nación insular, en tanto que otra serie de hechos, obedece a influencias foráneas como en el caso del Imperio chino, el cual definió su idioma, su escritura y, también, su cultura política. Asimismo, otra de las influencias foráneas fue la de origen occidental, lo que convirtió al país en una nación industrial, ejerciendo con ello una esfera de influencia y una expansión territorial sobre el área del Pacífico. No obstante, dicho expansionismo se detuvo tras la Segunda Guerra Mundial y el país se posicionó en un esquema de nación industrial con vínculos a su tradición cultural.', 
    'question': '¿Qué influencia convirtió Japón en una nación industrial?', 
    'answers': {
        'text': ['la de origen occidental'], 
        'answer_start': [473]
    }
}
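The `answers` field pairs each answer string with an `answer_start` value. Assuming, as in SQuAD, that `answer_start` is a character offset into `context`, the relationship can be sketched as follows. The record below is a shortened, hypothetical example in the same shape as the instance above (not a row from the corpus), and `extract_span` is an illustrative helper name.

```python
def extract_span(context: str, start: int, answer: str) -> str:
    """Return the slice of `context` that an (offset, answer) pair points to."""
    return context[start:start + len(answer)]


# Shortened, hypothetical record shaped like the instance above.
record = {
    "context": "Otra de las influencias foráneas fue la de origen occidental.",
    "question": "¿Qué influencia convirtió Japón en una nación industrial?",
    "answers": {"text": ["la de origen occidental"], "answer_start": []},
}

# Compute the offsets here rather than hard-coding them.
record["answers"]["answer_start"] = [
    record["context"].find(text) for text in record["answers"]["text"]
]

# Under the character-offset assumption, slicing recovers each answer.
for text, start in zip(record["answers"]["text"], record["answers"]["answer_start"]):
    assert extract_span(record["context"], start, text) == text
```

A check of this kind is a cheap way to catch offset drift after any preprocessing that alters the context string, such as whitespace normalization.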

Data Fields

{
  id: str
  title: str
  context: str
  question: str
  answers: {
    answer_start: [int]
    text: [str]
  }
}
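For readers who want typed access to these fields, the schema above could be mirrored in Python roughly as follows. This is a sketch, not part of the dataset's loading script: the names `Answers`, `SQACExample` and `is_valid` are hypothetical, and the requirement that `answer_start` and `text` have equal length is an assumption based on the instance shown earlier.

```python
from typing import List, TypedDict


class Answers(TypedDict):
    answer_start: List[int]
    text: List[str]


class SQACExample(TypedDict):
    id: str
    title: str
    context: str
    question: str
    answers: Answers


def is_valid(example: dict) -> bool:
    """Check that a record has the listed fields with the expected types."""
    for key in ("id", "title", "context", "question"):
        if not isinstance(example.get(key), str):
            return False
    answers = example.get("answers")
    if not isinstance(answers, dict):
        return False
    starts, texts = answers.get("answer_start"), answers.get("text")
    if not isinstance(starts, list) or not isinstance(texts, list):
        return False
    # Assumption: one offset per answer string.
    return (
        len(starts) == len(texts)
        and all(isinstance(s, int) for s in starts)
        and all(isinstance(t, str) for t in texts)
    )
```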

Data Splits

Split	Size
train	15,036
dev	1,864
test	1,910
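As a quick worked example, the share of each split can be computed from the sizes in the table, reading the test size as 1,910 examples. The variable names are illustrative.

```python
# Split sizes from the table above (test size read as 1,910).
split_sizes = {"train": 15036, "dev": 1864, "test": 1910}

total = sum(split_sizes.values())
shares = {name: round(100 * size / total, 2) for name, size in split_sizes.items()}

for name, share in shares.items():
    print(f"{name}: {share}% of {total} examples")
```

This works out to roughly an 80/10/10 partition. The three sizes sum to 18,810, slightly below the 18,817 questions reported in the content analysis; the card does not state the reason for the difference.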

Content analysis

Number of articles, paragraphs and questions

Number of articles: 3,834
Number of contexts: 6,247
Number of questions: 18,817
Number of sentences: 48,026
Questions/Context ratio: 3.01
Sentences/Context ratio: 7.70

Number of tokens

Total tokens in context: 1,561,616
Average tokens/context: 250
Total tokens in questions: 203,235
Average tokens/question: 10.80
Total tokens in answers: 90,307
Average tokens/answer: 4.80
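The ratios and averages in the two lists above follow from the raw counts by simple division. A minimal sketch that recomputes them, so the results can be compared with the reported figures up to rounding:

```python
# Raw counts as reported in the content analysis.
contexts, questions, sentences = 6247, 18817, 48026
context_tokens, question_tokens, answer_tokens = 1561616, 203235, 90307

ratios = {
    "questions/context": questions / contexts,
    "sentences/context": sentences / contexts,
    "tokens/context": context_tokens / contexts,
    "tokens/question": question_tokens / questions,
    "tokens/answer": answer_tokens / questions,
}

for name, value in ratios.items():
    print(f"{name}: {value:.2f}")
```

Note that the answer average divides by the number of questions, which assumes one counted answer per question.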

Lexical variation

46.38% of the words in the Question can be found in the Context.
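The card does not say how this overlap was measured. One plausible reading is the fraction of question tokens that also appear in the paired context, which could be sketched as below. The tokenization (lowercasing, splitting on word characters, no stemming) and the function names are assumptions, so this sketch would not necessarily reproduce the 46.38% figure.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and split on runs of word characters (a simplifying assumption)."""
    return re.findall(r"\w+", text.lower())


def question_overlap(question: str, context: str) -> float:
    """Fraction of question tokens that also occur in the context."""
    q_tokens = tokenize(question)
    if not q_tokens:
        return 0.0
    context_vocab = set(tokenize(context))
    return sum(tok in context_vocab for tok in q_tokens) / len(q_tokens)


question = "¿Qué influencia convirtió Japón en una nación industrial?"
context = (
    "otra de las influencias foráneas fue la de origen occidental, "
    "lo que convirtió al país en una nación industrial"
)
print(f"{question_overlap(question, context):.3f}")
```

Because matching is exact, inflected forms such as "influencia" and "influencias" count as different words here; a lemmatizer would raise the overlap.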

Question type

Question	Count	%
qué	6,381	33.91 %
quién/es	2,952	15.69 %
cuál/es	2,034	10.81 %
cómo	1,949	10.36 %
dónde	1,856	9.86 %
cuándo	1,639	8.71 %
cuánto	1,311	6.97 %
cuántos	495	2.63 %
adónde	100	0.53 %
cuánta	49	0.26 %
no question mark	43	0.23 %
cuántas	19	0.10 %
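The card does not describe how questions were assigned to these types. A simple approach that yields the same labels is to look for the first interrogative word in each question, as sketched below. The matching rule, the `other` fallback and the treatment of "no question mark" are assumptions, so the counts it produces may differ from the table.

```python
import re
from collections import Counter
from typing import Iterable

# Interrogative forms mapped to the labels used in the table above.
LABELS = {
    "qué": "qué", "quién": "quién/es", "quiénes": "quién/es",
    "cuál": "cuál/es", "cuáles": "cuál/es", "cómo": "cómo",
    "dónde": "dónde", "cuándo": "cuándo", "cuánto": "cuánto",
    "cuántos": "cuántos", "adónde": "adónde", "cuánta": "cuánta",
    "cuántas": "cuántas",
}
# Word boundaries keep "cuánto" from matching inside "cuántos",
# and "dónde" from matching inside "adónde".
PATTERN = re.compile(r"\b(" + "|".join(LABELS) + r")\b", re.IGNORECASE)


def question_type(question: str) -> str:
    """Label a question by its first interrogative word."""
    if "?" not in question:
        return "no question mark"
    match = PATTERN.search(question)
    return LABELS[match.group(1).lower()] if match else "other"


def count_types(questions: Iterable[str]) -> Counter:
    """Tally question types over an iterable of question strings."""
    return Counter(question_type(q) for q in questions)
```

For example, `count_types(["¿Qué es?", "¿Quiénes ganaron?"])` tallies one `qué` and one `quién/es`.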

Dataset Creation

Curation Rationale

For compatibility with similar datasets in other languages, we followed the existing curation guidelines from SQuAD 1.0 (Rajpurkar et al.) as closely as possible.

Source Data

Initial Data Collection and Normalization

The source data are scraped articles from Wikinews, the Spanish Wikipedia and the AnCora corpus.

Who are the source language producers?

Contributors to the aforementioned sites.

Annotations

Annotation process

We commissioned the creation of 1 to 5 questions for each context, following an adaptation of the guidelines from SQuAD 1.0 (Rajpurkar et al.).

Who are the annotators?

Native language speakers.

Personal and Sensitive Information

No personal or sensitive information included.

Considerations for Using the Data

Social Impact of Dataset

This corpus contributes to the development of language models in Spanish.

Discussion of Biases

No postprocessing steps were applied to mitigate potential social biases.

Additional Information

Dataset Curators

Text Mining Unit (TeMU) at the Barcelona Supercomputing Center (bsc-temu@bsc.es).

For further information, send an email to plantl-gob-es@bsc.es.

This work was funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL.

Licensing information

This work is licensed under CC Attribution 4.0 International License.

Citation Information

@article{maria,
    author = {Asier Gutiérrez-Fandiño and Jordi Armengol-Estapé and Marc Pàmies and Joan Llop-Palao and Joaquin Silveira-Ocampo and Casimiro Pio Carrino and Carme Armentano-Oller and Carlos Rodriguez-Penagos and Aitor Gonzalez-Agirre and Marta Villegas},
    title = {MarIA: Spanish Language Models},
    journal = {Procesamiento del Lenguaje Natural},
    volume = {68},
    number = {0},
    year = {2022},
    issn = {1989-7553},
    url = {http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6405},
    pages = {39--60}
}

Contributions

[N/A]

Author:

PlanTL-GOB-ES

Dataset size:

13.16 MB