Datasets: BSC-LT/SQAC
Tasks: question-answering
Sub-tasks: extractive-qa
Languages: es
Multilinguality: monolingual
Language creators: found
Annotations creators: expert-generated
Source datasets: original
License: cc-by-sa-4.0
⚠️NOTICE⚠️: THIS DATASET HAS BEEN MOVED TO THE FOLLOWING URL AND WILL SOON BE REMOVED: https://huggingface.co/datasets/PlanTL-GOB-ES/SQAC
@article{DBLP:journals/corr/abs-2107-07253,
  author        = {Asier Guti{\'{e}}rrez{-}Fandi{\~{n}}o and Jordi Armengol{-}Estap{\'{e}} and Marc P{\`{a}}mies and Joan Llop{-}Palao and Joaqu{\'{\i}}n Silveira{-}Ocampo and Casimiro Pio Carrino and Aitor Gonzalez{-}Agirre and Carme Armentano{-}Oller and Carlos Rodr{\'{\i}}guez Penagos and Marta Villegas},
  title         = {Spanish Language Models},
  journal       = {CoRR},
  volume        = {abs/2107.07253},
  year          = {2021},
  url           = {https://arxiv.org/abs/2107.07253},
  archivePrefix = {arXiv},
  eprint        = {2107.07253},
  timestamp     = {Wed, 21 Jul 2021 15:55:35 +0200},
  biburl        = {https://dblp.org/rec/journals/corr/abs-2107-07253.bib},
  bibsource     = {dblp computer science bibliography, https://dblp.org}
}
See the pre-print version of our paper for further details: https://arxiv.org/abs/2107.07253
This dataset contains 6,247 contexts and 18,817 questions with their answers, with 1 to 5 questions per context.
The contexts come from the Spanish Wikipedia, Wikinews, and the AnCora corpus.
This dataset can be used to build extractive-QA systems.
Extractive-QA
ES - Spanish
JSON files
The format follows that of the SQuAD v1 datasets (Rajpurkar, Pranav et al., 2016; see below for the full reference). We added a "source" field that records the source of the context.
{
  "data": [
    {
      "paragraphs": [
        {
          "context": "Al cogote, y fumando como una cafetera. Ah!, no era él, éramos todos nosotros. Luego llegó Billie Holiday. Bajo el epígrafe Arte, la noche temática, pasaron la vida de la única cantante del universo que no es su voz, sino su alma lo que se escucha cuando interpreta. Gata golpeada por el mundo, pateada, violada, enganchada a todos los paraísos artificiales del planeta, jamás encontró el Edén. El Edén lo encontramos nosotros cuando, al concluir la sesión de la tele, pusimos en la doméstica cadena de sonido el mítico Last Recording, su última grabación (marzo de 1959), con la orquesta de Ray Ellis y el piano de Hank Jones. Se estaba muriendo Lady Day, y no obstante, mientras moría, su alma cantaba, Baby, won't you please come home. O sea, niño, criatura, amor, vuelve, a casa por favor.",
          "qas": [
            {
              "question": "¿Quién se incorporó a la reunión más adelante?",
              "id": "c5429572-64b8-4c5d-9553-826f867b07be",
              "answers": [
                {
                  "answer_start": 91,
                  "text": "Billie Holiday"
                }
              ]
            },
            ...
          ]
        }
      ],
      "title": "P_129_20010702_&_P_154_20010102_&_P_108_20000301_c_&_P_108_20000601_d",
      "source": "ancora"
    },
    ...
  ]
}
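Because the files follow the SQuAD v1 layout, each record can be flattened into one row per question, and every `answer_start` offset can be checked against its context. The following is a minimal sketch under stated assumptions, not the authors' tooling: the function names `iter_examples` and `misaligned_answers` are hypothetical, and `sample` is a shortened copy of the record shown above.

```python
from typing import Dict, Iterator, List


def iter_examples(squad: Dict) -> Iterator[Dict]:
    """Yield one flat row per question from a SQuAD v1 style dict."""
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                yield {
                    "id": qa["id"],
                    # "title" and "source" sit at the article level.
                    "title": article.get("title"),
                    "source": article.get("source"),
                    "context": paragraph["context"],
                    "question": qa["question"],
                    "answers": qa["answers"],
                }


def misaligned_answers(squad: Dict) -> List[str]:
    """Return ids of questions whose answer text is not found at answer_start."""
    bad_ids = []
    for row in iter_examples(squad):
        for answer in row["answers"]:
            start = answer["answer_start"]
            end = start + len(answer["text"])
            if row["context"][start:end] != answer["text"]:
                bad_ids.append(row["id"])
                break
    return bad_ids


# Shortened copy of the example record (context truncated for brevity).
sample = {
    "data": [
        {
            "paragraphs": [
                {
                    "context": (
                        "Al cogote, y fumando como una cafetera. Ah!, no era él, "
                        "éramos todos nosotros. Luego llegó Billie Holiday."
                    ),
                    "qas": [
                        {
                            "question": "¿Quién se incorporó a la reunión más adelante?",
                            "id": "c5429572-64b8-4c5d-9553-826f867b07be",
                            "answers": [
                                {"answer_start": 91, "text": "Billie Holiday"}
                            ],
                        }
                    ],
                }
            ],
            "title": "P_129_20010702_&_P_154_20010102_&_P_108_20000301_c_&_P_108_20000601_d",
            "source": "ancora",
        }
    ]
}

rows = list(iter_examples(sample))
print(rows[0]["source"], misaligned_answers(sample))
```

To run the same check on a downloaded file, parse it with `json.load` and pass the resulting dict to `misaligned_answers`. The offsets are character positions counted from zero.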
46.38 % of the words in the questions can also be found in their contexts.
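One way to compute a lexical-overlap figure of this kind is sketched below. The card does not describe the tokenization or counting rule behind the reported value, so everything here is an assumption: lowercased tokens, runs of word characters as words, and hypothetical function names `tokenize` and `question_overlap`.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into runs of word characters."""
    # In Python 3, \w is Unicode-aware for str patterns, so accented letters are kept.
    return re.findall(r"\w+", text.lower())


def question_overlap(question: str, context: str) -> float:
    """Fraction of question tokens that also occur somewhere in the context."""
    q_tokens = tokenize(question)
    if not q_tokens:
        return 0.0
    context_vocab = set(tokenize(context))
    hits = sum(1 for token in q_tokens if token in context_vocab)
    return hits / len(q_tokens)


# Illustrative question (not taken from the dataset) against a short context.
print(question_overlap("¿Quién llegó luego?", "Luego llegó Billie Holiday."))
```

Averaging this ratio over all question and context pairs would give a corpus-level percentage. A different tokenizer or a stop-word filter would change the result.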
Question | Count | % |
---|---|---|
qué | 6,381 | 33.91 % |
quién/es | 2,952 | 15.69 % |
cuál/es | 2,034 | 10.81 % |
cómo | 1,949 | 10.36 % |
dónde | 1,856 | 9.86 % |
cuándo | 1,639 | 8.71 % |
cuánto | 1,311 | 6.97 % |
cuántos | 495 | 2.63 % |
adónde | 100 | 0.53 % |
cuánta | 49 | 0.26 % |
no question mark | 43 | 0.23 % |
cuántas | 19 | 0.10 % |
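
A tally like the one in the table can be approximated by bucketing each question on its first interrogative word. This is only a sketch: the rule the authors used is not documented here, and the names `QUESTION_WORDS`, `question_type`, and `tally`, as well as the "other" bucket, are assumptions made for illustration.

```python
import re
from collections import Counter
from typing import Iterable

# Map each accented interrogative form to the label used in the table.
QUESTION_WORDS = {
    "qué": "qué",
    "quién": "quién/es",
    "quiénes": "quién/es",
    "cuál": "cuál/es",
    "cuáles": "cuál/es",
    "cómo": "cómo",
    "dónde": "dónde",
    "cuándo": "cuándo",
    "cuánto": "cuánto",
    "cuántos": "cuántos",
    "adónde": "adónde",
    "cuánta": "cuánta",
    "cuántas": "cuántas",
}


def question_type(question: str) -> str:
    """Return the table label for a question; the first interrogative word wins."""
    if "?" not in question:
        return "no question mark"
    for token in re.findall(r"\w+", question.lower()):
        if token in QUESTION_WORDS:
            return QUESTION_WORDS[token]
    return "other"  # assumed bucket; the table has no such row


def tally(questions: Iterable[str]) -> Counter:
    """Count questions per label."""
    return Counter(question_type(q) for q in questions)


counts = tally([
    "¿Quién se incorporó a la reunión más adelante?",
    "¿En qué año se grabó el disco?",  # illustrative, not from the dataset
])
print(counts)
```

Matching is accent-sensitive on purpose, so that the relative pronoun "que" is not counted as the interrogative "qué".
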
6,247 contexts were randomly chosen from the three corpora described below. We commissioned the creation of between 1 and 5 questions for each context, following an adaptation of the SQuAD 1.0 guidelines (Rajpurkar, Pranav et al. “SQuAD: 100,000+ Questions for Machine Comprehension of Text.” EMNLP (2016)). In total, 18,817 pairs of a question and an extracted fragment that contains the answer were created.
For compatibility with similar datasets in other languages, we followed existing curation guidelines as closely as possible. We also created another QA dataset with Wikipedia to ensure thematic and stylistic variety.
The source data are articles scraped from the Spanish Wikipedia and Wikinews sites and from the AnCora corpus.
Who are the source language producers?
[More Information Needed]
We commissioned the creation of 1 to 5 questions for each context, following an adaptation of the SQuAD 1.0 guidelines (Rajpurkar, Pranav et al. “SQuAD: 100,000+ Questions for Machine Comprehension of Text.” EMNLP (2016)).
Who are the annotators?
Native language speakers.
Carlos Rodríguez and Carme Armentano, from BSC-CNS.
No personal or sensitive information is included.
[More Information Needed]
[More Information Needed]
[More Information Needed]
Carlos Rodríguez-Penagos or Carme Armentano-Oller (bsc-temu@bsc.es)
This work was partially funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.