Dataset: projecte-aina/viquiquad
Tasks: question-answering
Sub-tasks: extractive-qa
Languages: ca
Multilinguality: monolingual
Size: 10K<n<100K
Language creators: found
Annotations creators: expert-generated
Source datasets: original
License: cc-by-sa-4.0

ViquiQuAD: an extractive QA dataset for Catalan, built from Wikipedia.
This dataset contains 3111 contexts extracted from a set of 597 high-quality original (not translated) articles in the Catalan Wikipedia ("Viquipèdia"), with 1 to 5 questions, each paired with its answer, for every context.
Viquipèdia articles are used under the CC BY-SA licence.
This dataset can be used to fine-tune and evaluate extractive-QA and Language Models.
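As an illustration of the evaluation use case, the sketch below computes SQuAD v1-style exact-match and token-level F1 scores for a predicted answer span. It is a simplified approximation, not the official SQuAD evaluation script: the function names (`normalize`, `exact_match`, `token_f1`) are hypothetical, and the normalization (lowercasing, stripping ASCII punctuation, collapsing whitespace) is an assumption that deliberately omits the English article removal of the original script, since that step does not carry over to Catalan.

```python
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace (assumed normalization)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> bool:
    """True when both answers are identical after normalization."""
    return normalize(prediction) == normalize(reference)


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall between two answers."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

A partial prediction such as "del fotoperiodisme" scored against the reference "del fotoperiodisme i de la fotografia documental" fails exact match but still receives partial F1 credit, which is why the two metrics are usually reported together.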
Extractive-QA, Language Model
The dataset is in Catalan (ca-CA).
{
  'id': 'P_66_C_391_Q1',
  'title': 'Xavier Miserachs i Ribalta',
  'context': "En aquesta època es va consolidar el concepte modern del reportatge fotogràfic, diferenciat del fotoperiodisme[n. 2] i de la fotografia documental,[n. 3] pel que fa a l'abast i el concepte. El reportatge fotogràfic implica més la idea de relat: un treball que vol més dedicació de temps, un esforç d'interpretació d'una situació i que culmina en un conjunt d'imatges. Això implica, d'una banda, la reivindicació del fotògraf per opinar, fet que li atorgarà estatus d'autor; l'autor proposa, doncs, una interpretació pròpia de la realitat. D'altra banda, el consens que s'estableix entre la majoria de fotògrafs és que el vehicle natural de la imatge fotogràfica és la pàgina impresa. Això suposà que revistes com Life, Paris-Match, Stern o Época assolissin la màxima esplendor en aquest període.",
  'question': 'De què es diferenciava el reportatge fotogràfic?',
  'answers': [{
    'text': 'del fotoperiodisme[n. 2] i de la fotografia documental',
    'answer_start': 92
  }]
}
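In the instance above, `answer_start` locates the answer text inside `context`. The following minimal sketch checks that relationship for a single record, assuming the offsets are character (Unicode code point) positions as in Python string indexing. The helper name `answer_offsets_ok` is hypothetical, and the record uses only the first sentence of the example context to keep the snippet short.

```python
def answer_offsets_ok(record: dict) -> bool:
    """Check that every answer text appears in the context at its answer_start offset."""
    context = record["context"]
    return all(
        context[ans["answer_start"]:ans["answer_start"] + len(ans["text"])] == ans["text"]
        for ans in record["answers"]
    )


# Shortened copy of the example instance (first sentence of the context only).
record = {
    "id": "P_66_C_391_Q1",
    "context": (
        "En aquesta època es va consolidar el concepte modern del reportatge "
        "fotogràfic, diferenciat del fotoperiodisme[n. 2] i de la fotografia "
        "documental,[n. 3] pel que fa a l'abast i el concepte."
    ),
    "question": "De què es diferenciava el reportatge fotogràfic?",
    "answers": [
        {
            "text": "del fotoperiodisme[n. 2] i de la fotografia documental",
            "answer_start": 92,
        }
    ],
}

print(answer_offsets_ok(record))
```

A check like this can be useful before fine-tuning, because tokenizers that map character offsets to token positions will silently produce wrong labels if an offset and its answer text disagree.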
The data format follows that of the SQuAD v1 datasets (Rajpurkar, Pranav et al. (2016)).
We hope this dataset contributes to the development of language models in Catalan, a low-resource language.
The source data are articles scraped from the Catalan Wikipedia site.
From a set of high-quality, non-translated articles in the Catalan Wikipedia, 597 were randomly chosen, and 3111 contexts of 5 to 8 sentences each were extracted from them. We commissioned the creation of between 1 and 5 questions for each context, following an adaptation of the guidelines from SQuAD 1.0 (Rajpurkar, Pranav et al. (2016)). In total, 15153 pairs were created, each consisting of a question and an extracted fragment that contains the answer.
For compatibility with similar datasets in other languages, we followed existing curation guidelines as closely as possible.
Who are the source language producers? Volunteers who collaborate with Catalan Wikipedia.
We commissioned the creation of 1 to 5 questions for each context, following an adaptation of the guidelines from SQuAD 1.0 (Rajpurkar, Pranav et al. (2016)).
Who are the annotators? Annotation was commissioned to a specialized company that hired a team of native speakers.
No personal or sensitive information is included.
Text Mining Unit (TeMU) at the Barcelona Supercomputing Center (bsc-temu@bsc.es)
This work was funded by the Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya within the framework of Projecte AINA.
This work is licensed under an Attribution-ShareAlike 4.0 International License.
@inproceedings{armengol-estape-etal-2021-multilingual,
    title = "Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? {A} Comprehensive Assessment for {C}atalan",
    author = "Armengol-Estap{\'e}, Jordi and Carrino, Casimiro Pio and Rodriguez-Penagos, Carlos and de Gibert Bonet, Ona and Armentano-Oller, Carme and Gonzalez-Agirre, Aitor and Melero, Maite and Villegas, Marta",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.437",
    doi = "10.18653/v1/2021.findings-acl.437",
    pages = "4933--4946",
}