Datasets:
BSC-LT/viquiquad
Languages:
ca
ArXiv:
arxiv:1606.05250
If you use any of these resources (datasets or models) in your work, please cite our latest paper:
@inproceedings{armengol-estape-etal-2021-multilingual,
  title = "Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? {A} Comprehensive Assessment for {C}atalan",
  author = "Armengol-Estap{\'e}, Jordi and Carrino, Casimiro Pio and Rodriguez-Penagos, Carlos and de Gibert Bonet, Ona and Armentano-Oller, Carme and Gonzalez-Agirre, Aitor and Melero, Maite and Villegas, Marta",
  booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
  month = aug,
  year = "2021",
  address = "Online",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2021.findings-acl.437",
  doi = "10.18653/v1/2021.findings-acl.437",
  pages = "4933--4946",
}
https://doi.org/10.5281/zenodo.4562345
This dataset contains 3111 contexts extracted from a set of 597 high-quality original (non-translated) articles in the Catalan Wikipedia, "Viquipèdia" (ca.wikipedia.org), with 1 to 5 questions and their answers for each context.
Viquipèdia articles are used under the [CC BY-SA](https://creativecommons.org/licenses/by-sa/3.0/legalcode) licence.
This dataset can be used to fine-tune and evaluate extractive QA models and language models. It is part of the Catalan Language Understanding Benchmark (CLUB), as presented in:
Armengol-Estapé J., Carrino CP., Rodriguez-Penagos C., de Gibert Bonet O., Armentano-Oller C., Gonzalez-Agirre A., Melero M. and Villegas M., "Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? A Comprehensive Assessment for Catalan". Findings of ACL 2021 (ACL-IJCNLP 2021).
Extractive-QA, Language Model
CA - Catalan
json files
Follows the SQuAD v1 format (Rajpurkar, Pranav et al., 2016); see below for the full reference.
{
  "data": [
    {
      "title": "Frederick W. Mote",
      "paragraphs": [
        {
          "context": "L'historiador Frederick W. Mote va escriure que l'ús del terme \"classes socials\" per a aquest sistema era enganyós i que la posició de les persones dins del sistema de quatre classes no era una indicació del seu poder social i riquesa reals, sinó que només implicava \"graus de privilegi\" als quals tenien dret institucionalment i legalment, de manera que la posició d'una persona dins de les classes no era una garantia de la seva posició, ja que hi havia xinesos rics i amb bona reputació social, però alhora hi havia menys mongols i semu rics que mongols i semu que vivien en la pobresa i eren maltractats.",
          "qas": [
            {
              "answers": [
                { "text": "Frederick W. Mote", "answer_start": 14 }
              ],
              "id": "5728848cff5b5019007da298",
              "question": "Qui creia que el sistema de classes socials de Yuan no s’hauria d’anomenar classes socials?"
            },
            ...
          ]
        }
      ]
    },
    ...
  ]
}
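The nested structure above can be flattened into one record per question/answer pair, which also makes it easy to check that each `answer_start` offset really points at the answer text. The following is a minimal sketch, not the authors' tooling: the helper names (`load_squad`, `iter_examples`, `answer_is_aligned`) are hypothetical, and the inline sample is a shortened copy of the record shown above.

```python
import json

# Shortened copy of the sample record above (context truncated for brevity).
SAMPLE = {
    "data": [
        {
            "title": "Frederick W. Mote",
            "paragraphs": [
                {
                    "context": "L'historiador Frederick W. Mote va escriure que "
                               "l'ús del terme \"classes socials\" era enganyós.",
                    "qas": [
                        {
                            "answers": [
                                {"text": "Frederick W. Mote", "answer_start": 14}
                            ],
                            "id": "5728848cff5b5019007da298",
                            "question": "Qui creia que el sistema de classes socials "
                                        "de Yuan no s’hauria d’anomenar classes socials?",
                        }
                    ],
                }
            ],
        }
    ]
}


def load_squad(path):
    """Read a SQuAD v1-style JSON file (the path is supplied by the caller)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def iter_examples(squad):
    """Yield one flat dict per (question, answer) pair."""
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    yield {
                        "id": qa["id"],
                        "title": article["title"],
                        "question": qa["question"],
                        "context": paragraph["context"],
                        "answer_text": answer["text"],
                        "answer_start": answer["answer_start"],
                    }


def answer_is_aligned(example):
    """True if the answer text starts at answer_start (a character offset)."""
    start = example["answer_start"]
    end = start + len(example["answer_text"])
    return example["context"][start:end] == example["answer_text"]


examples = list(iter_examples(SAMPLE))
misaligned = [ex["id"] for ex in examples if not answer_is_aligned(ex)]
print(len(examples), "examples,", len(misaligned), "misaligned")
```

To run the same check on a downloaded split, pass the result of `load_squad(...)` to `iter_examples` in place of `SAMPLE`.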
train, development, test
After filtering (tokenization, stopword removal, punctuation removal, case normalization), 83.88% of the words in the questions can be found in the context.
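A lexical-overlap figure of this kind can be approximated as follows. This is a rough sketch under stated assumptions, not the procedure used to obtain the reported number: the tokenizer is a simple regular expression, the stopword list is a tiny hypothetical one, and the function names are invented for illustration. How per-question scores are aggregated over the corpus is also not specified in this card.

```python
import re
import unicodedata

# Hypothetical, deliberately small stopword list; the authors' list is not given.
STOPWORDS = {"el", "la", "els", "les", "de", "del", "que", "a", "en",
             "un", "una", "i", "no", "es", "va", "l", "d"}


def content_words(text):
    """Normalize to NFC, lowercase, split on non-word characters, drop stopwords."""
    text = unicodedata.normalize("NFC", text).lower()
    tokens = re.findall(r"\w+", text)  # punctuation is discarded here
    return [tok for tok in tokens if tok not in STOPWORDS]


def question_overlap(question, context):
    """Fraction of the question's content words that also occur in the context."""
    question_tokens = content_words(question)
    if not question_tokens:
        return 0.0
    context_vocab = set(content_words(context))
    hits = sum(1 for tok in question_tokens if tok in context_vocab)
    return hits / len(question_tokens)


score = question_overlap(
    "Qui va escriure el llibre?",
    "Frederick W. Mote va escriure el llibre.",
)
print(f"{score:.2%}")
```

In this toy pair, the interrogative `qui` is the only content word missing from the context, so two of the three content words overlap.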
| Question word | Count | % |
|---|---|---|
| què | 4220 | 27.85 % |
| qui | 2239 | 14.78 % |
| com | 1964 | 12.96 % |
| quan | 1133 | 7.48 % |
| on | 1580 | 10.43 % |
| quant | 925 | 6.10 % |
| quin | 3399 | 22.43 % |
| no question mark | 21 | 0.14 % |
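The percentages in the table are consistent with a denominator of 15153 questions (for example, 4220 / 15153 ≈ 27.85 %), while the counts add up to more than 15153, which suggests that a question can fall into more than one row. A tally of this kind could be sketched as below. The matching rule is an assumption: this sketch counts exact lowercase tokens only, so inflected forms such as `quina` or `quants` are not matched, and the real procedure may differ.

```python
import re
import unicodedata
from collections import Counter

# Row labels taken from the table above.
QUESTION_WORDS = ["què", "qui", "com", "quan", "on", "quant", "quin"]


def question_types(question):
    """Return every table row a question falls into (possibly several)."""
    text = unicodedata.normalize("NFC", question).lower()
    tokens = set(re.findall(r"\w+", text))
    found = [word for word in QUESTION_WORDS if word in tokens]
    if not question.rstrip().endswith("?"):
        found.append("no question mark")
    return found


def tally(questions):
    """Map each row label to (count, percentage of all questions)."""
    counts = Counter()
    for question in questions:
        counts.update(question_types(question))
    total = len(questions)
    return {label: (n, round(100 * n / total, 2)) for label, n in counts.items()}


demo = [
    "Qui va escriure el llibre?",
    "Quan i on va néixer?",
    "Digues el nom de l'autor",
]
stats = tally(demo)
for label, (count, pct) in stats.items():
    print(f"{label} | {count} | {pct} %")
```

Because the second demo question matches both `quan` and `on`, the counts sum to more than the number of questions, mirroring the behaviour seen in the table.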
From 100 randomly selected samples:
From a set of high-quality, non-translated articles in the Catalan Wikipedia (ca.wikipedia.org), 597 were randomly chosen, and from them 3111 contexts of 5 to 8 sentences each were extracted. We commissioned the creation of between 1 and 5 questions for each context, following an adaptation of the guidelines from SQuAD 1.0 [Rajpurkar, Pranav et al. “SQuAD: 100,000+ Questions for Machine Comprehension of Text.” EMNLP (2016)] (http://arxiv.org/abs/1606.05250). In total, 15153 pairs of a question and an extracted fragment that contains the answer were created.
For compatibility with similar datasets in other languages, we followed existing curation guidelines as closely as possible.
The source data are scraped articles from the Catalan Wikipedia site (https://ca.wikipedia.org).
Who are the source language producers?
[More Information Needed]
We commissioned the creation of 1 to 5 questions for each context, following an adaptation of the guidelines from SQuAD 1.0 (Rajpurkar, Pranav et al. “SQuAD: 100,000+ Questions for Machine Comprehension of Text.” EMNLP (2016)), http://arxiv.org/abs/1606.05250.
Who are the annotators?
Native language speakers.
Carlos Rodríguez and Carme Armentano, from BSC-CNS
No personal or sensitive information included.
[More Information Needed]
[More Information Needed]
[More Information Needed]
Carlos Rodríguez-Penagos or Carme Armentano-Oller ( bsc-temu@bsc.es )
This work is licensed under an Attribution-ShareAlike 4.0 International License.