cc-by-sa-4.0
Professional translation into Catalan of the XQuAD dataset.
XQuAD (Cross-lingual Question Answering Dataset) is a benchmark dataset for evaluating cross-lingual question answering performance. The dataset consists of a subset of 240 paragraphs and 1190 question-answer pairs from the development set of SQuAD v1.1 (Rajpurkar, Pranav et al., 2016) together with their professional translations into ten languages: Spanish, German, Greek, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, and Hindi. Romanian was added later. We added Catalan as the 13th language of the corpus, also using professional native translators.
The XQuAD and XQuAD-Ca datasets are released under a CC BY-SA licence.
Cross-lingual-QA, Extractive-QA, Language Model
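Extractive QA on SQuAD-style data is commonly scored with exact match and token-level F1. The following is a minimal sketch of both metrics, not the official evaluation script: the function names are illustrative, and the normalization is a simplification (the official SQuAD v1.1 script additionally strips the English articles "a", "an", and "the", which is omitted here because it does not apply to Catalan).

```python
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    # string.punctuation is ASCII-only, so the typographic apostrophe (’)
    # used in the Catalan text is kept as part of its token.
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("56,2 %", "56,2 %"))  # prints 1.0
print(token_f1("el 56,2 %", "56,2 %"))
```

A prediction that contains the gold answer plus extra tokens scores 0 on exact match but receives partial credit from F1, which is why the two metrics are usually reported together.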
The dataset is in Catalan (ca-CA).
One JSON file.
1189 examples.
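A minimal sketch for loading the file and counting its question-answer pairs. The file name `xquad-ca.json` is a placeholder, not the actual name of the released file, and the function names are illustrative. SQuAD v1.1 nests each context under a "paragraphs" list, whereas the abbreviated sample in this card shows "context" directly under "data", so the counter accepts either layout.

```python
import json
from pathlib import Path


def load_squad_json(path: str) -> dict:
    """Read a SQuAD-style JSON file as UTF-8 and return the parsed dict."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def count_questions(dataset: dict) -> int:
    """Count question-answer pairs across all contexts."""
    total = 0
    for entry in dataset["data"]:
        # Use entry["paragraphs"] when present (SQuAD v1.1 layout);
        # otherwise treat the entry itself as a single paragraph.
        for paragraph in entry.get("paragraphs", [entry]):
            total += len(paragraph["qas"])
    return total


if __name__ == "__main__":
    path = "xquad-ca.json"  # hypothetical path; replace with the real file
    if Path(path).exists():
        print(count_questions(load_squad_json(path)))
```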
{
  "data": [
    {
      "context": "Al llarg de la seva existència, Varsòvia ha estat una ciutat multicultural. Segons el cens del 1901, de 711.988 habitants, el 56,2 % eren catòlics, el 35,7 % jueus, el 5 % cristians ortodoxos grecs i el 2,8 % protestants. Vuit anys després, el 1909, hi havia 281.754 jueus (36,9 %), 18.189 protestants (2,4 %) i 2.818 mariavites (0,4 %). Això va provocar que es construïssin centenars de llocs de culte religiós a totes les parts de la ciutat. La majoria d’ells es van destruir després de la insurrecció de Varsòvia del 1944. Després de la guerra, les noves autoritats comunistes de Polònia van apocar la construcció d’esglésies i només se’n va construir un petit nombre.",
      "qas": [
        {
          "answers": [
            {
              "text": "711.988",
              "answer_start": 104
            }
          ],
          "id": "57338007d058e614000b5bdb",
          "question": "Quina era la població de Varsòvia l’any 1901?"
        },
        {
          "answers": [
            {
              "text": "56,2 %",
              "answer_start": 126
            }
          ],
          "id": "57338007d058e614000b5bdc",
          "question": "Dels habitants de Varsòvia l’any 1901, quin percentatge era catòlic?"
        },
        ...
      ]
    }
  ]
},
...
]
}
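In the sample above, "answer_start" is a zero-based character offset into "context". The sketch below checks that each answer text actually sits at its stated offset, which is a useful sanity check for translated data. The function name is illustrative, and the context is shortened to the opening sentences of the sample (the shortened string ends earlier than the original but keeps the same offsets).

```python
from typing import List


def misaligned_answers(paragraph: dict) -> List[str]:
    """Return ids of questions whose answer text is not found at answer_start."""
    context = paragraph["context"]
    bad_ids = []
    for qa in paragraph["qas"]:
        for answer in qa["answers"]:
            start = answer["answer_start"]
            end = start + len(answer["text"])
            if context[start:end] != answer["text"]:
                bad_ids.append(qa["id"])
    return bad_ids


# Shortened copy of the sample paragraph; question texts are omitted.
paragraph = {
    "context": (
        "Al llarg de la seva existència, Varsòvia ha estat una ciutat "
        "multicultural. Segons el cens del 1901, de 711.988 habitants, "
        "el 56,2 % eren catòlics."
    ),
    "qas": [
        {
            "answers": [{"text": "711.988", "answer_start": 104}],
            "id": "57338007d058e614000b5bdb",
        },
        {
            "answers": [{"text": "56,2 %", "answer_start": 126}],
            "id": "57338007d058e614000b5bdc",
        },
    ],
}

print(misaligned_answers(paragraph))  # prints []
```

Offsets are counted in Unicode code points, so accented letters such as "è" and "ò" each count as one character as long as the text is in composed (NFC) form.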
The format follows that of the SQuAD v1 datasets (Rajpurkar, Pranav et al., 2016).
We created this dataset to contribute to the development of language models in Catalan, a low-resource language, to ensure compatibility with similar datasets in other languages, and to allow cross-lingual comparisons.
This dataset is a professional translation of XQuAD into Catalan, commissioned by BSC TeMU within Projecte AINA.
For more information on how XQuAD was created, refer to the paper On the Cross-lingual Transferability of Monolingual Representations, or visit the XQuAD webpage.
Who are the source language producers?
For more information on how XQuAD was created, refer to the paper On the Cross-lingual Transferability of Monolingual Representations, or visit the XQuAD webpage.
This is a professional translation of the XQuAD corpus and its annotations.
Annotation process
N/A
Who are the annotators?
The translation was commissioned from a professional translation company.
No personal or sensitive information is included.
This dataset contributes to the development of language models in Catalan, a low-resource language.
Carlos Rodríguez-Penagos (carlos.rodriguez1@bsc.es) and Carme Armentano-Oller (carme.armentano@bsc.es) from BSC-CNS.
This work was funded by the Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya within the framework of Projecte AINA.
Text Mining Unit (TeMU) at the Barcelona Supercomputing Center (bsc-temu@bsc.es).
@inproceedings{armengol-estape-etal-2021-multilingual,
    title = "Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? {A} Comprehensive Assessment for {C}atalan",
    author = "Armengol-Estap{\'e}, Jordi and
      Carrino, Casimiro Pio and
      Rodriguez-Penagos, Carlos and
      de Gibert Bonet, Ona and
      Armentano-Oller, Carme and
      Gonzalez-Agirre, Aitor and
      Melero, Maite and
      Villegas, Marta",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.437",
    doi = "10.18653/v1/2021.findings-acl.437",
    pages = "4933--4946",
}