Dataset:
BSC-LT/xquad-ca
Language:
ca
Preprint repository:
arxiv:1910.11856
If you use any of these resources (datasets or models) in your work, please cite our latest paper:
@inproceedings{armengol-estape-etal-2021-multilingual,
    title = "Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? {A} Comprehensive Assessment for {C}atalan",
    author = "Armengol-Estap{\'e}, Jordi and
      Carrino, Casimiro Pio and
      Rodriguez-Penagos, Carlos and
      de Gibert Bonet, Ona and
      Armentano-Oller, Carme and
      Gonzalez-Agirre, Aitor and
      Melero, Maite and
      Villegas, Marta",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.437",
    doi = "10.18653/v1/2021.findings-acl.437",
    pages = "4933--4946",
}
https://doi.org/10.5281/zenodo.4526224
Professional translation into Catalan of the XQuAD dataset (https://github.com/deepmind/xquad).
XQuAD (Cross-lingual Question Answering Dataset) is a benchmark dataset for evaluating cross-lingual question answering performance. The dataset consists of a subset of 240 paragraphs and 1190 question-answer pairs from the development set of SQuAD v1.1 (Rajpurkar et al., 2016) together with their professional translations into ten languages: Spanish, German, Greek, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, and Hindi. Romanian was added later. We added Catalan as the 13th language of the corpus, also using professional native Catalan translators.
The XQuAD and XQuAD-Ca datasets are released under the [CC-by-sa](https://creativecommons.org/licenses/by-sa/3.0/legalcode) licence.
Cross-lingual-QA, Extractive-QA, Language Model
CA - Catalan
One JSON file.
Follows the format of the SQuAD v1 datasets (Rajpurkar, Pranav et al., 2016); see below for the full reference.
{
  "data": [
    {
      "context": "Al llarg de la seva existència, Varsòvia ha estat una ciutat multicultural. Segons el cens del 1901, de 711.988 habitants, el 56,2 % eren catòlics, el 35,7 % jueus, el 5 % cristians ortodoxos grecs i el 2,8 % protestants. Vuit anys després, el 1909, hi havia 281.754 jueus (36,9 %), 18.189 protestants (2,4 %) i 2.818 mariavites (0,4 %). Això va provocar que es construïssin centenars de llocs de culte religiós a totes les parts de la ciutat. La majoria d’ells es van destruir després de la insurrecció de Varsòvia del 1944. Després de la guerra, les noves autoritats comunistes de Polònia van apocar la construcció d’esglésies i només se’n va construir un petit nombre.",
      "qas": [
        {
          "answers": [ { "text": "711.988", "answer_start": 104 } ],
          "id": "57338007d058e614000b5bdb",
          "question": "Quina era la població de Varsòvia l’any 1901?"
        },
        {
          "answers": [ { "text": "56,2 %", "answer_start": 126 } ],
          "id": "57338007d058e614000b5bdc",
          "question": "Dels habitants de Varsòvia l’any 1901, quin percentatge era catòlic?"
        },
        ...
      ]
    }
  ]
},
...
]
}
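In the sample above, `answer_start` is a character offset into `context`, as in SQuAD v1. The following is a minimal sketch of how that alignment could be checked for a single paragraph record. The function name `misaligned_answers` and the shortened `sample` record are illustrative assumptions, not part of the dataset or of any official tooling, and the sketch assumes offsets count Unicode code points as Python strings do.

```python
from typing import Iterator

# Shortened excerpt of the sample record shown above (assumption: only the
# first sentences of the context are kept, enough to cover both answers).
sample = {
    "context": (
        "Al llarg de la seva existència, Varsòvia ha estat una ciutat "
        "multicultural. Segons el cens del 1901, de 711.988 habitants, "
        "el 56,2 % eren catòlics, el 35,7 % jueus"
    ),
    "qas": [
        {
            "id": "57338007d058e614000b5bdb",
            "answers": [{"text": "711.988", "answer_start": 104}],
        },
        {
            "id": "57338007d058e614000b5bdc",
            "answers": [{"text": "56,2 %", "answer_start": 126}],
        },
    ],
}


def misaligned_answers(paragraph: dict) -> Iterator[str]:
    """Yield the id of each question with an answer not found at answer_start."""
    context = paragraph["context"]
    for qa in paragraph["qas"]:
        for answer in qa["answers"]:
            start = answer["answer_start"]
            end = start + len(answer["text"])
            # The slice of the context must reproduce the answer text exactly.
            if context[start:end] != answer["text"]:
                yield qa["id"]
                break  # report each question at most once


bad_ids = list(misaligned_answers(sample))
print(bad_ids)  # prints [] because both sample offsets line up
```

A check of this kind is mostly useful after re-tokenising or normalising the contexts, since any change to the text shifts the character offsets.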
One
For more information on how XQuAD was created, refer to the paper On the Cross-lingual Transferability of Monolingual Representations (https://arxiv.org/abs/1910.11856) or visit the webpage https://github.com/deepmind/xquad
Translation into Catalan was commissioned by BSC TeMU within the AINA project.
For compatibility with similar datasets in other languages, and to allow inter-lingual comparisons.
Professional translation of XQuAD into Catalan
Who are the source language producers?
For more information on how XQuAD was created, refer to the paper On the Cross-lingual Transferability of Monolingual Representations (https://arxiv.org/abs/1910.11856) or visit the webpage https://github.com/deepmind/xquad
None
Who are the annotators?
Translation was commissioned from a professional translation company.
Carlos Rodríguez and Carme Armentano, from BSC-CNS
No personal or sensitive information is included.
[More Information Needed]
[More Information Needed]
[More Information Needed]
Carlos Rodríguez-Penagos or Carme Armentano-Oller ( bsc-temu@bsc.es )
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.