Dataset: projecte-aina/teca
Tasks: Text classification
Languages: ca
Multilinguality: monolingual
Language creators: found
Annotations creators: expert-generated
Preprint: arxiv:2107.07903
License: cc-by-nc-nd-4.0

TE-ca is a textual entailment dataset in Catalan. It contains 21,163 premise-hypothesis pairs, each annotated with the inference relation between the two (implication, contradiction or neutral).
This dataset was developed by BSC TeMU as part of Projecte AINA, to enrich the Catalan Language Understanding Benchmark (CLUB).
Textual entailment, Text classification, Language Model
The dataset is in Catalan (ca-CA).
Three JSON files, one for each split.
{
  "id": 3247,
  "premise": "L'ONU adopta a Marràqueix un pacte no vinculant per les migracions",
  "hypothesis": "S'acorden unes recomanacions per les persones migrades a Marràqueix",
  "label": "0"
},
{
  "id": 2825,
  "premise": "L'ONU adopta a Marràqueix un pacte no vinculant per les migracions",
  "hypothesis": "Les persones migrades seran acollides a Marràqueix",
  "label": "1"
},
{
  "id": 2431,
  "premise": "L'ONU adopta a Marràqueix un pacte no vinculant per les migracions",
  "hypothesis": "L'acord impulsat per l'ONU lluny de tancar-se",
  "label": "2"
},
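The records above can be read with Python's standard `json` module. The following is a minimal sketch, not the maintainers' loading code. It assumes each split file holds a JSON array of such objects, which the card does not confirm. The label names are also an assumption, inferred from the three sample hypotheses (0 reads as entailment, 1 as neutral, 2 as contradiction). `LABEL_NAMES` and `parse_records` are illustrative names.

```python
import json

# Assumed mapping, inferred from the sample records; not stated in the card.
LABEL_NAMES = {"0": "entailment", "1": "neutral", "2": "contradiction"}

# The three sample records, wrapped in a JSON array (the wrapping is an assumption).
SAMPLE = """
[
  {"id": 3247,
   "premise": "L'ONU adopta a Marràqueix un pacte no vinculant per les migracions",
   "hypothesis": "S'acorden unes recomanacions per les persones migrades a Marràqueix",
   "label": "0"},
  {"id": 2825,
   "premise": "L'ONU adopta a Marràqueix un pacte no vinculant per les migracions",
   "hypothesis": "Les persones migrades seran acollides a Marràqueix",
   "label": "1"},
  {"id": 2431,
   "premise": "L'ONU adopta a Marràqueix un pacte no vinculant per les migracions",
   "hypothesis": "L'acord impulsat per l'ONU lluny de tancar-se",
   "label": "2"}
]
"""


def parse_records(text):
    """Parse a JSON array of TE-ca style records and attach a readable label name."""
    records = json.loads(text)
    for record in records:
        # Labels are strings in the sample, so normalise to str before the lookup.
        record["label_name"] = LABEL_NAMES.get(str(record["label"]), "unknown")
    return records


records = parse_records(SAMPLE)
for record in records:
    print(record["id"], record["label_name"])
```

Note that the labels are stored as strings rather than integers in the sample, so a classifier pipeline would likely need to cast them before use.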
We created this dataset to contribute to the development of language models in Catalan, a low-resource language.
Source sentences are extracted from the Catalan Textual Corpus and from VilaWeb newswire.
Initial Data Collection and Normalization
We randomly selected 12,000 sentences from the BSC Catalan Textual Corpus, together with 6,200 headers from the Catalan news site VilaWeb. We filtered them by criteria such as length and stand-alone intelligibility. For each selected text, we commissioned 3 hypotheses (one for each entailment category) to be written by a team of native annotators.
Some sentence pairs were excluded because of inconsistencies.
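Because three hypotheses were commissioned per source text but some pairs were later excluded, a premise may no longer carry all three categories. The sketch below shows one way to check this. It assumes records shaped like the data instances in this card, and that identical premise strings identify the same source text. `EXPECTED_LABELS`, `premises_missing_labels` and the toy records are hypothetical and not taken from the dataset.

```python
from collections import defaultdict

# Label values seen in the card's sample records.
EXPECTED_LABELS = {"0", "1", "2"}


def premises_missing_labels(records):
    """Map each premise that lacks a category to its sorted list of missing labels."""
    labels_by_premise = defaultdict(set)
    for record in records:
        labels_by_premise[record["premise"]].add(str(record["label"]))
    return {
        premise: sorted(EXPECTED_LABELS - labels)
        for premise, labels in labels_by_premise.items()
        if not EXPECTED_LABELS <= labels
    }


# Toy records for illustration only.
toy = [
    {"premise": "A", "hypothesis": "a0", "label": "0"},
    {"premise": "A", "hypothesis": "a1", "label": "1"},
    {"premise": "A", "hypothesis": "a2", "label": "2"},
    {"premise": "B", "hypothesis": "b0", "label": "0"},
    {"premise": "B", "hypothesis": "b2", "label": "2"},
]

print(premises_missing_labels(toy))  # {'B': ['1']}
```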
Who are the source language producers?
The Catalan Textual Corpus consists of several corpora gathered from web crawling and public corpora. More information can be found here.
VilaWeb is a Catalan newswire.
We commissioned 3 hypotheses (one for each entailment category) to be written by a team of annotators.
Who are the annotators?
Annotators are a team of native language collaborators from two independent companies.
No personal or sensitive information is included.
We hope this dataset contributes to the development of language models in Catalan, a low-resource language.
[N/A]
[N/A]
Text Mining Unit (TeMU) at the Barcelona Supercomputing Center (bsc-temu@bsc.es)
This work was funded by the Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya within the framework of Projecte AINA.
This work is licensed under an Attribution-NonCommercial-NoDerivatives 4.0 International License.
@inproceedings{armengol-estape-etal-2021-multilingual,
    title = "Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? {A} Comprehensive Assessment for {C}atalan",
    author = "Armengol-Estap{\'e}, Jordi and
      Carrino, Casimiro Pio and
      Rodriguez-Penagos, Carlos and
      de Gibert Bonet, Ona and
      Armentano-Oller, Carme and
      Gonzalez-Agirre, Aitor and
      Melero, Maite and
      Villegas, Marta",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.437",
    doi = "10.18653/v1/2021.findings-acl.437",
    pages = "4933--4946",
}