Datasets: projecte-aina/catalan_government_crawling
Tasks: fill-mask
Languages: ca
Multilinguality: monolingual
Size: 10K<n<100K
Language creators: found
Annotations creators: no-annotation
Source datasets: original
ArXiv: arxiv:2107.07903
License: cc0-1.0

The Catalan Government Crawling Corpus is a 39-million-token web corpus of Catalan. It was obtained by crawling the .gencat domain and its subdomains, which belong to the Catalan Government, during September and October 2020. It consists of 39,117,909 tokens, 1,565,433 sentences and 71,043 documents. Documents are separated by single new lines. It is a subcorpus of the Catalan Textual Corpus.
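The published totals can be turned into per-document and per-sentence averages, and the one-document-per-line layout can be read with a few lines of Python. The following is a minimal sketch, not the authors' tooling: the helper names are hypothetical, and whitespace splitting is an assumption that will not reproduce the official token count, since the card does not name the tokenizer used.

```python
from typing import Iterable, Iterator

# Figures quoted in the corpus description.
TOKENS = 39_117_909
SENTENCES = 1_565_433
DOCUMENTS = 71_043


def iter_documents(lines: Iterable[str]) -> Iterator[str]:
    """Yield one document per non-empty line (assumed raw-text layout)."""
    for line in lines:
        doc = line.strip()
        if doc:
            yield doc


def corpus_stats(lines: Iterable[str]) -> dict:
    """Count documents and whitespace-separated tokens (a rough proxy)."""
    n_docs = 0
    n_tokens = 0
    for doc in iter_documents(lines):
        n_docs += 1
        n_tokens += len(doc.split())
    return {"documents": n_docs, "tokens": n_tokens}


def published_averages() -> dict:
    """Averages derived from the totals published in the description."""
    return {
        "tokens_per_document": TOKENS / DOCUMENTS,
        "tokens_per_sentence": TOKENS / SENTENCES,
        "sentences_per_document": SENTENCES / DOCUMENTS,
    }
```

From the published totals, a document averages roughly 550 tokens and a sentence about 25 tokens. `corpus_stats` accepts any iterable of lines, so an open file handle can be passed directly.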
This corpus is mainly intended to pretrain language models and word representations.
The dataset is in Catalan (ca-CA).
{ 'text': 'Títol: Estudi de tres marededéus del bisbat de Solsona\nResponsables del projecte: Pep Paret conservador–restaurador de l\'Àrea de Pintura i Escultura sobre fusta del CRBMC\nL\'objecte d\'aquest estudi és un millor coneixement de l\'estat de conservació del patrimoni moble català, en concret de tres escultures romàniques del bisbat de Solsona.\nEs du a terme un estudi científic de tres marededéus del bisbat de Solsona: la Mare de Déu de Queralt, la Mare de Déu de Coaner i la Mare de Déu de la Quar.\nLes imatges originals són romàniques, però totes elles han patit modificacions estructurals...' }
The dataset contains a single split: train.
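Since each record has a single text field and the corpus ships as one train split, loading it with the Hugging Face `datasets` library could look like the sketch below. The function names are hypothetical, and whether the call works unchanged depends on the installed `datasets` version and on how the repository is hosted, so treat it as an assumption rather than a tested recipe.

```python
DATASET_ID = "projecte-aina/catalan_government_crawling"


def load_train_split():
    """Load the only split of the corpus from the Hugging Face Hub."""
    # Imported lazily so that preview() works without the library installed.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, split="train")


def preview(record: dict, width: int = 80) -> str:
    """Return the first `width` characters of a record's text on one line."""
    # Collapse newlines and repeated spaces so the preview fits on one line.
    flat = " ".join(record["text"].split())
    return flat[:width]
```

A typical session would call `load_train_split()` once and then `preview(dataset[0])` to inspect the first record.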
We created this corpus to contribute to the development of language models in Catalan, a low-resource language.
The corpus was obtained by crawling all the .gencat.cat domains during July 2020. For preprocessing we used Corpus-Cleaner, a modular Python-based toolkit that cleans raw text corpora through generator pipelines.
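To illustrate what cleaning through generator pipelines means in general, here is a minimal sketch in plain Python. It is not Corpus-Cleaner's API: the stage names, the stages chosen and the length threshold are all assumptions made for the example.

```python
import re
from typing import Callable, Iterable, Iterator

# A stage consumes an iterable of documents and lazily yields documents.
Stage = Callable[[Iterable[str]], Iterator[str]]


def normalize_whitespace(docs: Iterable[str]) -> Iterator[str]:
    """Collapse runs of whitespace and trim each document."""
    for doc in docs:
        yield re.sub(r"\s+", " ", doc).strip()


def drop_short(min_chars: int) -> Stage:
    """Build a stage that discards documents shorter than min_chars."""
    def stage(docs: Iterable[str]) -> Iterator[str]:
        for doc in docs:
            if len(doc) >= min_chars:
                yield doc
    return stage


def deduplicate(docs: Iterable[str]) -> Iterator[str]:
    """Keep only the first occurrence of each exact document."""
    seen = set()
    for doc in docs:
        if doc not in seen:
            seen.add(doc)
            yield doc


def run_pipeline(docs: Iterable[str], stages: Iterable[Stage]) -> Iterable[str]:
    """Chain the stages; nothing is processed until the result is iterated."""
    for stage in stages:
        docs = stage(docs)
    return docs
```

Because every stage is a generator, documents flow through one at a time and the raw crawl never has to fit in memory. For example, `list(run_pipeline(raw_docs, [normalize_whitespace, drop_short(5), deduplicate]))` materializes the cleaned documents.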
Who are the source language producers?
The data comes from the official Catalan Government websites.
The dataset is unannotated.
Annotation process
[N/A]
Who are the annotators?
[N/A]
Since all data comes from public websites, no anonymisation process was performed.
We hope this corpus contributes to the development of language models in Catalan, a low-resource language.
We are aware that since the data comes from public web pages, some biases may be present in the dataset. Nonetheless, we have not applied any steps to reduce their impact.
[N/A]
Text Mining Unit (TeMU) at the Barcelona Supercomputing Center (bsc-temu@bsc.es)
This work was funded by the Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya within the framework of Projecte AINA.
Creative Commons CC0 1.0 Universal.
@inproceedings{armengol-estape-etal-2021-multilingual,
    title = "Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? {A} Comprehensive Assessment for {C}atalan",
    author = "Armengol-Estap{\'e}, Jordi and Carrino, Casimiro Pio and Rodriguez-Penagos, Carlos and de Gibert Bonet, Ona and Armentano-Oller, Carme and Gonzalez-Agirre, Aitor and Melero, Maite and Villegas, Marta",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.437",
    doi = "10.18653/v1/2021.findings-acl.437",
    pages = "4933--4946",
    eprint={2107.07903},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
Thanks to @albertvillanova for adding this dataset.