Dataset: projecte-aina/catalan_textual_corpus
Task: fill-mask
Language: ca
Multilinguality: monolingual
Size: 10M<n<100M
Language creators: found
Annotation creators: no-annotation
Paper: arxiv:2107.07903
License: cc-by-sa-4.0

The Catalan Textual Corpus is a 1760-million-token web corpus of Catalan built from several sources.
It consists of 1,758,388,896 tokens, 73,172,152 sentences, and 12,556,365 documents. Documents are separated by single new lines, and these boundaries have been preserved wherever the source licenses allowed it.
This corpus is mainly intended to pretrain language models and word representations.
The dataset is in Catalan (ca-CA).
An example instance looks as follows:
{'text': "L'operatiu continuarà durant aquest divendres."}
The dataset contains a single split: train.
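A minimal sketch of loading and inspecting the corpus with the Hugging Face datasets library (assuming the corpus is available on the Hub under the ID above; streaming is used here only to avoid downloading the full corpus up front):

```python
# Minimal sketch: load and inspect the corpus with the Hugging Face `datasets`
# library (assumes `datasets` is installed).
from datasets import load_dataset

# The card lists a single split: "train". Streaming avoids a full download.
ds = load_dataset("projecte-aina/catalan_textual_corpus", split="train", streaming=True)

# Each instance is a dict with a single "text" field, e.g.
# {'text': "L'operatiu continuarà durant aquest divendres."}
for example in ds.take(3):
    print(example["text"])
```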
We created this corpus to contribute to the development of language models in Catalan, a low-resource language.
The Catalan Textual Corpus is a 1760-million-token web corpus of Catalan built from several sources: existing corpora such as DOGC, CaWac (non-dedup version), Oscar (unshuffled version), Open Subtitles, and the Catalan Wikipedia, plus three new crawls: the Catalan General Crawling, obtained by crawling the 500 most popular .cat and .ad domains; the Catalan Government Crawling, obtained by crawling the .gencat domain and subdomains, belonging to the Catalan Government; and the ACN corpus, 220k news items from March 2015 until October 2020 crawled from the Catalan News Agency. For preprocessing we used Corpus-Cleaner, a modular Python-based toolkit to clean raw text corpora through generator pipelines.
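As an illustration of the generator-pipeline idea, a cleaning pass can be composed from small generators that each consume and yield documents lazily. The stages, rules, and threshold below are hypothetical and are not Corpus-Cleaner's actual API:

```python
# Illustrative sketch of a generator-based cleaning pipeline, in the spirit of
# Corpus-Cleaner. Component names and rules here are hypothetical.
import re
from typing import Iterable, Iterator

def strip_markup(docs: Iterable[str]) -> Iterator[str]:
    """Remove leftover HTML tags from each document."""
    for doc in docs:
        yield re.sub(r"<[^>]+>", " ", doc)

def normalize_whitespace(docs: Iterable[str]) -> Iterator[str]:
    """Collapse runs of whitespace into single spaces."""
    for doc in docs:
        yield " ".join(doc.split())

def drop_short(docs: Iterable[str], min_chars: int = 25) -> Iterator[str]:
    """Discard documents that are too short to be useful for pretraining."""
    for doc in docs:
        if len(doc) >= min_chars:
            yield doc

def clean(docs: Iterable[str]) -> Iterator[str]:
    """Chain the stages; documents stream through without being held in memory."""
    return drop_short(normalize_whitespace(strip_markup(docs)))

if __name__ == "__main__":
    raw = ["<p>L'operatiu continuarà durant aquest divendres.</p>   ", "ok"]
    for doc in clean(raw):
        print(doc)
```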
Who are the source language producers?
The original data comes from various sources: existing corpora and crawls of public websites.
The dataset is unannotated.
Annotation process
[N/A]
Who are the annotators?
[N/A]
No anonymisation process was performed.
We hope this corpus contributes to the development of language models in Catalan, a low-resource language.
We are aware that since the data comes from unreliable web pages and multilingual crawled corpora, some biases may be present in the dataset. Nonetheless, we have not applied any steps to reduce their impact.
[N/A]
Text Mining Unit (TeMU) at the Barcelona Supercomputing Center (bsc-temu@bsc.es)
This work was funded by the Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya within the framework of Projecte AINA.
Creative Commons Attribution Share Alike 4.0 International.
@inproceedings{armengol-estape-etal-2021-multilingual,
    title = "Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? {A} Comprehensive Assessment for {C}atalan",
    author = "Armengol-Estap{\'e}, Jordi and
      Carrino, Casimiro Pio and
      Rodriguez-Penagos, Carlos and
      de Gibert Bonet, Ona and
      Armentano-Oller, Carme and
      Gonzalez-Agirre, Aitor and
      Melero, Maite and
      Villegas, Marta",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.437",
    doi = "10.18653/v1/2021.findings-acl.437",
    pages = "4933--4946",
    eprint = {2107.07903},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL}
}
Thanks to @albertvillanova for adding this dataset.