Dataset: carolina-c4ai/corpus-carolina
Language: pt
Multilinguality: monolingual
Size: 1B<n<10B
Language creators: crowdsourced
Annotations creators: no-annotation
Source datasets: original
License: cc-by-nc-sa-4.0

Carolina is an Open Corpus for Linguistics and Artificial Intelligence with a robust volume of texts of varied typology in contemporary Brazilian Portuguese (1970-2021). The corpus contains documents and texts extracted from the web and includes information (metadata) about their provenance and typology.
The documents are grouped into taxonomies, and the corpus can be loaded either in full or one taxonomy at a time. To load a single taxonomy, pass its code as a parameter to the loading script (see the example below). Codes are three-letter strings; the possible values, described in the taxonomy table of this card, are dat, wik, jud, leg, soc, uni, and pub.
Dataset Versioning:
The Carolina Corpus is under continuous development, which results in multiple versions. The current version is v1.2, but v1.1 is also available. You can access different versions of the corpus using the revision parameter of load_dataset.
Usage Example:
    from datasets import load_dataset

    # to load all taxonomies
    corpus_carolina = load_dataset("carolina-c4ai/corpus-carolina")

    # to load social media documents
    social_media = load_dataset("carolina-c4ai/corpus-carolina", taxonomy="soc")

    # to load previous version
    corpus_carolina = load_dataset("carolina-c4ai/corpus-carolina", revision="v1.1")
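The calls above can be wrapped in a small helper that validates the taxonomy code before anything is downloaded. This is only a sketch: `TAXONOMY_CODES`, `build_load_kwargs`, and `load_carolina` are hypothetical names introduced here, and the set of codes is copied from the taxonomy table in this card. The `taxonomy` and `revision` arguments are the ones shown in the usage example.

```python
from typing import Any, Dict, Optional

# Taxonomy codes copied from the table in this card (assumed to be complete).
TAXONOMY_CODES = {"dat", "wik", "jud", "leg", "soc", "uni", "pub"}


def build_load_kwargs(taxonomy: Optional[str] = None,
                      revision: Optional[str] = None) -> Dict[str, Any]:
    """Build keyword arguments for load_dataset (hypothetical helper)."""
    kwargs: Dict[str, Any] = {"path": "carolina-c4ai/corpus-carolina"}
    if taxonomy is not None:
        # Fail early on a mistyped code instead of after a failed download.
        if taxonomy not in TAXONOMY_CODES:
            raise ValueError(f"unknown taxonomy code: {taxonomy!r}")
        kwargs["taxonomy"] = taxonomy
    if revision is not None:
        kwargs["revision"] = revision
    return kwargs


def load_carolina(taxonomy: Optional[str] = None,
                  revision: Optional[str] = None):
    """Load the corpus, or a single taxonomy, at an optional revision."""
    # Imported lazily so build_load_kwargs works without the library installed.
    from datasets import load_dataset
    return load_dataset(**build_load_kwargs(taxonomy, revision))
```

For example, `load_carolina(taxonomy="soc", revision="v1.1")` would request the social media documents from version v1.1, assuming that revision exposes the same taxonomy codes.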
The Carolina corpus was compiled for academic purposes, namely linguistic and computational analysis.
Contemporary Brazilian Portuguese (1970-2021).
Files are stored inside the corpus folder, with a subfolder for each taxonomy. Every file follows an XML structure (TEI P5) and contains multiple extracted documents. For each document, the text and metadata are exposed as the text and meta features, respectively.
Every instance has the following structure:
{ "meta": datasets.Value("string"), "text": datasets.Value("string") }
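Since each instance is a pair of strings, simple corpus statistics can be gathered by iterating over the instances. The sketch below uses two made-up stand-in records rather than real corpus data, and `summarize` is a hypothetical helper; whitespace splitting is only a rough proxy for tokenization.

```python
from typing import Dict, Iterable


def summarize(instances: Iterable[Dict[str, str]]) -> Dict[str, int]:
    """Count documents, characters, and whitespace-separated tokens."""
    documents = 0
    characters = 0
    tokens = 0
    for instance in instances:
        text = instance["text"]
        documents += 1
        characters += len(text)
        tokens += len(text.split())
    return {"documents": documents, "characters": characters, "tokens": tokens}


# Stand-in records with the same two string fields as a corpus instance.
# The contents are invented for illustration only.
sample = [
    {"meta": "placeholder metadata string", "text": "Olá, mundo!"},
    {"meta": "placeholder metadata string", "text": "Um segundo documento de exemplo."},
]

stats = summarize(sample)
print(stats)
```

The same function should work on a loaded dataset, because it only relies on each instance behaving like a dictionary with a `text` key.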
Code | Taxonomy | Instances | Size |
---|---|---|---|
| | Total | 2107045 | 11 GB |
dat | Datasets and other Corpora | 1102049 | 4.4 GB |
wik | Wikis | 960139 | 5.2 GB |
jud | Judicial Branch | 40464 | 1.5 GB |
leg | Legislative Branch | 13 | 25 MB |
soc | Social Media | 3413 | 17 MB |
uni | University Domains | 941 | 10 MB |
pub | Public Domain Works | 26 | 4.5 MB |
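The instance counts in the table can be checked and turned into proportions with a few lines of arithmetic. The sketch below copies the counts from the table; it shows that the per-taxonomy counts add up to the stated total and that dat and wik together account for about 98% of all instances.

```python
# Instance counts per taxonomy code, copied from the table above.
instances = {
    "dat": 1102049,
    "wik": 960139,
    "jud": 40464,
    "leg": 13,
    "soc": 3413,
    "uni": 941,
    "pub": 26,
}

total = sum(instances.values())

# Share of the corpus (by number of instances) held by each taxonomy.
shares = {code: count / total for code, count in instances.items()}

# Print taxonomies from largest to smallest.
for code, count in sorted(instances.items(), key=lambda item: item[1], reverse=True):
    print(f"{code}: {count} instances ({shares[code]:.2%})")

print(f"total: {total}")
```

Note that instance counts and file sizes rank the taxonomies differently: wik has fewer instances than dat but a larger size on disk.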
As a general corpus, Carolina does not have splits. To load the dataset, use corpus as its single split.
The Corpus Carolina is developed by a multidisciplinary team of linguists and computer scientists, members of the Virtual Laboratory of Digital Humanities - LaViHD and the Artificial Intelligence Center of the University of São Paulo - C4AI.
The Open Corpus for Linguistics and Artificial Intelligence (Carolina) was compiled for academic purposes, namely linguistic and computational analysis. It is composed of texts assembled from various digital repositories whose licenses differ, so the license of each source should be observed when making use of the corpus. The Carolina headers are licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International.
    @misc{corpusCarolinaV1.1,
        title={Carolina: The Open Corpus for Linguistics and Artificial Intelligence},
        author={
            Finger, Marcelo and
            Paixão de Sousa, Maria Clara and
            Namiuti, Cristiane and
            Martins do Monte, Vanessa and
            Costa, Aline Silva and
            Serras, Felipe Ribas and
            Sturzeneker, Mariana Lourenço and
            Guets, Raquel de Paula and
            Mesquita, Renata Morais and
            Mello, Guilherme Lamartine de and
            Crespo, Maria Clara Ramos Morales and
            Rocha, Maria Lina de Souza Jeannine and
            Brasil, Patrícia and
            Silva, Mariana Marques da and
            Palma, Mayara Feliciano
        },
        howpublished={\url{https://sites.usp.br/corpuscarolina/corpus}},
        year={2022},
        note={Version 1.1 (Ada)},
    }