Dataset Card for Corpus Carolina

Dataset Summary

Carolina is an Open Corpus for Linguistics and Artificial Intelligence with a robust volume of texts of varied typology in contemporary Brazilian Portuguese (1970-2021). This corpus contains documents and texts extracted from the web and includes information (metadata) about their provenance and typology.

The documents are grouped into taxonomies, and the corpus can be loaded in complete or taxonomy mode. To load a single taxonomy, pass its code as a parameter to the loading script (see the example below). Codes are three-letter strings, and the possible values are:

  • dat : datasets and other corpora;
  • jud : judicial branch;
  • leg : legislative branch;
  • pub : public domain works;
  • soc : social media;
  • uni : university domains;
  • wik : wikis.

Dataset Versioning:

The Carolina Corpus is under continuous development, resulting in multiple versions. The current version is v1.2, but v1.1 is also available. You can access different versions of the corpus using the revision parameter of load_dataset.

Usage Example:

from datasets import load_dataset

# to load all taxonomies
corpus_carolina = load_dataset("carolina-c4ai/corpus-carolina")

# to load social media documents
social_media = load_dataset("carolina-c4ai/corpus-carolina", taxonomy="soc")

# to load previous version
corpus_carolina = load_dataset("carolina-c4ai/corpus-carolina", revision="v1.1")

Supported Tasks

The Carolina corpus was compiled for academic purposes, namely linguistic and computational analysis.

Languages

Contemporary Brazilian Portuguese (1970-2021).

Dataset Structure

Files are stored inside the corpus folder, with a subfolder for each taxonomy. Every file follows an XML structure (TEI P5) and contains multiple extracted documents. For each document, the text and metadata are exposed as the text and meta features, respectively.

Data Instances

Every instance has the following structure:

{
    "meta": datasets.Value("string"),
    "text": datasets.Value("string")
}

Code   Taxonomy                     Instances   Size
       Total                        2107045     11 GB
dat    Datasets and other Corpora   1102049     4.4 GB
wik    Wikis                        960139      5.2 GB
jud    Judicial Branch              40464       1.5 GB
leg    Legislative Branch           13          25 MB
soc    Social Media                 3413        17 MB
uni    University Domains           941         10 MB
pub    Public Domain Works          26          4.5 MB

Data Fields

  • meta : an XML string containing a TEI-conformant teiHeader tag. It is exposed as text and must be parsed to access the actual metadata;
  • text : a string containing the extracted document.
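
Since the meta field arrives as an unparsed XML string, a minimal sketch of reading it with the Python standard library may help. The sample header below is hypothetical and far smaller than a real Carolina teiHeader, and the helper names (local_name, parse_meta) are illustrative rather than part of the corpus tooling. The sketch assumes the string is well-formed XML and strips any namespace from tag names so it works whether or not the TEI namespace is declared.

```python
import xml.etree.ElementTree as ET

# Hypothetical header for illustration only; real Carolina headers carry more fields.
SAMPLE_META = """<teiHeader xmlns="http://www.tei-c.org/ns/1.0">
  <fileDesc>
    <titleStmt>
      <title>Example document</title>
    </titleStmt>
    <sourceDesc>
      <bibl><date when="2020">2020</date></bibl>
    </sourceDesc>
  </fileDesc>
</teiHeader>"""


def local_name(tag: str) -> str:
    """Drop a namespace prefix such as '{http://...}title' from a tag name."""
    return tag.rsplit("}", 1)[-1]


def parse_meta(meta: str) -> dict:
    """Map each leaf element's slash-separated tag path to a list of its text values."""
    root = ET.fromstring(meta)
    fields = {}

    def walk(element, path):
        name = local_name(element.tag)
        current = f"{path}/{name}" if path else name
        children = list(element)
        if not children:
            text = (element.text or "").strip()
            if text:
                fields.setdefault(current, []).append(text)
        for child in children:
            walk(child, current)

    walk(root, "")
    return fields


header = parse_meta(SAMPLE_META)
print(header["teiHeader/fileDesc/titleStmt/title"])
```

With a loaded instance, the same helper would be called as parse_meta(instance["meta"]). The paths that actually appear depend on the header contents of each document.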

Data Splits

As a general corpus, Carolina does not have splits. To load the dataset, use corpus as its single split.
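
As a rough sketch, the single corpus split can be requested directly through the split argument of load_dataset, combined with the taxonomy and revision parameters described earlier. The helper names below (build_load_kwargs, load_carolina) are hypothetical conveniences, not part of the dataset or of the datasets library. The taxonomy check only mirrors the codes listed in this card.

```python
# Taxonomy codes listed in this card.
TAXONOMY_CODES = {"dat", "jud", "leg", "pub", "soc", "uni", "wik"}


def build_load_kwargs(taxonomy=None, revision=None):
    """Assemble keyword arguments for load_dataset (hypothetical helper)."""
    kwargs = {"path": "carolina-c4ai/corpus-carolina", "split": "corpus"}
    if taxonomy is not None:
        if taxonomy not in TAXONOMY_CODES:
            raise ValueError(f"unknown taxonomy code: {taxonomy!r}")
        kwargs["taxonomy"] = taxonomy
    if revision is not None:
        kwargs["revision"] = revision
    return kwargs


def load_carolina(taxonomy=None, revision=None):
    """Download and return the single 'corpus' split."""
    # Imported lazily so the helper above can be used without the package installed.
    from datasets import load_dataset

    return load_dataset(**build_load_kwargs(taxonomy, revision))
```

Calling load_carolina(taxonomy="soc") would then return only the social media documents, and load_carolina(revision="v1.1") the previous version. Both calls download data, so they are left out of the block itself.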

Additional Information

Dataset Curators

The Corpus Carolina is developed by a multidisciplinary team of linguists and computer scientists, members of the Virtual Laboratory of Digital Humanities - LaViHD and the Artificial Intelligence Center of the University of São Paulo - C4AI.

Licensing Information

The Open Corpus for Linguistics and Artificial Intelligence (Carolina) was compiled for academic purposes, namely linguistic and computational analysis. It is composed of texts assembled in various digital repositories, whose licenses are multiple and therefore should be observed when making use of the corpus. The Carolina headers are licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International.

Citation Information

@misc{corpusCarolinaV1.1,
    title={
        Carolina:
        The Open Corpus for Linguistics and Artificial Intelligence
    },
    author={
        Finger, Marcelo and
        Paixão de Sousa, Maria Clara and
        Namiuti, Cristiane and
        Martins do Monte, Vanessa and
        Costa, Aline Silva and
        Serras, Felipe Ribas and
        Sturzeneker, Mariana Lourenço and
        Guets, Raquel de Paula and
        Mesquita, Renata Morais and
        Mello, Guilherme Lamartine de and
        Crespo, Maria Clara Ramos Morales and
        Rocha, Maria Lina de Souza Jeannine and
        Brasil, Patrícia and
        Silva, Mariana Marques da and
        Palma, Mayara Feliciano
    },
    howpublished={\url{
        https://sites.usp.br/corpuscarolina/corpus}},
    year={2022},
    note={Version 1.1 (Ada)},
}