数据集:

brwac

任务:

文本生成

填充掩码

子任务:

language-modeling masked-language-modeling

语言:

计算机处理:

monolingual

大小:

1M<n<10M

语言创建人:

found

批注创建人:

no-annotation

源数据集:

original

许可:

license:unknown

数据集介绍文件清单

中文

Dataset Card for BrWaC

Dataset Summary

The BrWaC (Brazilian Portuguese Web as Corpus) is a large corpus constructed following the Wacky framework, which was made public for research purposes. The current corpus version, released in January 2017, is composed by 3.53 million documents, 2.68 billion tokens and 5.79 million types. Please note that this resource is available solely for academic research purposes, and you agreed not to use it for any commercial applications. Manually download at https://www.inf.ufrgs.br/pln/wiki/index.php?title=BrWaC

Supported Tasks and Leaderboards

[More Information Needed]

Languages

Portuguese

Dataset Structure

Data Instances

An example from the BrWaC dataset looks as follows:

{
  "doc_id": "netg-1afc73",
  "text": {
    "paragraphs": [
      [
        "Conteúdo recente"
      ],
      [
        "ESPUMA MARROM CHAMADA \"NINGUÉM MERECE\""
      ],
      [
        "31 de Agosto de 2015, 7:07 , por paulo soavinski - | No one following this article yet."
      ],
      [
        "Visualizado 202 vezes"
      ],
      [
        "JORNAL ELETRÔNICO DA ILHA DO MEL"
      ],
      [
        "Uma espuma marrom escuro tem aparecido com frequência na Praia de Fora.",
        "Na faixa de areia ela aparece disseminada e não chama muito a atenção.",
        "No Buraco do Aipo, com muitas pedras, ela aparece concentrada.",
        "É fácil saber que esta espuma estranha está lá, quando venta.",
        "Pequenos algodões de espuma começam a flutuar no espaço, pertinho da Praia do Saquinho.",
        "Quem pode ajudar na coleta deste material, envio a laboratório renomado e pagamento de análises, favor entrar em contato com o site."
      ]
    ]
  },
  "title": "ESPUMA MARROM CHAMADA ‟NINGUÃÂM MERECE‟ - paulo soavinski",
  "uri": "http://blogoosfero.cc/ilhadomel/pousadasilhadomel.com.br/espuma-marrom-chamada-ninguem-merece"
}

Data Fields

doc_id : The document ID
title : The document title
uri : URI where the document was extracted from
text : A list of document paragraphs (with a list of sentences in it as a list of strings)

Data Splits

The data is only split into train set with size of 3530796 samples.

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

[More Information Needed]

Citation Information

@inproceedings{wagner2018brwac,
  title={The brwac corpus: A new open resource for brazilian portuguese},
  author={Wagner Filho, Jorge A and Wilkens, Rodrigo and Idiart, Marco and Villavicencio, Aline},
  booktitle={Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
  year={2018}
}

Contributions

Thanks to @jonatasgrosman for adding this dataset.

作者:

佚名

数据集大小:

13.82 KB