数据集:
brwac
语言:
pt计算机处理:
monolingual大小:
1M<n<10M语言创建人:
found批注创建人:
no-annotation源数据集:
original许可:
license:unknownThe BrWaC (Brazilian Portuguese Web as Corpus) is a large corpus constructed following the Wacky framework, which was made public for research purposes. The current corpus version, released in January 2017, is composed by 3.53 million documents, 2.68 billion tokens and 5.79 million types. Please note that this resource is available solely for academic research purposes, and you agreed not to use it for any commercial applications. Manually download at https://www.inf.ufrgs.br/pln/wiki/index.php?title=BrWaC
[More Information Needed]
Portuguese
An example from the BrWaC dataset looks as follows:
{ "doc_id": "netg-1afc73", "text": { "paragraphs": [ [ "Conteúdo recente" ], [ "ESPUMA MARROM CHAMADA \"NINGUÉM MERECE\"" ], [ "31 de Agosto de 2015, 7:07 , por paulo soavinski - | No one following this article yet." ], [ "Visualizado 202 vezes" ], [ "JORNAL ELETRÔNICO DA ILHA DO MEL" ], [ "Uma espuma marrom escuro tem aparecido com frequência na Praia de Fora.", "Na faixa de areia ela aparece disseminada e não chama muito a atenção.", "No Buraco do Aipo, com muitas pedras, ela aparece concentrada.", "É fácil saber que esta espuma estranha está lá, quando venta.", "Pequenos algodões de espuma começam a flutuar no espaço, pertinho da Praia do Saquinho.", "Quem pode ajudar na coleta deste material, envio a laboratório renomado e pagamento de análises, favor entrar em contato com o site." ] ] }, "title": "ESPUMA MARROM CHAMADA ‟NINGUÃÂM MERECE‟ - paulo soavinski", "uri": "http://blogoosfero.cc/ilhadomel/pousadasilhadomel.com.br/espuma-marrom-chamada-ninguem-merece" }
The data is only split into train set with size of 3530796 samples.
[More Information Needed]
[More Information Needed]
Who are the source language producers?[More Information Needed]
[More Information Needed]
Who are the annotators?[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
@inproceedings{wagner2018brwac, title={The brwac corpus: A new open resource for brazilian portuguese}, author={Wagner Filho, Jorge A and Wilkens, Rodrigo and Idiart, Marco and Villavicencio, Aline}, booktitle={Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)}, year={2018} }
Thanks to @jonatasgrosman for adding this dataset.