Dataset: harem
Task: token classification
Language: pt
Multilinguality: monolingual
Size: n<1K
Language creators: found
Annotation creators: expert-generated
Source datasets: original
License: unknown

HAREM is a Portuguese corpus commonly used for Named Entity Recognition (NER) tasks. It contains about 93k words from 129 different texts, covering several genres and language varieties. The split of this dataset version follows the division made by [1], where 7% of the HAREM documents form the validation set and the miniHAREM corpus (about 65k words) is the test set. There are two versions of the dataset: one with a total of 10 named entity classes (Person, Organization, Location, Value, Date, Title, Thing, Event, Abstraction, and Other) and a "selective" version with only 5 classes (Person, Organization, Location, Value, and Date).
Note that the original HAREM dataset has two levels of NER detail, "Category" and "Sub-type". The dataset version processed here uses only the "Category" level.
[1] Souza, Fábio, Rodrigo Nogueira, and Roberto Lotufo. "BERTimbau: Pretrained BERT Models for Brazilian Portuguese." Brazilian Conference on Intelligent Systems. Springer, Cham, 2020.
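The relationship between the two versions can be illustrated with a small sketch. It assumes that the Portuguese category names used in the dataset's tags (for example `PESSOA`, `TEMPO`, `OBRA`) correspond to the English class names above as shown in `CATEGORY_NAMES`, and that a selective-style labelling can be approximated by replacing every tag outside the 5 selected categories with `O`. Both the mapping and the helper `to_selective` are illustrative assumptions, not a description of how the selective version was actually built.

```python
# Assumed correspondence between Portuguese tag categories and the
# English class names used in the description (illustrative only).
CATEGORY_NAMES = {
    "PESSOA": "Person",
    "ORGANIZACAO": "Organization",
    "LOCAL": "Location",
    "VALOR": "Value",
    "TEMPO": "Date",
    "OBRA": "Title",
    "COISA": "Thing",
    "ACONTECIMENTO": "Event",
    "ABSTRACCAO": "Abstraction",
    "OUTRO": "Other",
}

# The 5 categories kept by the "selective" version.
SELECTIVE = {"PESSOA", "ORGANIZACAO", "LOCAL", "VALOR", "TEMPO"}


def to_selective(labels):
    """Hypothetical helper: map tags outside the selective set to "O"."""
    result = []
    for label in labels:
        if label == "O":
            result.append(label)
            continue
        # Tags look like "B-PESSOA" or "I-OBRA": split prefix from category.
        _, category = label.split("-", 1)
        result.append(label if category in SELECTIVE else "O")
    return result


full = ["B-PESSOA", "I-PESSOA", "O", "B-OBRA", "I-OBRA", "B-TEMPO"]
print(to_selective(full))
```

Here the `OBRA` (Title) span is dropped while the `PESSOA` and `TEMPO` spans are kept.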
The text in the dataset is in Portuguese.
{ "id": "HAREM-871-07800", "ner_tags": [3, 0, 0, 3, 4, 4, 4, 4, 4, 4, 4, 4], "tokens": ["Abraço", "Página", "Principal", "ASSOCIAÇÃO", "DE", "APOIO", "A", "PESSOAS", "COM", "VIH", "/", "SIDA"] }
The NER tags correspond to this list:
"O", "B-PESSOA", "I-PESSOA", "B-ORGANIZACAO", "I-ORGANIZACAO", "B-LOCAL", "I-LOCAL", "B-TEMPO", "I-TEMPO", "B-VALOR", "I-VALOR", "B-ABSTRACCAO", "I-ABSTRACCAO", "B-ACONTECIMENTO", "I-ACONTECIMENTO", "B-COISA", "I-COISA", "B-OBRA", "I-OBRA", "B-OUTRO", "I-OUTRO"
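To make the correspondence concrete, the following sketch decodes the integer `ner_tags` of the example instance into label strings. It assumes the integers are zero-based positions in the list above, in the order shown; the names `LABELS` and `decode_tags` are introduced here for illustration only.

```python
# Label list copied from the dataset card; the index in this list is
# assumed to be the integer id stored in "ner_tags" (zero-based).
LABELS = [
    "O",
    "B-PESSOA", "I-PESSOA",
    "B-ORGANIZACAO", "I-ORGANIZACAO",
    "B-LOCAL", "I-LOCAL",
    "B-TEMPO", "I-TEMPO",
    "B-VALOR", "I-VALOR",
    "B-ABSTRACCAO", "I-ABSTRACCAO",
    "B-ACONTECIMENTO", "I-ACONTECIMENTO",
    "B-COISA", "I-COISA",
    "B-OBRA", "I-OBRA",
    "B-OUTRO", "I-OUTRO",
]

# Example instance from the dataset card.
example = {
    "id": "HAREM-871-07800",
    "ner_tags": [3, 0, 0, 3, 4, 4, 4, 4, 4, 4, 4, 4],
    "tokens": ["Abraço", "Página", "Principal", "ASSOCIAÇÃO", "DE", "APOIO",
               "A", "PESSOAS", "COM", "VIH", "/", "SIDA"],
}


def decode_tags(tag_ids, labels=LABELS):
    """Hypothetical helper: turn integer tag ids into label strings."""
    return [labels[i] for i in tag_ids]


decoded = decode_tags(example["ner_tags"])
for token, label in zip(example["tokens"], decoded):
    print(f"{token}\t{label}")
```

Under that assumption, tag `3` is `B-ORGANIZACAO` and tag `4` is `I-ORGANIZACAO`, so the example contains two organization mentions separated by two `O` tokens.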
The NER tags have the same format as in the CoNLL shared task: a B denotes the first item of a phrase and an I any non-initial word.
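As a rough illustration of this B/I scheme, the sketch below groups a tagged token sequence into entity spans. The function `extract_entities` is a hypothetical helper written for this example, and its handling of a stray `I-` tag (one with no matching open span) is a simplifying assumption: such tokens are treated as outside any entity.

```python
def extract_entities(tokens, labels):
    """Group B-/I- tagged tokens into (category, text) spans."""
    entities = []
    current_tokens, current_type = [], None

    def flush():
        # Close the span currently being built, if there is one.
        if current_tokens:
            entities.append((current_type, " ".join(current_tokens)))

    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            # A B- tag always starts a new span.
            flush()
            current_tokens, current_type = [token], label[2:]
        elif label.startswith("I-") and current_type == label[2:]:
            # An I- tag of the same category continues the open span.
            current_tokens.append(token)
        else:
            # "O", or an I- tag without a matching open span, ends the span.
            flush()
            current_tokens, current_type = [], None
    flush()
    return entities


tokens = ["Abraço", "Página", "Principal", "ASSOCIAÇÃO", "DE", "APOIO",
          "A", "PESSOAS", "COM", "VIH", "/", "SIDA"]
labels = ["B-ORGANIZACAO", "O", "O", "B-ORGANIZACAO"] + ["I-ORGANIZACAO"] * 8

for category, text in extract_entities(tokens, labels):
    print(category, "->", text)
```

For the example instance this yields two `ORGANIZACAO` spans: the single token "Abraço" and the twelve-token tail starting at "ASSOCIAÇÃO".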
The data is split into train, validation, and test sets for each of the two versions (default and selective). The split sizes, counted in documents, are as follows:
| Train | Validation | Test |
|---|---|---|
| 121 | 8 | 128 |
@inproceedings{santos2006harem,
  title={HAREM: An Advanced NER Evaluation Contest for Portuguese},
  author={Santos, Diana and Seco, Nuno and Cardoso, Nuno and Vilela, Rui},
  editor={Calzolari, Nicoletta and Choukri, Khalid and Gangemi, Aldo and Maegaard, Bente and Mariani, Joseph and Odijk, Jan and Tapias, Daniel},
  booktitle={Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'2006), Genoa, Italy, 22-28 May 2006},
  year={2006}
}
Thanks to @jonatasgrosman for adding this dataset.