数据集:
ruanchaves/hatebr
任务:
文本分类语言:
pt计算机处理:
monolingual大小:
1K<n<10K语言创建人:
found批注创建人:
expert-generated源数据集:
original其他:
instagram数字对象标识符:
10.57967/hf/0274HateBR is the first large-scale expert annotated corpus of Brazilian Instagram comments for hate speech and offensive language detection on the web and social media. The HateBR corpus was collected from Brazilian Instagram comments of politicians and manually annotated by specialists. It is composed of 7,000 documents annotated according to three different layers: a binary classification (offensive versus non-offensive comments), offensiveness-level (highly, moderately, and slightly offensive messages), and nine hate speech groups (xenophobia, racism, homophobia, sexism, religious intolerance, partyism, apology for the dictatorship, antisemitism, and fatphobia). Each comment was annotated by three different annotators and achieved high inter-annotator agreement. Furthermore, baseline experiments were implemented reaching 85% of F1-score outperforming the current literature models for the Portuguese language. Accordingly, we hope that the proposed expertly annotated corpus may foster research on hate speech and offensive language detection in the Natural Language Processing area.
Relevant Links:
Hate Speech Detection
Portuguese
{'instagram_comments': 'Hipocrita!!', 'offensive_language': True, 'offensiveness_levels': 2, 'antisemitism': False, 'apology_for_the_dictatorship': False, 'fatphobia': False, 'homophobia': False, 'partyism': False, 'racism': False, 'religious_intolerance': False, 'sexism': False, 'xenophobia': False, 'offensive_&_non-hate_speech': True, 'non-offensive': False, 'specialist_1_hate_speech': False, 'specialist_2_hate_speech': False, 'specialist_3_hate_speech': False }
The original authors of the dataset did not propose a standard data split. To address this, we use the multi-label data stratification technique implemented at the scikit-multilearn library to propose a train-validation-test split. This method considers all classes for hate speech in the data and attempts to balance the representation of each class in the split.
name | train | validation | test |
---|---|---|---|
hatebr | 4480 | 1120 | 1400 |
Please refer to the HateBR paper for a discussion of biases.
The HateBR dataset, including all its components, is provided strictly for academic and research purposes. The use of the dataset for any commercial or non-academic purpose is expressly prohibited without the prior written consent of SINCH .
@inproceedings{vargas2022hatebr, title={HateBR: A Large Expert Annotated Corpus of Brazilian Instagram Comments for Offensive Language and Hate Speech Detection}, author={Vargas, Francielle and Carvalho, Isabelle and de G{\'o}es, Fabiana Rodrigues and Pardo, Thiago and Benevenuto, Fabr{\'\i}cio}, booktitle={Proceedings of the Thirteenth Language Resources and Evaluation Conference}, pages={7174--7183}, year={2022} }
Thanks to @ruanchaves for adding this dataset.