Dataset:
ethos
ETHOS: onlinE haTe speecH detectiOn dataSet. This repository contains Ethos, a dataset for hate speech detection on social media platforms. There are two variations of the dataset: binary and multilabel.
Ethos /ˈiːθɒs/ is a Greek word meaning “character” that is used to describe the guiding beliefs or ideals that characterize a community, nation, or ideology. The Greeks also used this word to refer to the power of music to influence emotions, behaviors, and even morals.
The text in the dataset is in English.
A typical data point in the binary version is a comment with a text field containing the comment and a label indicating whether the comment contains hate speech (1, hate-speech) or not (0, non-hate-speech). In the multilabel version, additional labels appear: violence (whether the comment incites violence (1) or not (0)), directed_vs_generalized (whether it is directed at a person (1) or a group (0)), and six labels for the category of hate speech: gender, race, national_origin, disability, religion, and sexual_orientation.
An example from the binary version that is offensive but does not contain hate speech content:
{'text': 'What the fuck stupid people !!!', 'label': '0' }
An example from the multi-label version, which contains hate speech content towards women (gender):
{'text': "You should know women's sports are a joke", 'violence': 0, 'directed_vs_generalized': 0, 'gender': 1, 'race': 0, 'national_origin': 0, 'disability': 0, 'religion': 0, 'sexual_orientation': 0 }
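The multilabel record layout can be made explicit with a small typed wrapper. The following is a minimal sketch, not part of the dataset or its tooling: the `MultilabelComment` class and the `parse_multilabel` helper are hypothetical names introduced for illustration, and the field names are taken from the multilabel example above.

```python
from dataclasses import dataclass
from typing import Dict, List, Union

# The six hate speech category fields named in the dataset description.
CATEGORY_FIELDS = (
    "gender", "race", "national_origin",
    "disability", "religion", "sexual_orientation",
)


@dataclass(frozen=True)
class MultilabelComment:  # hypothetical helper type, not shipped with the dataset
    text: str
    violence: int
    directed_vs_generalized: int
    gender: int
    race: int
    national_origin: int
    disability: int
    religion: int
    sexual_orientation: int

    def categories(self) -> List[str]:
        """Return the names of the category labels that are set to 1."""
        return [name for name in CATEGORY_FIELDS if getattr(self, name) == 1]


def parse_multilabel(record: Dict[str, Union[str, int]]) -> MultilabelComment:
    """Check that every label is 0 or 1, then build a typed record."""
    flags = {key: int(value) for key, value in record.items() if key != "text"}
    for key, value in flags.items():
        if value not in (0, 1):
            raise ValueError(f"{key} must be 0 or 1, got {value}")
    return MultilabelComment(text=str(record["text"]), **flags)


# The multilabel example from the text above.
example = {
    "text": "You should know women's sports are a joke",
    "violence": 0, "directed_vs_generalized": 0, "gender": 1, "race": 0,
    "national_origin": 0, "disability": 0, "religion": 0, "sexual_orientation": 0,
}
comment = parse_multilabel(example)
print(comment.categories())  # ['gender']
```

A record with a missing or unexpected field fails at construction time with a `TypeError`, which is one way to catch schema drift early when mixing Ethos with other datasets.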
Ethos Binary:
Ethos Multilabel:
The data is split into binary and multilabel. Multilabel is a subset of the binary version.
Version | Instances | Labels |
---|---|---|
binary | 998 | 1 |
multilabel | 433 | 8 |
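The subset relation between the two versions can be checked once both are loaded as lists of records. The sketch below is illustrative only: `multilabel_is_subset` is a hypothetical helper, matching comments by their exact `text` value is an assumption, and the toy rows are placeholders rather than real dataset entries. Only the instance counts come from the table above.

```python
from typing import Dict, Iterable

# Instance counts as reported in the table above.
SPLIT_SIZES = {"binary": 998, "multilabel": 433}


def multilabel_is_subset(
    binary_rows: Iterable[Dict[str, object]],
    multilabel_rows: Iterable[Dict[str, object]],
) -> bool:
    """Return True if every multilabel comment text also occurs in the binary rows.

    Assumption: the same comment has identical text in both versions.
    """
    binary_texts = {row["text"] for row in binary_rows}
    return all(row["text"] in binary_texts for row in multilabel_rows)


# Toy placeholder rows, NOT taken from the dataset.
toy_binary = [
    {"text": "comment a", "label": 0},
    {"text": "comment b", "label": 1},
    {"text": "comment c", "label": 1},
]
toy_multilabel = [
    {"text": "comment b", "violence": 0},
    {"text": "comment c", "violence": 1},
]

print(multilabel_is_subset(toy_binary, toy_multilabel))  # True

# Share of the binary version that also appears in the multilabel version.
share = 100 * SPLIT_SIZES["multilabel"] / SPLIT_SIZES["binary"]
print(f"{share:.1f}%")  # 43.4%
```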
The dataset was built by gathering online comments from YouTube videos and Reddit, drawn from videos and subreddits that may attract hate speech content.
The initial data we used came from the hatebusters platform (Original data used), but they were not included in this dataset.
Who are the source language producers?
The language producers are users of Reddit and YouTube. More information can be found in this paper: ETHOS: an Online Hate Speech Detection Dataset
The annotation process is detailed in the third section of this paper: ETHOS: an Online Hate Speech Detection Dataset
Who are the annotators?
Originally annotated by Ioannis Mollas and validated through the Figure8 platform (Appen).
No personal or sensitive information is included in the dataset.
This dataset supports the development of automated hate speech detection tools. Such tools can have a great impact on preventing social issues.
This dataset tries to be unbiased towards its classes and labels.
The dataset is relatively small and should be used in combination with larger datasets.
The dataset was initially created by the Intelligent Systems Lab.
The dataset is licensed under GNU GPLv3.
@misc{mollas2020ethos,
  title={ETHOS: an Online Hate Speech Detection Dataset},
  author={Ioannis Mollas and Zoe Chrysopoulou and Stamatis Karlos and Grigorios Tsoumakas},
  year={2020},
  eprint={2006.08328},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
Thanks to @iamollas for adding this dataset.