Dataset:

ethos

Tasks:

Text classification

Subtasks:

multi-label-classification sentiment-classification

Languages:

Multilinguality:

monolingual

Size:

n<1K

Language creators:

found other

Annotation creators:

crowdsourced expert-generated

Source datasets:

original

Preprint:

arxiv:2006.08328

Other:

Hate Speech Detection

License:

agpl-3.0

Dataset Card for Ethos

Dataset Summary

ETHOS: onlinE haTe speecH detectiOn dataSet. This repository contains a dataset for hate speech detection on social media platforms, called Ethos. There are two variations of the dataset:

Ethos_Dataset_Binary: contains 998 comments, each with a label marking the presence or absence of hate speech. 565 of them do not contain hate speech; the remaining 433 do.
Ethos_Dataset_Multi_Label: contains 8 labels for the 433 comments with hate speech content. These labels are violence (whether the comment incites violence (1) or not (0)), directed_vs_generalized (whether it is directed at a person (1) or a group (0)), and 6 labels for the category of hate speech: gender, race, national_origin, disability, religion and sexual_orientation.

Ethos /ˈiːθɒs/ is a Greek word meaning “character” that is used to describe the guiding beliefs or ideals that characterize a community, nation, or ideology. The Greeks also used this word to refer to the power of music to influence emotions, behaviors, and even morals.

Supported Tasks and Leaderboards

text-classification-other-Hate Speech Detection, sentiment-classification, multi-label-classification: The dataset can be used to train a model for hate speech detection. It can also serve as a benchmark dataset for multi-label classification algorithms.

Languages

The text in the dataset is in English.

Dataset Structure

Data Instances

A typical data point in the binary version consists of a text field holding the comment and a label indicating whether the comment contains hate speech (1 - hate-speech) or not (0 - non-hate-speech). The multilabel version replaces the single label with eight: violence (whether the comment incites violence (1) or not (0)), directed_vs_generalized (whether it is directed at a person (1) or a group (0)), and 6 labels for the category of hate speech: gender, race, national_origin, disability, religion and sexual_orientation.

An example from the binary version that is offensive but does not contain hate speech:

{'text': 'What the fuck stupid people !!!',
 'label': '0'
}

An example from the multi-label version, which contains hate speech content towards women (gender):

{'text': "You should know women's sports are a joke",
 'violence': 0,
 'directed_vs_generalized': 0,
 'gender': 1,
 'race': 0,
 'national_origin': 0,
 'disability': 0,
 'religion': 0,
 'sexual_orientation': 0
}
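
A record in this shape can be reduced to the list of hate speech categories it is tagged with. The following is a minimal sketch, assuming records are plain Python dicts keyed by the field names used in this card; `CATEGORY_FIELDS` and `active_categories` are hypothetical names introduced for illustration and are not part of the dataset or of any library.

```python
# Category fields of the multi-label version, as listed in this card.
CATEGORY_FIELDS = (
    "gender",
    "race",
    "national_origin",
    "disability",
    "religion",
    "sexual_orientation",
)


def active_categories(record):
    """Return the category fields whose value is 1 in a multi-label record."""
    return [name for name in CATEGORY_FIELDS if record.get(name) == 1]


example = {
    "text": "You should know women's sports are a joke",
    "violence": 0,
    "directed_vs_generalized": 0,
    "gender": 1,
    "race": 0,
    "national_origin": 0,
    "disability": 0,
    "religion": 0,
    "sexual_orientation": 0,
}

print(active_categories(example))  # ['gender']
```

Because a comment may target more than one category, the helper returns a list rather than a single value.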

Data Fields

Ethos Binary:

text: a string feature containing the text of the comment.
label: a classification label, with possible values including no_hate_speech, hate_speech.

Ethos Multilabel:

text: a string feature containing the text of the comment.
violence: a classification label, with possible values including not_violent, violent.
directed_vs_generalized: a classification label, with possible values including generalized, directed.
gender: a classification label, with possible values including false, true.
race: a classification label, with possible values including false, true.
national_origin: a classification label, with possible values including false, true.
disability: a classification label, with possible values including false, true.
religion: a classification label, with possible values including false, true.
sexual_orientation: a classification label, with possible values including false, true.
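
The integer values in the data instances can be mapped back to the class names listed above. This is a minimal sketch, assuming that index 0 corresponds to the first listed name and index 1 to the second (consistent with "0 - non-hate-speech, 1 - hate-speech" in the Data Instances section); `CLASS_NAMES`, `BOOLEAN_FIELDS` and `decode` are hypothetical helpers, not part of the dataset.

```python
# Class names per field, in the order this card lists them.
# Assumption: value 0 maps to the first name, value 1 to the second.
CLASS_NAMES = {
    "label": ("no_hate_speech", "hate_speech"),
    "violence": ("not_violent", "violent"),
    "directed_vs_generalized": ("generalized", "directed"),
}

# Category fields whose listed values are false / true.
BOOLEAN_FIELDS = (
    "gender",
    "race",
    "national_origin",
    "disability",
    "religion",
    "sexual_orientation",
)


def decode(record):
    """Replace 0/1 values with readable class names or booleans."""
    decoded = {}
    for field, value in record.items():
        if field in CLASS_NAMES:
            # int() also accepts string values such as '0'.
            decoded[field] = CLASS_NAMES[field][int(value)]
        elif field in BOOLEAN_FIELDS:
            decoded[field] = bool(int(value))
        else:
            decoded[field] = value  # e.g. the comment text
    return decoded


print(decode({"text": "some comment", "label": "0"}))
# {'text': 'some comment', 'label': 'no_hate_speech'}
```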

Data Splits

The data is split into binary and multilabel. Multilabel is a subset of the binary version.

Configuration	Instances	Labels
binary	998	1
multilabel	433	8
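
The counts reported in this card can be checked against each other, and they give the class balance of the binary version. The snippet below is a small arithmetic sketch using only the numbers stated in the card (998 comments, 433 with hate speech, 565 without); the constant and function names are illustrative.

```python
BINARY_TOTAL = 998      # comments in the binary version
HATE_SPEECH = 433       # comments labelled as hate speech (also the multilabel size)
NO_HATE_SPEECH = 565    # comments labelled as not hate speech

# The two classes should add up to the full binary version.
assert HATE_SPEECH + NO_HATE_SPEECH == BINARY_TOTAL


def share(count, total):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


print(share(HATE_SPEECH, BINARY_TOTAL))     # 43.4
print(share(NO_HATE_SPEECH, BINARY_TOTAL))  # 56.6
```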

Dataset Creation

Curation Rationale

The dataset was built by gathering comments from YouTube videos and Reddit, selecting videos and subreddits likely to attract hate speech content.

Source Data

Initial Data Collection and Normalization

The initial data came from the hatebusters platform (Original data used), but they are not included in this dataset.

Who are the source language producers?

The language producers are users of Reddit and YouTube. More information can be found in this paper: ETHOS: an Online Hate Speech Detection Dataset

Annotations

Annotation process

The annotation process is detailed in the third section of this paper: ETHOS: an Online Hate Speech Detection Dataset

Who are the annotators?

Originally annotated by Ioannis Mollas and validated through the Figure8 platform (Appen).

Personal and Sensitive Information

The dataset includes no personal or sensitive information.

Considerations for Using the Data

Social Impact of Dataset

This dataset is intended to support the development of automated hate speech detection tools, which can help prevent social harm.

Discussion of Biases

This dataset tries to be unbiased towards its classes and labels.

Other Known Limitations

The dataset is relatively small and should be used in combination with larger datasets.

Additional Information

Dataset Curators

The dataset was initially created by the Intelligent Systems Lab.

Licensing Information

The licensing status of the dataset is GNU GPLv3.

Citation Information

@misc{mollas2020ethos,
      title={ETHOS: an Online Hate Speech Detection Dataset}, 
      author={Ioannis Mollas and Zoe Chrysopoulou and Stamatis Karlos and Grigorios Tsoumakas},
      year={2020},
      eprint={2006.08328},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contributions

Thanks to @iamollas for adding this dataset.

Author:

Anonymous

Dataset size:

19.53 KB