Dataset: jigsaw_toxicity_pred
Task: text classification
Language: en
Multilinguality: monolingual
Size: 100K<n<1M
Language creators: other
Annotation creators: crowdsourced
Source datasets: original
License: cc0-1.0

Discussing things you care about can be difficult. The threat of abuse and harassment online means that many people stop expressing themselves and give up on seeking different opinions. Platforms struggle to facilitate conversations effectively, leading many communities to limit or completely shut down user comments. This dataset consists of a large number of Wikipedia comments which have been labeled by human raters for toxic behavior.
The dataset supports multi-label classification.
The comments are in English.
A data point consists of an id, the comment text, and six binary labels, any number of which may apply to the comment: {'id': '02141412314', 'comment_text': 'Sample comment text', 'toxic': 0, 'severe_toxic': 0, 'obscene': 0, 'threat': 0, 'insult': 0, 'identity_hate': 1}
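As a minimal sketch of how such a record might be consumed in a multi-label setting, the snippet below turns the six label fields into a fixed-order 0/1 vector. The helper names `to_multi_hot` and `active_labels` are hypothetical and not part of the dataset or any library. The label order simply follows the sample record above.

```python
from typing import Dict, List

# Label columns, in the order they appear in the sample record.
LABEL_COLUMNS: List[str] = [
    "toxic",
    "severe_toxic",
    "obscene",
    "threat",
    "insult",
    "identity_hate",
]


def to_multi_hot(record: Dict[str, object]) -> List[int]:
    """Return the record's labels as a fixed-order 0/1 vector (hypothetical helper)."""
    return [int(record[name]) for name in LABEL_COLUMNS]


def active_labels(record: Dict[str, object]) -> List[str]:
    """Return the names of the labels set to 1 (hypothetical helper)."""
    return [name for name in LABEL_COLUMNS if int(record[name]) == 1]


# The sample data point from the text.
sample = {
    "id": "02141412314",
    "comment_text": "Sample comment text",
    "toxic": 0,
    "severe_toxic": 0,
    "obscene": 0,
    "threat": 0,
    "insult": 0,
    "identity_hate": 1,
}

print(to_multi_hot(sample))   # [0, 0, 0, 0, 0, 1]
print(active_labels(sample))  # ['identity_hate']
```

Because the labels are independent, a comment can carry several of them at once or none at all, which is why a vector is used here rather than a single class index.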
The data is split into a training set and a test set.
The dataset was created to help in efforts to identify and curb instances of toxicity online.
The dataset is a collection of Wikipedia comments.
Who are the source language producers? [More Information Needed]
[More Information Needed]
Who are the annotators? [More Information Needed]
[More Information Needed]
[More Information Needed]
If a comment contains words associated with swearing, insults, or profanity, it is likely to be classified as toxic regardless of the author's tone or intent (for example, when the comment is humorous or self-deprecating). This could introduce bias against already vulnerable minority groups.
[More Information Needed]
[More Information Needed]
The "Toxic Comment Classification" dataset is released under [CC0], with the underlying comment text being governed by Wikipedia's [CC-SA-3.0].
No citation information.
Thanks to @Tigrex161 for adding this dataset.