Dataset: OxAISH-AL-LLM/wiki_toxic
Task: Text Classification
Language: en
Multilinguality: monolingual
Size: 100K<n<1M
Language creators: found
Annotation creators: crowdsourced
Source datasets: extended|other
License: cc0-1.0

The Wiki Toxic dataset is a modified, cleaned version of the dataset used in the Kaggle Toxic Comment Classification challenge from 2017/18. The dataset contains comments collected from Wikipedia forums and classifies them into two categories, toxic and non-toxic.
The Kaggle dataset was cleaned using the included clean.py file.
The sole language used in the dataset is English.
For each data point, there is an id, the comment_text itself, and a label (0 for non-toxic, 1 for toxic).
```python
{'id': 'a123a58f610cffbc', 'comment_text': '"This article SUCKS. It may be poorly written, poorly formatted, or full of pointless crap that no one cares about, and probably all of the above. If it can be rewritten into something less horrible, please, for the love of God, do so, before the vacuum caused by its utter lack of quality drags the rest of Wikipedia down into a bottomless pit of mediocrity."', 'label': 1}
```
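To see these fields in practice, the snippet below is a minimal sketch of loading the dataset with the Hugging Face `datasets` library and inspecting a single record; it assumes the `datasets` package is installed.

```python
from datasets import load_dataset

# Download all splits of the dataset from the Hugging Face Hub.
dataset = load_dataset("OxAISH-AL-LLM/wiki_toxic")

# Each record carries an `id`, the raw `comment_text`, and a `label`
# (0 = non-toxic, 1 = toxic).
example = dataset["train"][0]
print(example["id"], example["label"])
print(example["comment_text"][:200])
```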
The Wiki Toxic dataset has three splits: train, validation, and test. The statistics for each split are below:

| Dataset Split | Number of data points in split |
| --- | --- |
| Train | 127,656 |
| Validation | 31,915 |
| Test | 63,978 |
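The split sizes in the table can be checked directly from the loaded dataset; the sketch below again assumes the `datasets` package is available.

```python
from datasets import load_dataset

dataset = load_dataset("OxAISH-AL-LLM/wiki_toxic")

# Print the number of rows in each split; the counts should match the
# table above (127,656 / 31,915 / 63,978).
for split_name, split in dataset.items():
    print(f"{split_name}: {split.num_rows} rows")
```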
Who are the source language producers? [More Information Needed]
Who are the annotators? [More Information Needed]
Thanks to @github-username for adding this dataset.