Dataset: tomekkorbak/pile-detoxify
Language: en
Monolinguality: monolingual
Size: 1M<n<10M
Language creators: found
Annotation creators: machine-generated
Source datasets: extended|the_pile
License: mit

This dataset contains text from The Pile, annotated with the toxicity of each sentence. Each document (row in the dataset) is segmented into sentences, and each sentence is given a score: the toxicity predicted by Detoxify.
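The per-document layout described above can be sketched as follows. This is a minimal illustration only: the field names `sentences` and `scores` and the example values are assumptions for demonstration, not the dataset's actual schema.

```python
# Illustrative sketch of a per-document record: a list of sentences and a
# parallel list of Detoxify toxicity scores (one score in [0, 1] per sentence).
# Field names and values are hypothetical, not the dataset's real schema.
row = {
    "sentences": [
        "This is a perfectly friendly sentence.",
        "This sentence might be flagged as toxic.",
    ],
    "scores": [0.0007, 0.0312],
}

def max_toxicity(record):
    """Return the highest sentence-level toxicity score in a document."""
    return max(record["scores"])

# The two lists are parallel: one score per sentence.
assert len(row["sentences"]) == len(row["scores"])
print(max_toxicity(row))  # -> 0.0312
```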
[More Information Needed]
This dataset is taken from The Pile, which is English text.
1,949,977 rows
Training set only
This is labeled text from The Pile, a large dataset of English text. The text is scored for toxicity so that generative language models can be trained to avoid generating toxic text.
This is labeled text from The Pile.
Who are the source language producers? Please see The Pile for the source of the dataset.
Each sentence was scored using Detoxify, a toxic comment classifier. We used the unbiased model, which is based on the 124M-parameter RoBERTa and trained on the Jigsaw Unintended Bias in Toxicity Classification dataset.
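Given per-sentence scores like the ones this classifier produces, a downstream consumer might filter out high-toxicity sentences before training. The sketch below shows one such filter; the 0.01 threshold, the helper name, and the example values are illustrative assumptions, not part of the dataset.

```python
# Keep only sentences whose toxicity score falls below a chosen threshold.
# The threshold value and the record layout here are illustrative assumptions,
# not values prescribed by the dataset.
TOXICITY_THRESHOLD = 0.01

def keep_low_toxicity(sentences, scores, threshold=TOXICITY_THRESHOLD):
    """Return the sentences whose toxicity score is below the threshold."""
    return [s for s, score in zip(sentences, scores) if score < threshold]

sentences = ["a benign sentence", "a borderline sentence", "a flagged sentence"]
scores = [0.0004, 0.0099, 0.4200]
print(keep_low_toxicity(sentences, scores))  # -> first two sentences survive
```

Choosing the threshold trades off corpus size against residual toxicity; any value is a policy decision for the model trainer, not a property of the data.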
Who are the annotators? This dataset contains all personally identifiable information and toxic text that was originally contained in The Pile.
This dataset contains examples of toxic text and personally identifiable information. (A version of this dataset with personally identifiable information annotated is available here.) Please take care to avoid misusing the toxic text or putting anybody in danger by publicizing their information. This dataset is intended for research purposes only. We cannot guarantee that all toxic text has been detected, and we cannot guarantee that models trained using it will avoid generating toxic text. We do not recommend deploying models trained on this data.
This dataset contains all biases from The Pile discussed in their paper: https://arxiv.org/abs/2101.00027
The toxic text in this dataset was detected using imperfect automated detection methods. We cannot guarantee that the labels are 100% accurate.
From The Pile: PubMed Central: MIT License
Paper information to be added