Dataset:

tomekkorbak/pile-detoxify

Tasks:

Text Classification

task_categories:other

Sub-tasks:

acceptability-classification hate-speech-detection text-scoring

Languages:

Multilinguality:

monolingual

Size:

1M<n<10M

Language creators:

found

Annotation creators:

machine-generated

Source datasets:

extended|the_pile

ArXiv:

arxiv:1907.11692 arxiv:2101.00027

Other:

toxicity pretraining-with-human-feedback

License:

mit

Dataset Card for pile-detoxify

Dataset Summary

This dataset contains text from The Pile, annotated with the toxicity of each sentence. Each document (a row in the dataset) is segmented into sentences, and each sentence is given a score: the toxicity predicted by Detoxify.

Supported Tasks and Leaderboards

[More Information Needed]

Languages

This dataset is taken from The Pile, which is English text.

Dataset Structure

Data Instances

1,949,977 documents (rows)

Data Fields

texts (sequence): a list of the sentences in the document, segmented using spaCy
meta (dict): the section of The Pile from which it originated
scores (sequence): a score for each sentence in the texts column indicating the toxicity predicted by Detoxify
avg_score (float64): the average of the scores listed in the scores column
num_sents (int64): the number of sentences (and scores) in that document
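
As a rough illustration of how these fields relate to each other, the sketch below builds a made-up record with the same field names and checks that `avg_score` and `num_sents` agree with `scores` and `texts`. The sentences, the score values, the `pile_set_name` key inside `meta`, and the helper name `check_record` are assumptions for illustration only; they are not taken from the dataset.

```python
import math

# A made-up record using the field names listed above (all values are illustrative).
record = {
    "texts": ["First sentence.", "Second sentence.", "Third sentence."],
    "meta": {"pile_set_name": "PubMed Central"},  # assumed key name
    "scores": [0.001, 0.02, 0.0005],
    "avg_score": 0.007166666666666667,
    "num_sents": 3,
}


def check_record(rec: dict, tol: float = 1e-6) -> bool:
    """Return True if the derived fields agree with the per-sentence fields."""
    same_length = len(rec["texts"]) == len(rec["scores"]) == rec["num_sents"]
    if not same_length or rec["num_sents"] == 0:
        return False
    mean_score = sum(rec["scores"]) / len(rec["scores"])
    return math.isclose(mean_score, rec["avg_score"], abs_tol=tol)
```

To run the same check on real rows, something like `datasets.load_dataset("tomekkorbak/pile-detoxify", split="train", streaming=True)` should yield dictionaries of this shape, assuming the single split is named `train`.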

Data Splits

Training set only

Dataset Creation

Curation Rationale

This is labeled text from The Pile, a large dataset of text in English. The text is scored for toxicity so that generative language models can be trained to avoid generating toxic text.
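
One simple way such scores could be used is to filter documents before training. The sketch below is only an illustration of that idea, not the procedure used by the dataset authors; the function names and the default threshold of `0.01` are hypothetical.

```python
from typing import Iterable, Iterator


def keep_low_toxicity(rows: Iterable[dict], max_avg_score: float = 0.01) -> Iterator[dict]:
    """Yield only the documents whose average toxicity is at or below the threshold."""
    for row in rows:
        if row["avg_score"] <= max_avg_score:
            yield row


def max_sentence_score(row: dict) -> float:
    """Return the highest per-sentence toxicity score in a document."""
    return max(row["scores"])
```

Filtering on `avg_score` alone can let through a long document that contains a single highly toxic sentence, which is why a per-sentence maximum is included as an alternative criterion.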

Source Data

Initial Data Collection and Normalization

This is labeled text from The Pile.

Who are the source language producers?

Please see The Pile for the source of the dataset.

Annotations

Annotation process

Each sentence was scored using Detoxify, a toxic comment classifier. We used the unbiased model, which is based on the 124M-parameter RoBERTa and was trained on the Jigsaw Unintended Bias in Toxicity Classification dataset.
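
The annotation step described above can be sketched roughly as follows. This is a reconstruction under assumptions, not the authors' script: the spaCy model name `en_core_web_sm`, the helper names, and the choice of `0.0` as the average for an empty document are all hypothetical. The splitter and scorer are passed in as plain callables so that the core function runs without either library installed.

```python
from typing import Callable, List


def annotate_document(
    text: str,
    split_sentences: Callable[[str], List[str]],
    score_sentences: Callable[[List[str]], List[float]],
) -> dict:
    """Segment a document and attach a toxicity score to each sentence."""
    texts = split_sentences(text)
    scores = [float(s) for s in score_sentences(texts)] if texts else []
    return {
        "texts": texts,
        "scores": scores,
        # Assumption: an empty document gets an average of 0.0.
        "avg_score": sum(scores) / len(scores) if scores else 0.0,
        "num_sents": len(texts),
    }


def make_spacy_splitter(model_name: str = "en_core_web_sm"):
    """Build a sentence splitter backed by spaCy (the model name is an assumption)."""
    import spacy  # imported lazily so the rest of the sketch runs without it

    nlp = spacy.load(model_name)
    return lambda text: [sent.text for sent in nlp(text).sents]


def make_detoxify_scorer():
    """Build a scorer that returns Detoxify's 'toxicity' output for each sentence."""
    from detoxify import Detoxify  # imported lazily; fetches model weights on first use

    model = Detoxify("unbiased")
    return lambda sentences: list(model.predict(sentences)["toxicity"])
```

With both libraries installed, `annotate_document(text, make_spacy_splitter(), make_detoxify_scorer())` would produce a dictionary with the `texts`, `scores`, `avg_score`, and `num_sents` fields described under Data Fields.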

Who are the annotators?

Detoxify

Personal and Sensitive Information

This dataset contains all personally identifiable information and toxic text that was originally contained in The Pile.

Considerations for Using the Data

Social Impact of Dataset

This dataset contains examples of toxic text and personally identifiable information. (A version of this dataset with personally identifiable information annotated is available here.) Please take care to avoid misusing the toxic text or putting anybody in danger by publicizing their information. This dataset is intended for research purposes only. We cannot guarantee that all toxic text has been detected, and we cannot guarantee that models trained using it will avoid generating toxic text. We do not recommend deploying models trained on this data.

Discussion of Biases

This dataset contains all biases from The Pile discussed in their paper: https://arxiv.org/abs/2101.00027

Other Known Limitations

The toxic text in this dataset was detected using imperfect automated detection methods. We cannot guarantee that the labels are 100% accurate.

Additional Information

Dataset Curators

The Pile

Licensing Information

From The Pile: PubMed Central: MIT License

Citation Information

Paper information to be added

Contributions

The Pile

Author:

tomekkorbak

Dataset size:

6.98 GB