Dataset Card for pile-pii-scrubadub

Dataset Summary

This dataset contains text from The Pile, annotated with the toxicity of each sentence. Each document (a row in the dataset) is segmented into sentences, and each sentence is assigned a toxicity score predicted by Detoxify.

Supported Tasks and Leaderboards

[More Information Needed]

Languages

This dataset is taken from The Pile, which is English text.

Dataset Structure

Data Instances

1,949,977 instances

Data Fields

  • texts (sequence): a list of the sentences in the document, segmented using spaCy
  • meta (dict): the section of The Pile from which the document originated
  • scores (sequence): a score for each sentence in the texts column indicating the toxicity predicted by Detoxify
  • avg_score (float64): the average of the scores listed in the scores column
  • num_sents (int64): the number of sentences (and scores) in that document
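
As a minimal sketch of how the derived fields relate to the scores column, the snippet below recomputes avg_score and num_sents for one row. The example row, its score values, and the `pile_set_name` key inside meta are illustrative assumptions, not values taken from the dataset, and `summarize_scores` is a hypothetical helper.

```python
from statistics import fmean


def summarize_scores(scores):
    """Return (avg_score, num_sents) for one document's per-sentence scores."""
    num_sents = len(scores)
    # Returning 0.0 for an empty document is an assumption of this sketch;
    # the card does not say how empty documents are handled.
    avg_score = fmean(scores) if scores else 0.0
    return avg_score, num_sents


# Hypothetical row with the fields described above (values are made up).
row = {
    "texts": ["Have a nice day.", "You are an idiot."],
    "meta": {"pile_set_name": "Pile-CC"},  # key name is an assumption
    "scores": [0.001, 0.98],
}

avg_score, num_sents = summarize_scores(row["scores"])
print(avg_score, num_sents)
```

A check like this can be used to confirm that avg_score and num_sents are consistent with scores after any filtering or re-segmentation of a row.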

Data Splits

Training set only

Dataset Creation

Curation Rationale

This is labeled text from The Pile, a large dataset of text in English. The text is scored for toxicity so that generative language models can be trained to avoid generating toxic text.

Source Data

Initial Data Collection and Normalization

This is labeled text from The Pile.

Who are the source language producers?

Please see The Pile for the source of the dataset.

Annotations

Annotation process

Each sentence was scored using Detoxify, a toxic comment classifier. We used the unbiased model, which is based on the 124M-parameter RoBERTa and was trained on the Jigsaw Unintended Bias in Toxicity Classification dataset.
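
The scoring step could look roughly like the sketch below, which assumes the `detoxify` Python package and its `Detoxify("unbiased").predict` call. This is not necessarily the exact script used to build the dataset. The `score_sentences` helper and its `predict` parameter are hypothetical, added so the example can run with a stand-in scorer without downloading the model, and sentence segmentation is assumed to have been done already.

```python
def score_sentences(sentences, predict=None):
    """Return one toxicity score per sentence.

    `predict` is a hypothetical hook: any callable that maps a list of
    strings to a dict containing a "toxicity" list of the same length.
    """
    if predict is None:
        # Imported lazily so the sketch runs without the package installed
        # when a stand-in scorer is passed in.
        from detoxify import Detoxify

        predict = Detoxify("unbiased").predict
    if not sentences:
        return []
    results = predict(list(sentences))
    return [float(score) for score in results["toxicity"]]


def fake_predict(texts):
    """Stand-in scorer for illustration only; not a real classifier."""
    return {"toxicity": [0.9 if "idiot" in t else 0.1 for t in texts]}


scores = score_sentences(["Have a nice day.", "You are an idiot."], fake_predict)
print(scores)
```

Calling `score_sentences(sentences)` without the second argument would load the actual unbiased model, which requires the `detoxify` package and a model download.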

Who are the annotators?

Detoxify

Personal and Sensitive Information

This dataset contains all personally identifiable information and toxic text that was originally contained in The Pile.

Considerations for Using the Data

Social Impact of Dataset

This dataset contains examples of toxic text and personally identifiable information. (A version of this dataset with personally identifiable information annotated is available here.) Please take care to avoid misusing the toxic text or putting anybody in danger by publicizing their information. This dataset is intended for research purposes only. We cannot guarantee that all toxic text has been detected, and we cannot guarantee that models trained using it will avoid generating toxic text. We do not recommend deploying models trained on this data.

Discussion of Biases

This dataset contains all of the biases of The Pile, which are discussed in its paper: https://arxiv.org/abs/2101.00027

Other Known Limitations

The toxic text in this dataset was detected using imperfect automated detection methods. We cannot guarantee that the labels are 100% accurate.

Additional Information

Dataset Curators

The Pile

Licensing Information

From The Pile: PubMed Central: MIT License

Citation Information

Paper information to be added

Contributions

The Pile