Dataset: tomekkorbak/pile-pii-scrubadub

Dataset Card for pile-pii-scrubadub

Dataset Summary

This dataset contains text from The Pile, annotated based on the personal identifiable information (PII) in each sentence. Each document (row in the dataset) is segmented into sentences, and each sentence is given a score: the percentage of words in it that are classified as PII by Scrubadub.
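The per-sentence score described above can be sketched roughly as follows. This is an illustration only, not the code used to build the dataset: the regex detectors are hypothetical stand-ins for Scrubadub, whitespace splitting is an assumed definition of "word", and the score is assumed to be a fraction between 0 and 1.

```python
import re

# Hypothetical stand-in detectors (NOT Scrubadub): simplified regexes
# that only recognise e-mail addresses and http(s) URLs.
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")
URL_RE = re.compile(r"^https?://\S+$")


def is_pii(word: str) -> bool:
    """Return True if the word looks like PII to the stand-in detectors."""
    token = word.strip(".,;:!?()[]<>\"'")  # drop surrounding punctuation
    return bool(EMAIL_RE.match(token) or URL_RE.match(token))


def sentence_score(sentence: str) -> float:
    """Fraction of whitespace-separated words flagged as PII (assumed 0..1 scale)."""
    words = sentence.split()
    if not words:
        return 0.0
    return sum(is_pii(w) for w in words) / len(words)


print(sentence_score("Contact me at jane.doe@example.com today"))  # 1 of 5 words flagged
```

A sentence with no detected PII scores 0.0 under this sketch, and a sentence consisting only of flagged words scores 1.0.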

Supported Tasks and Leaderboards

[More Information Needed]

Languages

This dataset is taken from The Pile, which is English text.

Dataset Structure

Data Instances

The dataset contains 1,949,977 instances.

Data Fields

  • texts (sequence): a list of the sentences in the document (segmented using spaCy)
  • meta (dict): the section of The Pile from which it originated
  • scores (sequence): a score for each sentence in the texts column indicating the percent of words that are detected as PII by Scrubadub
  • avg_score (float64): the average of the scores listed in the scores column
  • num_sents (int64): the number of sentences (and scores) in that document
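How the fields relate to one another can be sketched as below. The helper `build_record` and the example values are hypothetical, and the `pile_set_name` key inside `meta` is an assumption based on The Pile's own metadata convention; the sketch only illustrates that avg_score is the mean of scores and num_sents is their count.

```python
from statistics import mean


def build_record(texts, scores, meta):
    """Assemble one row with the five fields listed above (hypothetical helper)."""
    if len(texts) != len(scores):
        raise ValueError("expected exactly one score per sentence")
    return {
        "texts": list(texts),
        "meta": dict(meta),
        "scores": list(scores),
        # avg_score is the average of the per-sentence scores
        "avg_score": mean(scores) if scores else 0.0,
        # num_sents counts the sentences (and therefore the scores)
        "num_sents": len(scores),
    }


# Made-up example values, for illustration only.
record = build_record(
    texts=["First sentence.", "Mail me at someone@example.com."],
    scores=[0.0, 0.5],
    meta={"pile_set_name": "PubMed Central"},  # key name is an assumption
)
print(record["avg_score"], record["num_sents"])
```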

Data Splits

Training set only

Dataset Creation

Curation Rationale

This is labeled text from The Pile, a large dataset of text in English. The PII is labeled so that generative language models can be trained to avoid generating PII.
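One way the labels might be used when preparing training data is to separate sentences by score. The following is a minimal sketch under stated assumptions: the function name `partition_sentences`, the default threshold of 0.0, and the toy record are all hypothetical, and this is not a training procedure prescribed by the dataset authors.

```python
def partition_sentences(record, threshold=0.0):
    """Split a row's sentences into (clean, flagged) using the per-sentence scores.

    A sentence is treated as flagged when its score exceeds `threshold`
    (the threshold value is an assumption, not a recommendation).
    """
    clean, flagged = [], []
    for sentence, score in zip(record["texts"], record["scores"]):
        if score > threshold:
            flagged.append(sentence)
        else:
            clean.append(sentence)
    return clean, flagged


# Toy row with made-up values, mirroring the texts/scores fields.
toy_row = {
    "texts": ["The weather is nice.", "Call me at the number below."],
    "scores": [0.0, 0.25],
}
clean, flagged = partition_sentences(toy_row)
print(len(clean), len(flagged))
```

Because automated detection is imperfect, sentences in the clean group may still contain undetected PII.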

Source Data

Initial Data Collection and Normalization

This is labeled text from The Pile.

Who are the source language producers?

Please see The Pile for the source of the dataset.

Annotations

Annotation process

For each sentence, Scrubadub was used to detect:

  • email addresses
  • addresses and postal codes
  • phone numbers
  • credit card numbers
  • US social security numbers
  • vehicle plate numbers
  • dates of birth
  • URLs
  • login credentials

Who are the annotators?

Scrubadub

Personal and Sensitive Information

This dataset contains all PII that was originally contained in The Pile, with all detected PII annotated.

Considerations for Using the Data

Social Impact of Dataset

This dataset contains examples of real PII (conveniently annotated in the text!). Please take care to avoid misusing it or putting anybody in danger by publicizing their information. This dataset is intended for research purposes only. We cannot guarantee that all PII has been detected, and we cannot guarantee that models trained using it will avoid generating PII. We do not recommend deploying models trained on this data.

Discussion of Biases

This dataset inherits all the biases of The Pile that are discussed in its paper: https://arxiv.org/abs/2101.00027

Other Known Limitations

The PII in this dataset was detected using imperfect automated detection methods. We cannot guarantee that the labels are 100% accurate.

Additional Information

Dataset Curators

The Pile

Licensing Information

From The Pile: PubMed Central: MIT License

Citation Information

Paper information to be added

Contributions

The Pile