Dataset:

tomekkorbak/pile-pii-scrubadub

Tasks:

Text classification

task_categories:other

Sub-tasks:

acceptability-classification text-scoring

Languages:

Multilinguality:

monolingual

Size:

1M<n<10M

Language creators:

found

Annotation creators:

machine-generated

Source datasets:

extended|the_pile

ArXiv:

arxiv:2101.00027

Other:

pii personal identifiable

License:

mit

Dataset Card for pile-pii-scrubadub

Dataset Summary

This dataset contains text from The Pile, annotated based on the personally identifiable information (PII) in each sentence. Each document (row in the dataset) is segmented into sentences, and each sentence is given a score: the percentage of words in it that are classified as PII by Scrubadub.

Supported Tasks and Leaderboards

[More Information Needed]

Languages

This dataset is taken from The Pile, which is English text.

Dataset Structure

Data Instances

1,949,977 instances (one document per row)

Data Fields

texts (sequence): a list of the sentences in the document (segmented using spaCy)
meta (dict): the section of The Pile from which it originated
scores (sequence): a score for each sentence in the texts column indicating the percent of words that are detected as PII by Scrubadub
avg_score (float64): the average of the scores listed in the scores column
num_sents (int64): the number of sentences (and scores) in that document
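
To make the relationship between these fields concrete, here is a minimal sketch in Python. The record is invented for illustration, and the `pile_set_name` key inside `meta` is an assumption about how the originating section is stored. The helper `summarize` is hypothetical and not part of the dataset's tooling; it only shows how `avg_score` and `num_sents` would follow from `scores`.

```python
from statistics import mean

# Hypothetical record laid out like the fields above; all values are invented.
record = {
    "texts": ["Contact me at jane@example.com.", "The weather was fine."],
    "meta": {"pile_set_name": "Pile-CC"},  # assumed key name
    "scores": [0.25, 0.0],
}


def summarize(record):
    """Derive avg_score and num_sents from a record's per-sentence scores."""
    scores = record["scores"]
    if len(scores) != len(record["texts"]):
        raise ValueError("expected exactly one score per sentence")
    return {
        "avg_score": mean(scores) if scores else 0.0,
        "num_sents": len(scores),
    }


summary = summarize(record)
print(summary)
```

A check like this can also serve as a sanity test when reading rows, since `scores` and `texts` are described as parallel sequences.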

Data Splits

Training set only

Dataset Creation

Curation Rationale

This is labeled text from The Pile, a large dataset of text in English. The PII is labeled so that generative language models can be trained to avoid generating PII.

Source Data

Initial Data Collection and Normalization

This is labeled text from The Pile.

Who are the source language producers?

Please see The Pile for the source of the dataset.

Annotations

Annotation process

For each sentence, Scrubadub was used to detect:

email addresses
addresses and postal codes
phone numbers
credit card numbers
US social security numbers
vehicle plate numbers
dates of birth
URLs
login credentials
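
The per-sentence score can be sketched roughly as follows. This is not the code used to build the dataset: a single email regex stands in for Scrubadub's detectors, words are split on whitespace, and the score is returned as a fraction between 0 and 1. All three choices are assumptions made for illustration, as is the helper name `pii_score`.

```python
import re

# Stand-in detector: a simple email pattern. Scrubadub's detectors cover far
# more PII types; this regex is only an assumption for illustration.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pii_score(sentence):
    """Return the fraction of whitespace-separated words containing detected PII."""
    words = sentence.split()
    if not words:
        return 0.0
    flagged = sum(1 for word in words if EMAIL_RE.search(word))
    return flagged / len(words)


sentences = ["Write to jane.doe@example.com today.", "No personal data here."]
scores = [pii_score(s) for s in sentences]
print(scores)
```

Swapping in a real detector would only change how words are flagged; the ratio of flagged words to total words stays the same.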

Who are the annotators?

Scrubadub

Personal and Sensitive Information

This dataset contains all PII that was originally contained in The Pile, with all detected PII annotated.

Considerations for Using the Data

Social Impact of Dataset

This dataset contains examples of real PII (conveniently annotated in the text!). Please take care to avoid misusing it or putting anybody in danger by publicizing their information. This dataset is intended for research purposes only. We cannot guarantee that all PII has been detected, and we cannot guarantee that models trained using it will avoid generating PII. We do not recommend deploying models trained on this data.

Discussion of Biases

This dataset contains all biases from The Pile discussed in their paper: https://arxiv.org/abs/2101.00027

Other Known Limitations

The PII in this dataset was detected using imperfect automated detection methods. We cannot guarantee that the labels are 100% accurate.

Additional Information

Dataset Curators

The Pile

Licensing Information

From The Pile: PubMed Central: MIT License

Citation Information

Paper information to be added

Contributions

The Pile

Author:

tomekkorbak

Dataset size:

6.51 GB