Dataset: tomekkorbak/pile-pii-scrubadub
Language: en
Multilinguality: monolingual
Size: 1M<n<10M
Language creators: found
Annotation creators: machine-generated
Source datasets: extended|the_pile
arXiv: arxiv:2101.00027
License: mit
This dataset contains text from The Pile, annotated for the personally identifiable information (PII) in each sentence. Each document (a row in the dataset) is segmented into sentences, and each sentence is given a score: the percentage of words in it that Scrubadub classifies as PII.
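The per-sentence score described above can be sketched as follows. This is a minimal illustration, not the pipeline used to build the dataset: the real annotation used Scrubadub, while the `is_pii` helper here is a hypothetical stand-in that only flags email-like words. It is also an assumption that the score is a fraction between 0 and 1 rather than a value between 0 and 100.

```python
import re

# Hypothetical stand-in detector. The dataset was annotated with Scrubadub;
# this regex only flags words that look like email addresses.
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")


def is_pii(word: str) -> bool:
    """Return True if the word (minus surrounding punctuation) looks like an email."""
    return bool(EMAIL_RE.match(word.strip(".,;:!?()[]\"'")))


def sentence_pii_score(sentence: str) -> float:
    """Fraction of whitespace-separated words in the sentence flagged as PII."""
    words = sentence.split()
    if not words:
        return 0.0
    flagged = sum(is_pii(w) for w in words)
    return flagged / len(words)


def score_document(sentences):
    """Score each sentence of an already-segmented document."""
    return [sentence_pii_score(s) for s in sentences]
```

For example, `score_document(["Contact me at jane@example.com today.", "No PII here."])` scores the first sentence by counting one flagged word out of five and gives the second sentence a score of zero.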
[More Information Needed]
The text is taken from The Pile, which is in English.
The dataset contains 1949977 documents (rows), in a training set only.
This is labeled text from The Pile, a large dataset of English text. The PII is labeled so that generative language models can be trained to avoid generating PII.
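One way the labels might be used, shown here only as a hedged sketch, is to drop sentences whose PII score exceeds a threshold before training. The card does not say how the authors use the scores; the function name `keep_low_pii` and the default threshold are hypothetical choices for illustration.

```python
def keep_low_pii(sentences, scores, threshold=0.0):
    """Return the sentences whose PII score is at or below the threshold.

    `sentences` and `scores` are parallel lists for one document.
    The threshold is an illustrative assumption, not a documented setting.
    """
    if len(sentences) != len(scores):
        raise ValueError("sentences and scores must have the same length")
    return [s for s, score in zip(sentences, scores) if score <= threshold]
```

With the default threshold of 0.0, only sentences in which no word was flagged are kept. Because detection is imperfect, sentences that pass this filter may still contain PII.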
This is labeled text from The Pile.
Who are the source language producers?
Please see The Pile for the source of the dataset.
For each sentence, Scrubadub was used to detect:
This dataset contains all PII that was originally contained in The Pile, with all detected PII annotated.
This dataset contains examples of real PII (conveniently annotated in the text!). Please take care to avoid misusing it or putting anybody in danger by publicizing their information. This dataset is intended for research purposes only. We cannot guarantee that all PII has been detected, and we cannot guarantee that models trained using it will avoid generating PII. We do not recommend deploying models trained on this data.
This dataset contains all biases from The Pile discussed in their paper: https://arxiv.org/abs/2101.00027
The PII in this dataset was detected using imperfect automated detection methods. We cannot guarantee that the labels are 100% accurate.
From The Pile: PubMed Central: MIT License
Paper information to be added