Dataset:
pile-of-law/eoir_privacy
This dataset mimics privacy standards for EOIR decisions. It is meant to help learn contextual data sanitization rules to anonymize potentially sensitive contexts in crawled language data.
Languages: English
{ "text" : masked paragraph, "label" : whether to use a pseudonym in filling masks }
Data splits: train 75%, validation 25%
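A minimal sketch of loading the dataset with the Hugging Face `datasets` library and inspecting one record, assuming the default configuration; the comments describe the expected fields rather than actual corpus contents:

```python
from datasets import load_dataset

# Load both splits from the Hugging Face Hub.
dataset = load_dataset("pile-of-law/eoir_privacy")

example = dataset["train"][0]
print(example["text"])   # paragraph with party references replaced by [MASK]
print(example["label"])  # whether a pseudonym should fill the masks
```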
We scrape EOIR decisions, filter at the paragraph level, and replace any references to the respondent, the applicant, or to names with [MASK] tokens. We then determine whether the case used a pseudonym.
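A minimal sketch of the paragraph-level masking described above; the regular expression and the example names are illustrative assumptions, not the authors' actual pipeline:

```python
import re

# Illustrative pattern for party references; the real pipeline's rules
# (and its treatment of personal names) are assumptions here.
PARTY_PATTERN = re.compile(r"\b(?:the\s+)?(?:respondent|applicant)\b", re.IGNORECASE)

def mask_paragraph(paragraph: str, names: list[str]) -> str:
    """Replace party references and known names with [MASK] tokens."""
    masked = PARTY_PATTERN.sub("[MASK]", paragraph)
    for name in names:
        masked = masked.replace(name, "[MASK]")
    return masked

print(mask_paragraph("The respondent, John Doe, entered the United States in 2010.",
                     names=["John Doe"]))
# -> "[MASK], [MASK], entered the United States in 2010."
```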
Who are the source language producers? The U.S. Executive Office for Immigration Review.
Annotations (i.e., pseudonymity decisions) were made by the EOIR courts themselves. We use regular expressions to identify whether a pseudonym was used to refer to the applicant/respondent (see the sketch below).
Who are the annotators? EOIR judges.
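A minimal sketch of regex-based pseudonym detection; the pattern below (Doe-style placeholder names and initial strings such as "J-R-", which are common in EOIR decisions) is an illustrative assumption, not the exact expression used by the authors:

```python
import re

# Illustrative heuristics for pseudonym usage.
PSEUDONYM_PATTERN = re.compile(
    r"\b(?:John|Jane)\s+Doe\b"      # Doe-style placeholder names
    r"|\b(?:[A-Z]-){2,}"            # initial strings such as "A-B-C-"
)

def used_pseudonym(decision_text: str) -> bool:
    """Heuristically flag whether a decision refers to a party by pseudonym."""
    return PSEUDONYM_PATTERN.search(decision_text) is not None

print(used_pseudonym("In re J-R-, the respondent seeks asylum."))  # True
print(used_pseudonym("The respondent, John Smith, appeals."))      # False
```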
Sensitive contexts may be involved. The courts already make a determination about filtering sensitive data before publication, but sensitive topics may nonetheless be discussed.
This dataset is meant for learning contextual privacy rules that help filter private/sensitive data, but it also encodes the biases of the courts from which the data came. We suggest looking beyond this dataset when learning broader contextual privacy rules.
Data may be biased due to its origin in U.S. immigration courts.
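As a concrete illustration of the intended use, the task can be framed as predicting the pseudonymity label from the masked text. The bag-of-words baseline below is an assumption for illustration only, not a model the authors propose:

```python
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

dataset = load_dataset("pile-of-law/eoir_privacy")
train, val = dataset["train"], dataset["validation"]

# Bag-of-words baseline: predict the pseudonymity label from the masked text.
vectorizer = TfidfVectorizer(max_features=50_000)
X_train = vectorizer.fit_transform(train["text"])
X_val = vectorizer.transform(val["text"])

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, train["label"])

print("validation accuracy:", accuracy_score(val["label"], clf.predict(X_val)))
```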
License: CC-BY-NC
@misc{hendersonkrass2022pileoflaw,
  url = {https://arxiv.org/abs/2207.00220},
  author = {Henderson, Peter and Krass, Mark S. and Zheng, Lucia and Guha, Neel and Manning, Christopher D. and Jurafsky, Dan and Ho, Daniel E.},
  title = {Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset},
  publisher = {arXiv},
  year = {2022}
}