Dataset:
pile-of-law/eoir_privacy
This dataset mimics privacy standards for EOIR decisions. It is meant to help learn contextual data sanitization rules to anonymize potentially sensitive contexts in crawled language data.
Languages: English
{ "text" : masked paragraph, "label" : whether to use a pseudonym in filling masks }
Data splits: train 75%, validation 25%
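A minimal sketch of loading the dataset with the Hugging Face `datasets` library and inspecting one record, assuming the default configuration; the comments describe the expected fields rather than actual corpus contents:

```python
from datasets import load_dataset

# Load both splits from the Hugging Face Hub.
dataset = load_dataset("pile-of-law/eoir_privacy")

example = dataset["train"][0]
print(example["text"])   # paragraph with party references replaced by [MASK]
print(example["label"])  # whether a pseudonym should fill the masks
```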
We scrape EOIR decisions, filter at the paragraph level, and replace any references to the respondent, the applicant, or to names with [MASK] tokens. We then determine whether the case used a pseudonym.
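A minimal sketch of the paragraph-level masking described above; the regular expression and the example names are illustrative assumptions, not the authors' actual pipeline:

```python
import re

# Illustrative pattern for party references; the real pipeline's rules
# (and its treatment of personal names) are assumptions here.
PARTY_PATTERN = re.compile(r"\b(?:the\s+)?(?:respondent|applicant)\b", re.IGNORECASE)

def mask_paragraph(paragraph: str, names: list[str]) -> str:
    """Replace party references and known names with [MASK] tokens."""
    masked = PARTY_PATTERN.sub("[MASK]", paragraph)
    for name in names:
        masked = masked.replace(name, "[MASK]")
    return masked

print(mask_paragraph("The respondent, John Doe, entered the United States in 2010.",
                     names=["John Doe"]))
# -> "[MASK], [MASK], entered the United States in 2010."
```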
Who are the source language producers? The U.S. Executive Office for Immigration Review.
Annotations (i.e., pseudonymity decisions) were made by the EOIR courts themselves. We use regular expressions to identify whether a pseudonym was used to refer to the applicant/respondent (see the sketch below).
Who are the annotators? EOIR judges.
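A minimal sketch of regex-based pseudonym detection; the pattern below (Doe-style placeholder names and initial strings such as "J-R-", which are common in EOIR decisions) is an illustrative assumption, not the exact expression used by the authors:

```python
import re

# Illustrative heuristics for pseudonym usage.
PSEUDONYM_PATTERN = re.compile(
    r"\b(?:John|Jane)\s+Doe\b"      # Doe-style placeholder names
    r"|\b(?:[A-Z]-){2,}"            # initial strings such as "A-B-C-"
)

def used_pseudonym(decision_text: str) -> bool:
    """Heuristically flag whether a decision refers to a party by pseudonym."""
    return PSEUDONYM_PATTERN.search(decision_text) is not None

print(used_pseudonym("In re J-R-, the respondent seeks asylum."))  # True
print(used_pseudonym("The respondent, John Smith, appeals."))      # False
```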
Sensitive contexts may be involved. The courts already make a determination about filtering sensitive data before publication, but sensitive topics may nonetheless be discussed.
This dataset is meant for learning contextual privacy rules that help filter private/sensitive data, but it also encodes the biases of the courts from which the data came. We suggest looking beyond this dataset when learning broader contextual privacy rules.
Data may be biased due to its origin in U.S. immigration courts.
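As a concrete illustration of the intended use, the task can be framed as predicting the pseudonymity label from the masked text. The bag-of-words baseline below is an assumption for illustration only, not a model the authors propose:

```python
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

dataset = load_dataset("pile-of-law/eoir_privacy")
train, val = dataset["train"], dataset["validation"]

# Bag-of-words baseline: predict the pseudonymity label from the masked text.
vectorizer = TfidfVectorizer(max_features=50_000)
X_train = vectorizer.fit_transform(train["text"])
X_val = vectorizer.transform(val["text"])

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, train["label"])

print("validation accuracy:", accuracy_score(val["label"], clf.predict(X_val)))
```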
License: CC-BY-NC
@misc{hendersonkrass2022pileoflaw,
  url = {https://arxiv.org/abs/2207.00220},
  author = {Henderson, Peter and Krass, Mark S. and Zheng, Lucia and Guha, Neel and Manning, Christopher D. and Jurafsky, Dan and Ho, Daniel E.},
  title = {Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset},
  publisher = {arXiv},
  year = {2022}
}