Dataset:

coastalcph/fairlex

Sub-tasks:

multi-class-classification topic-classification multi-label-classification

Languages:

English German French Italian Chinese

Language creators:

found

Tasks:

text-classification

Annotation creators:

found machine-generated

Source datasets:

extended

Preprints:

arxiv:2103.13868 arxiv:2105.03887

Other:

bias gender-bias

License:

cc-by-nc-sa-4.0


Dataset Card for "FairLex"

Dataset Summary

We present a benchmark suite of four datasets for evaluating the fairness of pre-trained legal language models and the techniques used to fine-tune them for downstream tasks. Our benchmarks cover four jurisdictions (Council of Europe, USA, Switzerland, and China), five languages (English, German, French, Italian, and Chinese), and fairness across five attributes (gender, age, nationality/region, language, and legal area). In our experiments, we evaluate pre-trained language models using several group-robust fine-tuning techniques and show that performance disparities between groups are pronounced in many cases, while none of these techniques guarantees fairness or consistently mitigates group disparities. Furthermore, we provide a quantitative and qualitative analysis of our results, highlighting open challenges in the development of robustness methods in legal NLP.
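As a rough illustration of what a group disparity means here, the sketch below computes accuracy separately for each value of a protected attribute and reports the spread between groups. The function names, the choice of accuracy as the score, and the use of the population standard deviation as the disparity measure are assumptions made for this example; they are not necessarily the metrics used in the experiments.

```python
from collections import defaultdict
from statistics import pstdev


def groupwise_accuracy(y_true, y_pred, groups):
    """Return {group: accuracy}, computed over the examples of each group."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {group: correct[group] / total[group] for group in total}


def group_disparity(scores):
    """Spread of the per-group scores (population standard deviation)."""
    return pstdev(list(scores.values()))


# Toy data: six examples, a binary protected attribute (hypothetical values).
y_true = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 0, 0, 0]
gender = [0, 0, 0, 1, 1, 1]

scores = groupwise_accuracy(y_true, y_pred, gender)
worst_group_score = min(scores.values())
disparity = group_disparity(scores)
```

A model can have a high overall score while `worst_group_score` stays low, which is why the per-group view matters.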

For the purpose of this work, we release four domain-specific BERT models with continued pre-training on the corpora of the examined datasets (ECtHR, SCOTUS, FSCS, CAIL). We train mini-sized BERT models with 6 Transformer blocks, 384 hidden units, and 12 attention heads. We warm-start all models from the public MiniLMv2 (Wang et al., 2021), using the version distilled from RoBERTa (Liu et al., 2019) for the English datasets (ECtHR, SCOTUS) and the one distilled from XLM-R (Conneau et al., 2021) for the rest (trilingual FSCS and Chinese CAIL). [Link to Models]

Supported Tasks and Leaderboards

The supported tasks are the following:

Dataset	Source	Sub-domain	Language	Task Type	Classes
ECtHR	Chalkidis et al. (2021)	ECHR	en	Multi-label classification	10+1
SCOTUS	Spaeth et al. (2020)	US Law	en	Multi-class classification	14
FSCS	Niklaus et al. (2021)	Swiss Law	de, fr, it	Binary classification	2
CAIL	Wang et al. (2021)	Chinese Law	zh	Multi-class classification	6

ecthr

The European Court of Human Rights (ECtHR) hears allegations that a state has breached human rights provisions of the European Convention on Human Rights (ECHR). We use the dataset of Chalkidis et al. (2021), which contains 11K cases from ECtHR's public database. Each case is mapped to the articles of the ECHR that were violated (if any). This is a multi-label text classification task: given the facts of a case, the goal is to predict the ECHR articles that were violated, if any, as decided (ruled) by the court. The cases are chronologically split into training (9k, 2001-2016), development (1k, 2016-2017), and test (1k, 2017-2019) sets.

To facilitate the study of the fairness of text classifiers, we record the following attributes for each case: (a) the defendant states, which are the European states that allegedly violated the ECHR. The defendant states for each case are a subset of the 47 Member States of the Council of Europe. To have statistical support, we group defendant states into two groups: Central-Eastern European states on the one hand, and all other states on the other, as classified by the EuroVoc thesaurus; (b) the applicant's age at the time of the decision. We extract the birth year of the applicant from the case facts, if possible, and assign the case to an age group (<=35, <=64, or older); and (c) the applicant's gender, extracted from the facts, if possible, based on pronouns, and classified into two categories (male, female).
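The age grouping described above can be sketched as follows. This is a minimal illustration, assuming the age is approximated by the difference between the decision year and the extracted birth year; the function name and the string labels are hypothetical and do not reflect the integer encoding used in the released data.

```python
def age_group(birth_year, decision_year):
    """Map an applicant to a coarse age group at the time of the decision.

    Returns "N/A" when no birth year could be extracted from the facts.
    """
    if birth_year is None:
        return "N/A"
    age = decision_year - birth_year  # year-level approximation of the age
    if age <= 35:
        return "<=35"
    if age <= 64:
        return "<=64"
    return "older"
```

For example, `age_group(1955, 2010)` falls in the `<=64` group, while a case without an extractable birth year stays in the `N/A` group.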

scotus

The US Supreme Court (SCOTUS) is the highest federal court in the United States of America and generally hears only the most controversial or otherwise complex cases that have not been sufficiently well resolved by lower courts. We combine information from SCOTUS opinions with the Supreme Court DataBase (SCDB) (Spaeth et al., 2020). SCDB provides metadata (e.g., date of publication, decisions, issues, decision directions, and many more) for all cases. We consider the available 14 thematic issue areas (e.g., Criminal Procedure, Civil Rights, Economic Activity, etc.). This is a single-label multi-class document classification task: given the court's opinion, the goal is to predict the issue area that corresponds to the subject matter of the controversy (dispute). SCOTUS contains a total of 9,262 cases that we split chronologically into 80% for training (7.4k, 1946-1982), 10% for development (914, 1982-1991), and 10% for testing (931, 1991-2016).
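A chronological split of this kind can be sketched as below. The function name, the `year` key, and the fraction-based cut points are assumptions for illustration; the released splits are fixed and follow year boundaries, so their sizes are only approximately 80/10/10.

```python
def chronological_split(cases, train_frac=0.8, dev_frac=0.1):
    """Sort cases by year and cut them into train / dev / test, oldest first."""
    ordered = sorted(cases, key=lambda case: case["year"])
    n_train = int(len(ordered) * train_frac)
    n_dev = int(len(ordered) * dev_frac)
    train = ordered[:n_train]
    dev = ordered[n_train:n_train + n_dev]
    test = ordered[n_train + n_dev:]  # the most recent cases
    return train, dev, test


# Toy usage with 20 hypothetical cases, one per year, given in reverse order.
toy_cases = [{"year": year} for year in range(2019, 1999, -1)]
train, dev, test = chronological_split(toy_cases)
```

Splitting by time rather than at random means the model is always evaluated on cases decided after those it was trained on.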

From SCDB, we also use the following attributes to study fairness: (a) the type of respondent, which is a manual categorization of respondents (defendants) in five categories (person, public entity, organization, facility, and other); and (b) the direction of the decision, i.e., whether the decision is liberal or conservative, as provided by SCDB.

fscs

The Federal Supreme Court of Switzerland (FSCS) is the last level of appeal in Switzerland and, similarly to SCOTUS, generally hears only the most controversial or otherwise complex cases that have not been sufficiently well resolved by lower courts. The court often focuses only on small parts of the previous decision, where it discusses possible wrong reasoning by the lower court. The Swiss-Judgment-Predict dataset (Niklaus et al., 2021) contains more than 85K decisions from the FSCS written in one of three languages (50K German, 31K French, 4K Italian) from the years 2000 to 2020. The dataset is not parallel, i.e., all cases are unique and each decision is written in a single language. The dataset provides labels for a simplified binary (approval, dismissal) classification task: given the facts of the case, the goal is to predict whether the plaintiff's request is valid or partially valid. The cases are also chronologically split into training (59.7k, 2000-2014), development (8.2k, 2015-2016), and test (17.4k, 2017-2020) sets.

The dataset provides three additional attributes: (a) the language of the FSCS written decision, which is German, French, or Italian; (b) the legal area of the case (public, penal, social, civil, or insurance law), derived from the chambers where the decisions were heard; and (c) the region, which denotes the federal region in which the case originated.

cail

The Supreme People's Court of China is the last level of appeal in China and considers cases that originated from the high people's courts concerning matters of national importance. The Chinese AI and Law challenge (CAIL) dataset (Xiao et al., 2018) is a Chinese legal NLP dataset for judgment prediction and contains over 1M criminal cases. The dataset provides labels for relevant article of criminal code prediction, charge (type of crime) prediction, imprisonment term (period) prediction, and monetary penalty prediction. The publication of the original dataset has been the topic of an active debate in the NLP community (Leins et al., 2020; Tsarapatsanis and Aletras, 2021; Bender, 2021).

Recently, Wang et al. (2021) re-annotated a subset of approx. 100k cases with demographic attributes. Specifically, the new dataset has been annotated with: (a) the defendant's gender, classified into two categories (male, female); and (b) the region of the court, which denotes in which of the 7 provincial-level administrative regions the case was judged. We re-split the dataset chronologically into training (80k, 2013-2017), development (12k, 2017-2018), and test (12k, 2018) sets. In our study, we re-frame the imprisonment term prediction and examine a soft version, dubbed the crime severity prediction task. This is a multi-class classification task where, given the facts of a case, the goal is to predict how severe the committed crime was with respect to the imprisonment term. We approximate crime severity by the length of the imprisonment term, split into 6 clusters (0, <=12, <=36, <=60, <=120, >120 months).
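The six clusters above amount to a simple bucketing of the imprisonment term. The sketch below shows one way to express it; the function name and the assumption that class indices run from 0 (no imprisonment) to 5 (more than 120 months) are illustrative and may not match the label encoding of the released data.

```python
from bisect import bisect_left

# Inclusive upper bounds (in months) of the first five clusters;
# anything above the last bound falls into the sixth cluster.
UPPER_BOUNDS = [0, 12, 36, 60, 120]


def crime_severity(months):
    """Map an imprisonment term in months to a severity class in 0..5."""
    if months < 0:
        raise ValueError("imprisonment term must be non-negative")
    return bisect_left(UPPER_BOUNDS, months)
```

Because `bisect_left` returns the position of the first bound that is greater than or equal to the term, a term of exactly 12 months stays in the `<=12` cluster and 13 months moves to the `<=36` cluster.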

Languages

We consider datasets in English, German, French, Italian, and Chinese.

Dataset Structure

Data Instances

ecthr

An example of 'train' looks as follows.

{
  "text": "1.  At the beginning of the events relevant to the application, K. had a daughter, P., and a son, M., born in 1986 and 1988 respectively. ... ",
  "labels": [4],
  "defendant_state": 1,
  "applicant_gender": 0,
  "applicant_age": 0
}

scotus

An example of 'train' looks as follows.

{
  "text": "United States Supreme Court MICHIGAN NAT. BANK v. MICHIGAN(1961) No. 155 Argued: Decided: March 6, 1961 </s> R.  S. 5219 permits States to tax the shares of national banks, but not at a greater rate than . . . other moneyed capital . . . coming into competition with the business of national banks ...",
  "label": 9,
  "decision_direction": 0,
  "respondent_type": 3
}

fscs

An example of 'train' looks as follows.

{
  "text": "A.- Der 1955 geborene V._ war seit 1. September 1986 hauptberuflich als technischer Kaufmann bei der Firma A._ AG tätig und im Rahmen einer Nebenbeschäftigung (Nachtarbeit) ab Mai 1990 bei einem Bewachungsdienst angestellt gewesen, als er am 10....",
  "label": 0,
  "decision_language": 0,
  "legal_area": 5,
  "court_region": 2
}

cail

An example of 'train' looks as follows.

{
  "text": "南宁市兴宁区人民检察院指控，2012年1月1日19时许，被告人蒋满德在南宁市某某路某号某市场内，因经营问题与被害人杨某某发生争吵并推打 ...",
  "label": 0,
  "defendant_gender": 0,
  "court_region": 5
}
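Records like the ones above can presumably be obtained with the Hugging Face `datasets` library. The sketch below assumes that the library is installed, that the repository `coastalcph/fairlex` can be loaded by name with `load_dataset`, and that the configuration names match the subsection names used in this card; the helper `load_fairlex` is hypothetical. Depending on the installed library version, additional arguments may be required.

```python
# Configuration names as used in this card.
CONFIGS = ("ecthr", "scotus", "fscs", "cail")


def load_fairlex(config, split="train"):
    """Load one FairLex configuration (needs `datasets` and network access)."""
    if config not in CONFIGS:
        raise ValueError(f"unknown configuration: {config!r}")
    # Imported lazily so that the name check above works without the library.
    from datasets import load_dataset

    return load_dataset("coastalcph/fairlex", config, split=split)
```

With this helper, `load_fairlex("ecthr")[0]` would be expected to return a dictionary with the keys shown in the `ecthr` example.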

Data Fields

ecthr

text : a string feature (factual paragraphs (facts) from the case description).
labels : a list of classification labels (a list of violated ECHR articles, if any). The ECHR articles considered are 2, 3, 5, 6, 8, 9, 10, 11, 14, P1-1.
defendant_state : Defendant State group (C.E. European, Rest of Europe)
applicant_gender : The gender of the applicant (N/A, Male, Female)
applicant_age : The age group of the applicant (N/A, <=35, <=64, or older)
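For multi-label training, the `labels` list is typically turned into a fixed-length binary vector. The sketch below is one way to do this; the function name and the default of 10 label positions (taken from the "10+1" class count, reading the "+1" as the empty label set, i.e., no violation) are assumptions for illustration.

```python
def multi_hot(labels, num_labels=10):
    """Encode a list of label indices as a 0/1 vector of length num_labels."""
    vector = [0] * num_labels
    for index in labels:
        if not 0 <= index < num_labels:
            raise ValueError(f"label index out of range: {index}")
        vector[index] = 1
    return vector
```

A case with `"labels": [4]` yields a vector with a single 1 at position 4, and a case without any violation yields an all-zero vector.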

scotus

text : a string feature (the court opinion).
label : a classification label (the relevant issue area). The issue areas are: (1, Criminal Procedure), (2, Civil Rights), (3, First Amendment), (4, Due Process), (5, Privacy), (6, Attorneys), (7, Unions), (8, Economic Activity), (9, Judicial Power), (10, Federalism), (11, Interstate Relations), (12, Federal Taxation), (13, Miscellaneous), (14, Private Action).
respondent_type : the type of respondent, which is a manual categorization (clustering) of respondents (defendants) in five categories (person, public entity, organization, facility, and other).
decision_direction : the direction of the decision, i.e., whether the decision is liberal, or conservative, provided by SCDB.

fscs

text : a string feature (the facts of the case).
label : a classification label (approval or dismissal of the appeal).
decision_language : the language of the FSCS written decision (German, French, or Italian).
legal_area : the legal area of the case (public, penal, social, civil, or insurance law), derived from the chambers where the decisions were heard.
court_region : the federal region in which the case originated.

cail

text : a string feature (the factual description of the case).
label : a classification label (crime severity derived by the imprisonment term).
defendant_gender : the gender of the defendant (Male or Female).
court_region : the region of the court, i.e., in which of the 7 provincial-level administrative regions the case was judged.

Data Splits

Dataset	Training	Development	Test	Total
ECtHR	9000	1000	1000	11000
SCOTUS	7417	914	931	9262
FSCS	59709	8208	17357	85274
CAIL	80000	12000	12000	104000
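As a quick sanity check, the split sizes in the table can be turned into shares of each dataset. The sketch below copies the numbers from the table; the names `SPLITS` and `split_shares` are hypothetical.

```python
# Split sizes copied from the table above: (training, development, test).
SPLITS = {
    "ECtHR": (9000, 1000, 1000),
    "SCOTUS": (7417, 914, 931),
    "FSCS": (59709, 8208, 17357),
    "CAIL": (80000, 12000, 12000),
}


def split_shares(sizes):
    """Return each split's fraction of the dataset total."""
    total = sum(sizes)
    return [size / total for size in sizes]


for name, sizes in SPLITS.items():
    shares = ", ".join(f"{share:.1%}" for share in split_shares(sizes))
    print(f"{name}: total={sum(sizes)} shares={shares}")
```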

Dataset Creation

Curation Rationale

More Information Needed

Source Data

Dataset	Source	Sub-domain	Language	Task Type	Classes
ECtHR	Chalkidis et al. (2021)	ECHR	en	Multi-label classification	10+1
SCOTUS	Spaeth et al. (2020)	US Law	en	Multi-class classification	14
FSCS	Niklaus et al. (2021)	Swiss Law	de, fr, it	Binary classification	2
CAIL	Wang et al. (2021)	Chinese Law	zh	Multi-class classification	6

Initial Data Collection and Normalization

We standardize and put together four datasets: ECtHR (Chalkidis et al., 2021), SCOTUS (Spaeth et al., 2020), FSCS (Niklaus et al., 2021), and CAIL (Xiao et al., 2018; Wang et al., 2021) that are already publicly available.

The benchmark is not a blind stapling of pre-existing resources; we augment previous datasets. In the case of ECtHR, previously unavailable demographic attributes have been released to make the original dataset amenable to fairness research. For SCOTUS, two resources (court opinions and SCDB) have been combined for the very same reason, while the authors provide a manual categorization (clustering) of respondents.

All datasets, except SCOTUS, are publicly available and have been previously published. If datasets or the papers where they were introduced were not compiled or written by the authors, the original work is referenced, and the authors encourage FairLex users to do the same. In fact, this work should only be referenced, in addition to citing the original work, when jointly experimenting with multiple FairLex datasets and using the FairLex evaluation framework and infrastructure, or when using any newly introduced annotations (ECtHR, SCOTUS). Otherwise, only the original work should be cited.

Who are the source language producers?

More Information Needed

Annotations

Annotation process

More Information Needed

Who are the annotators?

All classification labels rely on legal decisions (ECtHR, FSCS, CAIL), or are part of archival procedures (SCOTUS).

The demographic attributes and other metadata are either provided by the legal databases or have been extracted automatically from the text by means of regular expressions.
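To give a sense of what such an extraction can look like, the sketch below pulls candidate birth years out of English case facts. The pattern and the function name are hypothetical and far simpler than anything needed in practice; they are not the expressions used to build the released attributes.

```python
import re

# Hypothetical pattern: a four-digit year directly following "born in".
BIRTH_YEAR = re.compile(r"\bborn\s+in\s+(\d{4})\b")


def extract_birth_years(text):
    """Return all years that directly follow the phrase 'born in'."""
    return [int(year) for year in BIRTH_YEAR.findall(text)]
```

Applied to the `ecthr` example, whose facts mention children "born in 1986 and 1988 respectively", this naive pattern finds only 1986 and cannot tell whether the year belongs to the applicant, which illustrates why such attributes are only available for some cases.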

Consider the Dataset Description and Discussion of Biases sections, and the original publication for detailed information.

Personal and Sensitive Information

The data is in general partially anonymized in accordance with the applicable national law. The data is considered to be in the public sphere from a privacy perspective. This is a very sensitive matter, as the courts try to keep a balance between transparency (the public's right to know) and privacy (respect for private and family life). ECtHR cases are partially anonymized by the court. Its data is processed and made public in accordance with the European Data Protection Law. SCOTUS cases may also contain personal information, and the data is processed and made available by the US Supreme Court, whose proceedings are public. While this ensures compliance with US law, it is very likely that, similarly to the ECtHR, any processing could be justified by either implied consent or legitimate interest under European law. In FSCS, the names of the parties have been redacted by the courts according to the official guidelines. CAIL cases are also partially anonymized by the courts according to the courts' policy. Its data is processed and made public in accordance with Chinese Law.

Considerations for Using the Data

Social Impact of Dataset

This work can help practitioners build assistive technology for legal professionals, with respect to the legal framework (jurisdiction) in which they operate; technology that does not rely only on performance on majority groups but also considers minorities and the robustness of the developed models across them. This is an important application field, where more research should be conducted (Tsarapatsanis and Aletras, 2021) in order to improve legal services and democratize law, but more importantly, to highlight (inform the audience about) the various multi-aspect shortcomings, seeking a responsible and ethical (fair) deployment of technology.

Discussion of Biases

The current version of FairLex covers a very small fraction of legal applications, jurisdictions, and protected attributes. The benchmark inevitably cannot cover "everything in the whole wide (legal) world" (Raji et al., 2021), but nonetheless, we believe that the published resources will help critical research in the area of fairness.

Some protected attributes within the datasets are extracted automatically by means of regular expressions, i.e., the gender and the age in the ECtHR dataset, or manually clustered by the authors, such as the defendant state in the ECtHR dataset and the respondent attribute in the SCOTUS dataset. Those assumptions and simplifications can hold in an experimental setting only and should by no means be used in real-world applications, where some simplifications, e.g., binary gender, would not be appropriate. Publishing and using the data does not imply that the authors or future users endorse, to any degree, the legal standards or framework of the examined datasets.

Other Known Limitations

More Information Needed

Additional Information

More Information Needed

Dataset Curators

Ilias Chalkidis, Tommaso Pasini, Sheng Zhang, Letizia Tomada, Sebastian Felix Schwemer, Anders Søgaard. FairLex: A Multilingual Benchmark for Evaluating Fairness in Legal Text Processing. 2022. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland.

Note: The original datasets were curated by others and have been further curated (updated) by means of this benchmark.

Licensing Information

The benchmark is released under an Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license. The licensing is compatible with the licensing of former material (remixed, transformed datasets).

Citation Information

@inproceedings{chalkidis-etal-2022-fairlex,
      author={Chalkidis, Ilias and Pasini, Tommaso and Zhang, Sheng and
              Tomada, Letizia and Schwemer, Sebastian Felix and Søgaard, Anders},
      title={FairLex: A Multilingual Benchmark for Evaluating Fairness in Legal Text Processing},
      booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics},
      year={2022},
      address={Dublin, Ireland}
}

Note: Please consider citing and giving credit to all publications releasing the examined datasets.

Contributions

Thanks to @iliaschalkidis for adding this dataset.

Author:

coastalcph

Dataset size:

51.21 KB