Dataset Card for "LexGLUE"

Dataset Summary

Inspired by the recent widespread use of the GLUE multi-task NLP benchmark (Wang et al., 2018), the subsequent, more difficult SuperGLUE (Wang et al., 2019), other previous multi-task NLP benchmarks (Conneau and Kiela, 2018; McCann et al., 2018), and similar initiatives in other domains (Peng et al., 2019), we introduce the Legal General Language Understanding Evaluation (LexGLUE) benchmark, a benchmark dataset for evaluating the performance of NLP methods on legal tasks. LexGLUE is based on seven existing legal NLP datasets, selected using criteria largely drawn from SuperGLUE.

As in GLUE and SuperGLUE (Wang et al., 2019b,a), one of our goals is to push towards generic (or ‘foundation’) models that can cope with multiple NLP tasks, in our case legal NLP tasks, possibly with limited task-specific fine-tuning. Another goal is to provide a convenient and informative entry point for NLP researchers and practitioners wishing to explore or develop methods for legal NLP. With these goals in mind, the datasets we include in LexGLUE and the tasks they address have been simplified in several ways to make it easier for newcomers and generic models to address all tasks.

The LexGLUE benchmark is accompanied by experimental infrastructure that relies on the Hugging Face Transformers library and resides at: https://github.com/coastalcph/lex-glue.

Supported Tasks and Leaderboards

The supported tasks are the following:

Dataset	Source	Sub-domain	Task Type	Classes
ECtHR (Task A)	Chalkidis et al. (2019)	ECHR	Multi-label classification	10+1
ECtHR (Task B)	Chalkidis et al. (2021a)	ECHR	Multi-label classification	10+1
SCOTUS	Spaeth et al. (2020)	US Law	Multi-class classification	14
EUR-LEX	Chalkidis et al. (2021b)	EU Law	Multi-label classification	100
LEDGAR	Tuggener et al. (2020)	Contracts	Multi-class classification	100
UNFAIR-ToS	Lippi et al. (2019)	Contracts	Multi-label classification	8+1
CaseHOLD	Zheng et al. (2021)	US Law	Multiple choice QA	n/a

ecthr_a

The European Court of Human Rights (ECtHR) hears allegations that a state has breached human rights provisions of the European Convention on Human Rights (ECHR). For each case, the dataset provides a list of factual paragraphs (facts) from the case description. Each case is mapped to the articles of the ECHR that were violated (if any).

ecthr_b

scotus

The US Supreme Court (SCOTUS) is the highest federal court in the United States of America and generally hears only the most controversial or otherwise complex cases that have not been sufficiently resolved by lower courts. This is a single-label multi-class classification task: given a document (court opinion), the task is to predict the relevant issue area. The 14 issue areas cluster 278 issues whose focus is on the subject matter of the controversy (dispute).

eurlex

European Union (EU) legislation is published on the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from EuroVoc, a multilingual thesaurus maintained by the Publications Office. The current version of EuroVoc contains more than 7k concepts referring to various activities of the EU and its Member States (e.g., economics, health-care, trade). Given a document, the task is to predict its EuroVoc labels (concepts).

ledgar

The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.

unfair_tos

The UNFAIR-ToS dataset contains 50 Terms of Service (ToS) from online platforms (e.g., YouTube, eBay, Facebook). The dataset has been annotated at the sentence level with 8 types of unfair contractual terms (sentences), meaning terms that potentially violate user rights according to European consumer law.

case_hold

The CaseHOLD (Case Holdings on Legal Decisions) dataset includes multiple choice questions about holdings of US court cases from the Harvard Law Library case law corpus. Holdings are short summaries of legal rulings that accompany referenced decisions relevant for the present case. The input consists of an excerpt (or prompt) from a court decision, containing a reference to a particular case, while the holding statement is masked out. The model must identify the correct (masked) holding statement from a selection of five choices.
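
The configuration names used as headings above (ecthr_a through case_hold) can be passed to the Hugging Face `datasets` library to fetch one task at a time. The following is a minimal sketch, not the benchmark's official loading code: the Hub identifier `"lex_glue"` is an assumption (the dataset may instead be published under a namespaced identifier such as `"coastalcph/lex_glue"`, depending on the library version), and the helper name `load_lexglue_task` is hypothetical.

```python
# Configuration names as listed in this card.
LEXGLUE_CONFIGS = [
    "ecthr_a", "ecthr_b", "scotus", "eurlex",
    "ledgar", "unfair_tos", "case_hold",
]


def load_lexglue_task(config_name, split="train", hub_id="lex_glue"):
    """Load one LexGLUE task (hypothetical helper).

    `hub_id` is an assumed Hub identifier; adjust it if the dataset
    is hosted under a different name.
    """
    if config_name not in LEXGLUE_CONFIGS:
        raise ValueError(
            f"Unknown LexGLUE configuration {config_name!r}; "
            f"expected one of {LEXGLUE_CONFIGS}"
        )
    # Imported lazily so the name check works without the library installed.
    from datasets import load_dataset

    return load_dataset(hub_id, config_name, split=split)
```

Calling `load_lexglue_task("scotus")` would download the data on first use, so it needs network access and the `datasets` package installed.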

The current leaderboard includes several Transformer-based (Vaswani et al., 2017) pre-trained language models, which achieve state-of-the-art performance in most NLP tasks (Bommasani et al., 2021) and NLU benchmarks (Wang et al., 2019a). Results reported by Chalkidis et al. (2021):

Task-wise Test Results

Dataset	ECtHR A	ECtHR B	SCOTUS	EUR-LEX	LEDGAR	UNFAIR-ToS	CaseHOLD
Model	μ-F1 / m-F1	μ-F1 / m-F1	μ-F1 / m-F1	μ-F1 / m-F1	μ-F1 / m-F1	μ-F1 / m-F1	μ-F1 / m-F1
TFIDF+SVM	64.7 / 51.7	74.6 / 65.1	78.2 / 69.5	71.3 / 51.4	87.2 / 82.4	95.4 / 78.8	n/a
Medium-sized Models (L=12, H=768, A=12)
BERT	71.2 / 63.6	79.7 / 73.4	68.3 / 58.3	71.4 / 57.2	87.6 / 81.8	95.6 / 81.3	70.8
RoBERTa	69.2 / 59.0	77.3 / 68.9	71.6 / 62.0	71.9 / 57.9	87.9 / 82.3	95.2 / 79.2	71.4
DeBERTa	70.0 / 60.8	78.8 / 71.0	71.1 / 62.7	72.1 / 57.4	88.2 / 83.1	95.5 / 80.3	72.6
Longformer	69.9 / 64.7	79.4 / 71.7	72.9 / 64.0	71.6 / 57.7	88.2 / 83.0	95.5 / 80.9	71.9
BigBird	70.0 / 62.9	78.8 / 70.9	72.8 / 62.0	71.5 / 56.8	87.8 / 82.6	95.7 / 81.3	70.8
Legal-BERT	70.0 / 64.0	80.4 / 74.7	76.4 / 66.5	72.1 / 57.4	88.2 / 83.0	96.0 / 83.0	75.3
CaseLaw-BERT	69.8 / 62.9	78.8 / 70.3	76.6 / 65.9	70.7 / 56.6	88.3 / 83.0	96.0 / 82.3	75.4
Large-sized Models (L=24, H=1024, A=16)
RoBERTa	73.8 / 67.6	79.8 / 71.6	75.5 / 66.3	67.9 / 50.3	88.6 / 83.6	95.8 / 81.6	74.4
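
The table headers above use μ-F1 for micro-averaged F1 and m-F1 for macro-averaged F1. As a rough illustration of how the two differ on multi-label predictions, here is a small pure-Python sketch. The function name `micro_macro_f1` and the toy data are illustrative only, and the zero-division convention (a label with no gold and no predicted instances scores 0) is an assumption that may not match the benchmark's evaluation code.

```python
from typing import List, Set, Tuple


def micro_macro_f1(
    gold: List[Set[int]], pred: List[Set[int]], num_labels: int
) -> Tuple[float, float]:
    """Return (micro-F1, macro-F1) for multi-label predictions."""
    tp = [0] * num_labels
    fp = [0] * num_labels
    fn = [0] * num_labels
    for gold_labels, pred_labels in zip(gold, pred):
        for label in range(num_labels):
            if label in pred_labels and label in gold_labels:
                tp[label] += 1
            elif label in pred_labels:
                fp[label] += 1
            elif label in gold_labels:
                fn[label] += 1

    def f1(t: int, p: int, n: int) -> float:
        denom = 2 * t + p + n
        # Assumed convention: undefined F1 counts as 0.
        return 2 * t / denom if denom else 0.0

    # Micro: pool the counts over all labels, then compute one F1.
    micro = f1(sum(tp), sum(fp), sum(fn))
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(f1(tp[i], fp[i], fn[i]) for i in range(num_labels)) / num_labels
    return micro, macro


# Toy example: a classifier that always predicts label 0.
gold = [{0}, {0}, {0}, {1}]
pred = [{0}, {0}, {0}, {0}]
micro, macro = micro_macro_f1(gold, pred, num_labels=2)
print(f"micro-F1={micro:.3f}  macro-F1={macro:.3f}")
```

In this toy case the rare label is never predicted, so macro-F1 falls well below micro-F1. The same pattern is visible in the table for tasks with many infrequent labels, such as EUR-LEX.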

Averaged (Mean over Tasks) Test Results

Averaging	Arithmetic	Harmonic	Geometric
Model	μ-F1 / m-F1	μ-F1 / m-F1	μ-F1 / m-F1
Medium-sized Models (L=12, H=768, A=12)
BERT	77.8 / 69.5	76.7 / 68.2	77.2 / 68.8
RoBERTa	77.8 / 68.7	76.8 / 67.5	77.3 / 68.1
DeBERTa	78.3 / 69.7	77.4 / 68.5	77.8 / 69.1
Longformer	78.5 / 70.5	77.5 / 69.5	78.0 / 70.0
BigBird	78.2 / 69.6	77.2 / 68.5	77.7 / 69.0
Legal-BERT	79.8 / 72.0	78.9 / 70.8	79.3 / 71.4
CaseLaw-BERT	79.4 / 70.9	78.5 / 69.7	78.9 / 70.3
Large-sized Models (L=24, H=1024, A=16)
RoBERTa	79.4 / 70.8	78.4 / 69.1	78.9 / 70.0
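
The three averaging schemes can be sketched with Python's standard `statistics` module. This assumes an unweighted mean over the seven tasks, with the single CaseHOLD score counted in both the μ-F1 and m-F1 averages. That assumption is inferred from the numbers in this card rather than stated in it. The input scores are BERT's row from the task-wise table.

```python
from statistics import geometric_mean, harmonic_mean, mean

# BERT test scores per task, copied from the task-wise table:
# ECtHR A, ECtHR B, SCOTUS, EUR-LEX, LEDGAR, UNFAIR-ToS, CaseHOLD.
# CaseHOLD has a single score, assumed here to count in both lists.
bert_micro_f1 = [71.2, 79.7, 68.3, 71.4, 87.6, 95.6, 70.8]
bert_macro_f1 = [63.6, 73.4, 58.3, 57.2, 81.8, 81.3, 70.8]


def aggregate(scores):
    """Return the three unweighted means, rounded to one decimal."""
    return {
        "arithmetic": round(mean(scores), 1),
        "harmonic": round(harmonic_mean(scores), 1),
        "geometric": round(geometric_mean(scores), 1),
    }


micro_summary = aggregate(bert_micro_f1)
macro_summary = aggregate(bert_macro_f1)
print("micro-F1:", micro_summary)
print("macro-F1:", macro_summary)
```

Under these assumptions the arithmetic and harmonic means agree with the BERT row of the averaged table. The harmonic mean penalizes a weak score on any single task more heavily than the arithmetic mean does, which is why it is the lowest of the three in every row.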

Languages

We only consider English datasets, to make experimentation easier for researchers across the globe.