
Dataset Card for "LegalGLUE"

Dataset Summary

The "Legal General Language Understanding Evaluation" (LegalGLUE) dataset was created as part of a bachelor thesis. It combines four existing datasets that cover three task types and a total of 23 languages.

Supported Tasks

Dataset | Source | Task Type | Languages
German_LER | Leitner et al. | Named Entity Recognition | German
LeNER_Br | de Araujo et al., 2018 | Named Entity Recognition | Portuguese
SwissJudgmentPrediction | Niklaus et al. | Binary Text Classification | German, French, Italian
MultiEURLEX | Chalkidis et al. | Multi-label Text Classification | 23 languages (see below)

Languages

See the Data Splits section.

Dataset Structure

Data Instances

German_LER

German_LER example

from datasets import load_dataset
dataset = load_dataset('jfrenz/legalglue', 'german_ler')
{
  'id': '66722',
  'tokens':['4.', 'Die', 'Kostenentscheidung', 'für', 'das', 'gerichtliche', 'Antragsverfahren', 'beruht', 'auf', '§', '21', 'Abs.', '2', 'Satz', '1', 'i.', 'V.', 'm.', '§', '20', 'Abs.', '1', 'Satz', '1', 'WBO', '.'],
  'ner_tags': [38, 38, 38, 38, 38, 38, 38, 38, 38, 3, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 38]
}

LeNER-Br

LeNER-Br example

from datasets import load_dataset
dataset = load_dataset('jfrenz/legalglue', 'lener_br')
{
  'id': '7826',
  'tokens': ['Firmado', 'por', 'assinatura', 'digital', '(', 'MP', '2.200-2/2001', ')', 'JOSÉ', 'ROBERTO', 'FREIRE', 'PIMENTA', 'Ministro', 'Relator', 'fls', '.', 'PROCESSO', 'Nº', 'TST-RR-1603-79.2010.5.20.0001'],
  'ner_tags': [0, 0, 0, 0, 0, 9, 10, 0, 3, 4, 4, 4, 0, 0, 0, 0, 11, 12, 12]
}

SwissJudgmentPrediction

swissJudgmentPrediction_de example

from datasets import load_dataset
dataset = load_dataset('jfrenz/legalglue', 'swissJudgmentPrediction_de')
{
  'id': 48755,
  'year': 2014,
  'text': "Sachverhalt: A. X._ fuhr am 25. Juli 2012 bei Mülligen mit seinem Personenwagen auf dem zweiten Überholstreifen der Autobahn A1 in Richtung Zürich. Gemäss Anklage schloss er auf einen Lieferwagen auf und schwenkte vom zweiten auf den ersten Überholstreifen aus. Danach fuhr er an zwei Fahrzeugen rechts vorbei und wechselte auf die zweite Überholspur zurück. B. Das Obergericht des Kantons Aargau erklärte X._ am 14. Januar 2014 zweitinstanzlich der groben Verletzung der Verkehrsregeln schuldig. Es bestrafte ihn mit einer bedingten Geldstrafe von 30 Tagessätzen zu Fr. 430.-- und einer Busse von Fr. 3'000.--. C. X._ führt Beschwerde in Strafsachen. Er beantragt, er sei von Schuld und Strafe freizusprechen. Eventualiter sei die Sache an die Vorinstanz zurückzuweisen. ",
  'label': 0,
  'language': 'de',
  'region': 'Northwestern Switzerland',
  'canton': 'ag',
  'legal area': 'penal law'
}

MultiEURLEX

Monolingual example from the MultiEURLEX dataset

from datasets import load_dataset
dataset = load_dataset('jfrenz/legalglue', 'multi_eurlex_de')
{
  'celex_id': '32002R0130',
  'text': 'Verordnung (EG) Nr. 130/2002 der Kommission\nvom 24. Januar 2002\nbezüglich der im Rahmen der Auss...',
  'labels': [3, 17, 5]
}

Multilingual example from the MultiEURLEX dataset

from datasets import load_dataset
dataset = load_dataset('jfrenz/legalglue', 'multi_eurlex_all_languages')
{
  'celex_id': '32002R0130',
  'text': {
    'bg': None,
    'cs': None,
    'da': 'Kommissionens ...',
    'de': 'Verordnung ... ',
    'el': '...',
    'en': '...',
    ...
  },
  'labels': [3, 17, 5]
}

Data Fields

German_LER
  • id : id of the sample
  • tokens : the tokens of the sample text
  • ner_tags : the NER tags of each token
LeNER_Br
  • id : id of the sample
  • tokens : the tokens of the sample text
  • ner_tags : the NER tags of each token
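
Both NER configurations store one integer tag per token. The sketch below groups consecutive tagged tokens into entity spans. It assumes that a single id marks tokens outside any entity (0 in the LeNER-Br sample and 38 in the German_LER sample, judging only from the example instances shown earlier); the helper name entity_spans is hypothetical and not part of the dataset loader. Because it only looks for runs of non-outside tags, two directly adjacent entities would be merged into one span.

```python
from typing import List, Tuple


def entity_spans(
    tokens: List[str], ner_tags: List[int], outside_id: int
) -> List[Tuple[List[str], List[int]]]:
    """Group maximal runs of tokens whose tag differs from outside_id."""
    if len(tokens) != len(ner_tags):
        raise ValueError("tokens and ner_tags must have the same length")
    spans = []
    current_tokens, current_tags = [], []
    for token, tag in zip(tokens, ner_tags):
        if tag == outside_id:
            # An outside token closes the span that was being collected, if any.
            if current_tokens:
                spans.append((current_tokens, current_tags))
                current_tokens, current_tags = [], []
        else:
            current_tokens.append(token)
            current_tags.append(tag)
    if current_tokens:
        spans.append((current_tokens, current_tags))
    return spans


# The LeNER-Br sample instance from this card.
sample = {
    "tokens": ["Firmado", "por", "assinatura", "digital", "(", "MP", "2.200-2/2001", ")",
               "JOSÉ", "ROBERTO", "FREIRE", "PIMENTA", "Ministro", "Relator", "fls", ".",
               "PROCESSO", "Nº", "TST-RR-1603-79.2010.5.20.0001"],
    "ner_tags": [0, 0, 0, 0, 0, 9, 10, 0, 3, 4, 4, 4, 0, 0, 0, 0, 11, 12, 12],
}

# outside_id=0 is an assumption read off the sample, not a documented constant.
spans = entity_spans(sample["tokens"], sample["ner_tags"], outside_id=0)
for span_tokens, span_tags in spans:
    print(" ".join(span_tokens), span_tags)
```

For German_LER the same helper could be called with outside_id=38, under the same assumption.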
SwissJudgmentPrediction
  • id : ( int ) ID of the document
  • year : ( int ) the publication year
  • text : ( str ) the facts of the case
  • label : ( class label ) the judgment outcome: 0 (dismissal) or 1 (approval)
  • language : ( str ) one of (de, fr, it)
  • region : ( str ) the region of the lower court
  • canton : ( str ) the canton of the lower court
  • legal area : ( str ) the legal area of the case
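
As a rough illustration of how the label and language fields can be used together, the sketch below counts judgment outcomes per language. The label names follow the field description above; the function name outcomes_by_language and the toy records are made up for illustration and are not taken from the dataset.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Mapping

# Label meaning as stated in the field description: 0 (dismissal), 1 (approval).
LABEL_NAMES = {0: "dismissal", 1: "approval"}


def outcomes_by_language(records: Iterable[Mapping]) -> Dict[str, Dict[str, int]]:
    """Count judgment outcomes per language code."""
    counts = defaultdict(Counter)
    for record in records:
        counts[record["language"]][LABEL_NAMES[record["label"]]] += 1
    # Report both outcomes for every language, including zero counts.
    return {
        lang: {name: counter[name] for name in LABEL_NAMES.values()}
        for lang, counter in counts.items()
    }


# Made-up toy records, reduced to the two fields used here.
toy_records = [
    {"language": "de", "label": 0},
    {"language": "de", "label": 1},
    {"language": "de", "label": 0},
    {"language": "fr", "label": 0},
    {"language": "it", "label": 1},
]

summary = outcomes_by_language(toy_records)
for lang, outcome_counts in sorted(summary.items()):
    print(lang, outcome_counts)
```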
MultiEURLEX

Monolingual use:

  • celex_id : ( str ) Official Document ID of the document
  • text : ( str ) An EU Law
  • labels : ( List[int] ) List of relevant EUROVOC concepts (labels)

Multilingual use:

  • celex_id : ( str ) Official Document ID of the document
  • text : (dict[ str ]) A dictionary with the 23 languages as keys and the corresponding EU Law as values.
  • labels : ( List[int] ) List of relevant EUROVOC concepts (labels)
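
In the multilingual configuration, a language whose translation is missing has the value None in the text dictionary, as in the sample instance shown earlier. A minimal sketch of handling this is given below; the helper names available_languages and pick_text are hypothetical, and the sample is abbreviated to four of the 23 language keys.

```python
from typing import Dict, List, Optional


def available_languages(text: Dict[str, Optional[str]]) -> List[str]:
    """Return the language codes that actually have a document."""
    return sorted(lang for lang, doc in text.items() if doc is not None)


def pick_text(text: Dict[str, Optional[str]], preferred: List[str]) -> Optional[str]:
    """Return the document in the first preferred language that is present."""
    for lang in preferred:
        doc = text.get(lang)
        if doc is not None:
            return doc
    return None


# Abbreviated version of the multilingual sample instance from this card.
sample = {
    "celex_id": "32002R0130",
    "text": {"bg": None, "cs": None, "da": "Kommissionens ...", "de": "Verordnung ... "},
    "labels": [3, 17, 5],
}

print(available_languages(sample["text"]))
print(pick_text(sample["text"], preferred=["bg", "de", "da"]))
```

Passing an ordered preference list keeps the fallback policy explicit, which seems useful because the splits table shows that document counts differ considerably between languages.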

By default, the labels list consists of level 1 EUROVOC concepts. This can be changed by passing the label_level parameter when loading the dataset (available levels: level_1, level_2, level_3, all_levels).

from datasets import load_dataset
dataset = load_dataset('jfrenz/legalglue', 'multi_eurlex_de', label_level="level_3")

Data Splits

Dataset | Language | ISO code | Number of Documents (train / dev / test)
German-LER | German | de | 66,723 / - / -
LeNER-Br | Portuguese | pt | 7,828 / 1,177 / 1,390
SwissJudgmentPrediction | German | de | 35,458 / 4,705 / 9,725
SwissJudgmentPrediction | French | fr | 21,179 / 3,095 / 6,820
SwissJudgmentPrediction | Italian | it | 3,072 / 408 / 812
MultiEURLEX | English | en | 55,000 / 5,000 / 5,000
MultiEURLEX | German | de | 55,000 / 5,000 / 5,000
MultiEURLEX | French | fr | 55,000 / 5,000 / 5,000
MultiEURLEX | Italian | it | 55,000 / 5,000 / 5,000
MultiEURLEX | Spanish | es | 52,785 / 5,000 / 5,000
MultiEURLEX | Polish | pl | 23,197 / 5,000 / 5,000
MultiEURLEX | Romanian | ro | 15,921 / 5,000 / 5,000
MultiEURLEX | Dutch | nl | 55,000 / 5,000 / 5,000
MultiEURLEX | Greek | el | 55,000 / 5,000 / 5,000
MultiEURLEX | Hungarian | hu | 22,664 / 5,000 / 5,000
MultiEURLEX | Portuguese | pt | 23,188 / 5,000 / 5,000
MultiEURLEX | Czech | cs | 23,187 / 5,000 / 5,000
MultiEURLEX | Swedish | sv | 42,490 / 5,000 / 5,000
MultiEURLEX | Bulgarian | bg | 15,986 / 5,000 / 5,000
MultiEURLEX | Danish | da | 55,000 / 5,000 / 5,000
MultiEURLEX | Finnish | fi | 42,497 / 5,000 / 5,000
MultiEURLEX | Slovak | sk | 15,986 / 5,000 / 5,000
MultiEURLEX | Lithuanian | lt | 23,188 / 5,000 / 5,000
MultiEURLEX | Croatian | hr | 7,944 / 2,500 / 5,000
MultiEURLEX | Slovene | sl | 23,184 / 5,000 / 5,000
MultiEURLEX | Estonian | et | 23,126 / 5,000 / 5,000
MultiEURLEX | Latvian | lv | 23,188 / 5,000 / 5,000
MultiEURLEX | Maltese | mt | 17,521 / 5,000 / 5,000
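
The per-language counts can be aggregated to see how unevenly the languages are represented. The sketch below does this for SwissJudgmentPrediction only, with the counts copied from the table above; the variable names are illustrative.

```python
# (train, dev, test) document counts for SwissJudgmentPrediction, copied from the splits table.
SWISS_SPLITS = {
    "de": (35458, 4705, 9725),
    "fr": (21179, 3095, 6820),
    "it": (3072, 408, 812),
}

# Sum each split across the three languages.
split_totals = tuple(sum(column) for column in zip(*SWISS_SPLITS.values()))
grand_total = sum(split_totals)

# Share of all documents contributed by each language.
language_share = {
    lang: sum(counts) / grand_total for lang, counts in SWISS_SPLITS.items()
}

print("train/dev/test totals:", split_totals)
for lang, share in language_share.items():
    print(f"{lang}: {share:.1%} of {grand_total} documents")
```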

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

[More Information Needed]

Citation Information

[More Information Needed]

Contributions

[More Information Needed]