Dataset:

jfrenz/legalglue

Tasks:

Text Classification

Token Classification

Sub-tasks:

named-entity-recognition multi-label-classification topic-classification

Languages:

Multilinguality:

multilingual

Source datasets:

extended

ArXiv:

arxiv:2003.13016 arxiv:2110.00806 arxiv:2109.00904

Other:

german-ler lener-br

Dataset Card for "LegalGLUE"

Dataset Summary

The "Legal General Language Understanding Evaluation" (LegalGLUE) dataset was created as part of a bachelor's thesis. It combines four existing datasets that cover three task types and a total of 23 languages.

Supported Tasks

Dataset	Source	Task Type	Languages
German_LER	Leitner et al.	Named Entity Recognition	German
LeNER_Br	de Araujo et al., 2018	Named Entity Recognition	Portuguese
SwissJudgmentPrediction	Niklaus et al.	Binary Text Classification	German, French, Italian
MultiEURLEX	Chalkidis et al.	Multi-label Text Classification	23 languages (see below)

Languages

See the Data Splits section for the languages covered by each dataset.

Dataset Structure

Data Instances

German_LER

German_LER example

from datasets import load_dataset
dataset = load_dataset('jfrenz/legalglue', 'german_ler')

{
  'id': '66722',
  'tokens': ['4.', 'Die', 'Kostenentscheidung', 'für', 'das', 'gerichtliche', 'Antragsverfahren', 'beruht', 'auf', '§', '21', 'Abs.', '2', 'Satz', '1', 'i.', 'V.', 'm.', '§', '20', 'Abs.', '1', 'Satz', '1', 'WBO', '.'],
  'ner_tags': [38, 38, 38, 38, 38, 38, 38, 38, 38, 3, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 38]
}
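
The ner_tags values are integer IDs, one per token. The ID-to-name mapping is not listed in this card, so the following minimal sketch works on the raw IDs only: it groups consecutive tokens that share the same tag ID, which makes the entity spans in the example above easier to see. The helper name tag_runs is hypothetical and not part of the dataset or the datasets library.

```python
from itertools import groupby


def tag_runs(tokens, ner_tags):
    """Group consecutive tokens that share the same integer tag ID."""
    if len(tokens) != len(ner_tags):
        raise ValueError("tokens and ner_tags must have the same length")
    runs = []
    # groupby only merges neighbouring items, so each run is a contiguous span.
    for tag_id, group in groupby(zip(tokens, ner_tags), key=lambda pair: pair[1]):
        runs.append((tag_id, [token for token, _ in group]))
    return runs


# The German_LER instance shown above.
example = {
    "tokens": ['4.', 'Die', 'Kostenentscheidung', 'für', 'das', 'gerichtliche',
               'Antragsverfahren', 'beruht', 'auf', '§', '21', 'Abs.', '2', 'Satz',
               '1', 'i.', 'V.', 'm.', '§', '20', 'Abs.', '1', 'Satz', '1', 'WBO', '.'],
    "ner_tags": [38, 38, 38, 38, 38, 38, 38, 38, 38, 3, 22, 22, 22, 22, 22, 22,
                 22, 22, 22, 22, 22, 22, 22, 22, 22, 38],
}

for tag_id, run in tag_runs(example["tokens"], example["ner_tags"]):
    print(tag_id, " ".join(run))
```

For this instance the sketch yields four runs with the tag IDs 38, 3, 22 and 38. Reading IDs 3 and 22 as the beginning and continuation of one entity is an assumption; check the label names in the loaded dataset's features before relying on it.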

LeNER-Br

LeNER-Br example

from datasets import load_dataset
dataset = load_dataset('jfrenz/legalglue', 'lener_br')

{
  'id': '7826',
  'tokens': ['Firmado', 'por', 'assinatura', 'digital', '(', 'MP', '2.200-2/2001', ')', 'JOSÉ', 'ROBERTO', 'FREIRE', 'PIMENTA', 'Ministro', 'Relator', 'fls', '.', 'PROCESSO', 'Nº', 'TST-RR-1603-79.2010.5.20.0001'],
  'ner_tags': [0, 0, 0, 0, 0, 9, 10, 0, 3, 4, 4, 4, 0, 0, 0, 0, 11, 12, 12]
}

SwissJudgmentPrediction

swissJudgmentPrediction_de example

from datasets import load_dataset
dataset = load_dataset('jfrenz/legalglue', 'swissJudgmentPrediction_de')

{
  'id': 48755,
  'year': 2014,
  'text': "Sachverhalt: A. X._ fuhr am 25. Juli 2012 bei Mülligen mit seinem Personenwagen auf dem zweiten Überholstreifen der Autobahn A1 in Richtung Zürich. Gemäss Anklage schloss er auf einen Lieferwagen auf und schwenkte vom zweiten auf den ersten Überholstreifen aus. Danach fuhr er an zwei Fahrzeugen rechts vorbei und wechselte auf die zweite Überholspur zurück. B. Das Obergericht des Kantons Aargau erklärte X._ am 14. Januar 2014 zweitinstanzlich der groben Verletzung der Verkehrsregeln schuldig. Es bestrafte ihn mit einer bedingten Geldstrafe von 30 Tagessätzen zu Fr. 430.-- und einer Busse von Fr. 3'000.--. C. X._ führt Beschwerde in Strafsachen. Er beantragt, er sei von Schuld und Strafe freizusprechen. Eventualiter sei die Sache an die Vorinstanz zurückzuweisen. ",
  'label': 0,
  'language': 'de',
  'region': 'Northwestern Switzerland',
  'canton': 'ag',
  'legal area': 'penal law'
}

MultiEURLEX

Monolingual example from the MultiEURLEX dataset

from datasets import load_dataset
dataset = load_dataset('jfrenz/legalglue', 'multi_eurlex_de')

{
  'celex_id': '32002R0130',
  'text': 'Verordnung (EG) Nr. 130/2002 der Kommission\nvom 24. Januar 2002\nbezüglich der im Rahmen der Auss...',
  'labels': [3, 17, 5]
}

Multilingual example from the MultiEURLEX dataset

from datasets import load_dataset
dataset = load_dataset('jfrenz/legalglue', 'multi_eurlex_all_languages')

{
  'celex_id': '32002R0130',
  'text': {
    'bg': None,
    'cs': None,
    'da': 'Kommissionens ...',
    'de': 'Verordnung ... ',
    'el': '...',
    'en': '...',
    ...
  },
  'labels': [3, 17, 5]
}
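
In the multilingual configuration a document that has no translation in a given language carries None for that language, as the 'bg' and 'cs' entries above show. A minimal sketch for keeping only the available translations follows. The helper name available_texts is hypothetical, and the example dictionary repeats only the languages shown above with their elided texts.

```python
def available_texts(example):
    """Return only the language -> text pairs that are not None."""
    return {lang: text for lang, text in example["text"].items() if text is not None}


# Abbreviated copy of the multilingual instance shown above.
example = {
    "celex_id": "32002R0130",
    "text": {
        "bg": None,
        "cs": None,
        "da": "Kommissionens ...",
        "de": "Verordnung ... ",
        "el": "...",
        "en": "...",
    },
    "labels": [3, 17, 5],
}

texts = available_texts(example)
print(sorted(texts))
```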

Data Fields

German_LER

id : id of the sample
tokens : the tokens of the sample text
ner_tags : the NER tags of each token

LeNER_Br

id : id of the sample
tokens : the tokens of the sample text
ner_tags : the NER tags of each token

SwissJudgmentPrediction

id : ( int ) ID of the document
year : ( int ) the publication year
text : ( str ) the facts of the case
label : ( class label ) the judgment outcome: 0 (dismissal) or 1 (approval)
language : ( str ) one of (de, fr, it)
region : ( str ) the region of the lower court
canton : ( str ) the canton of the lower court
legal area : ( str ) the legal area of the case
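
As a small illustration of these fields, the following sketch turns the integer label into the outcome names given above (0 for dismissal, 1 for approval) and builds a one-line summary of a case. The helper name describe_case is hypothetical, the summary format is arbitrary, and the text field is shortened here.

```python
# Mapping taken from the field description: 0 (dismissal) or 1 (approval).
OUTCOMES = {0: "dismissal", 1: "approval"}


def describe_case(example):
    """Build a short, human-readable summary of one SwissJudgmentPrediction row."""
    outcome = OUTCOMES[example["label"]]
    # Note that the field name 'legal area' contains a space.
    return f"{example['year']} {example['canton'].upper()} ({example['legal area']}): {outcome}"


# Values copied from the swissJudgmentPrediction_de instance; text shortened.
example = {
    "id": 48755,
    "year": 2014,
    "text": "Sachverhalt: A. X._ fuhr am 25. Juli 2012 bei Mülligen ...",
    "label": 0,
    "language": "de",
    "region": "Northwestern Switzerland",
    "canton": "ag",
    "legal area": "penal law",
}

print(describe_case(example))
```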

MultiEURLEX

Monolingual use:

celex_id : ( str ) Official Document ID of the document
text : ( str ) An EU Law
labels : ( List[int] ) List of relevant EUROVOC concepts (labels)
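
Because labels is a variable-length list of concept IDs, multi-label classifiers usually need it as a fixed-length multi-hot vector. The following is a minimal sketch of that conversion. The helper name to_multi_hot is hypothetical, and the value 21 for the number of level 1 concepts is an assumption that should be checked against the features of the loaded dataset.

```python
def to_multi_hot(labels, num_labels):
    """Convert a list of label IDs into a 0/1 vector of length num_labels."""
    vector = [0] * num_labels
    for label in labels:
        if not 0 <= label < num_labels:
            raise ValueError(f"label {label} is outside the range 0..{num_labels - 1}")
        vector[label] = 1
    return vector


# Assumption: 21 level 1 EUROVOC concepts; verify against the dataset features.
NUM_LEVEL_1_LABELS = 21

vector = to_multi_hot([3, 17, 5], NUM_LEVEL_1_LABELS)
print(vector)
```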

Multilingual use:

celex_id : ( str ) Official Document ID of the document
text : (dict[ str ]) A dictionary with the 23 languages as keys and the corresponding EU Law as values.
labels : ( List[int] ) List of relevant EUROVOC concepts (labels)

By default, the labels list consists of level 1 EUROVOC concepts. This can be changed by passing the label_level parameter when loading the dataset (available levels: level_1, level_2, level_3, all_levels).

from datasets import load_dataset
dataset = load_dataset('jfrenz/legalglue', 'multi_eurlex_de', label_level="level_3")
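
Since label_level accepts only the four values listed above, it can help to validate the value before starting a download. The following sketch does that and assembles the arguments for load_dataset. The helper name multi_eurlex_args is hypothetical, and the configuration name pattern multi_eurlex_<ISO code> is an assumption extrapolated from the 'multi_eurlex_de' example.

```python
LABEL_LEVELS = ("level_1", "level_2", "level_3", "all_levels")


def multi_eurlex_args(language="de", label_level="level_1"):
    """Return (args, kwargs) for load_dataset after validating label_level."""
    if label_level not in LABEL_LEVELS:
        raise ValueError(f"label_level must be one of {LABEL_LEVELS}, got {label_level!r}")
    # Assumption: configurations are named 'multi_eurlex_<ISO code>'.
    config = f"multi_eurlex_{language}"
    return ("jfrenz/legalglue", config), {"label_level": label_level}


args, kwargs = multi_eurlex_args("de", "level_3")
print(args, kwargs)

# With the datasets library installed, the call would then be:
# from datasets import load_dataset
# dataset = load_dataset(*args, **kwargs)
```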

Data Splits

Dataset	Language	ISO code	Number of documents (train / dev / test)
German-LER	German	de	66,723 / - / -
LeNER-Br	Portuguese	pt	7,828 / 1,177 / 1,390
SwissJudgmentPrediction	German	de	35,458 / 4,705 / 9,725
SwissJudgmentPrediction	French	fr	21,179 / 3,095 / 6,820
SwissJudgmentPrediction	Italian	it	3,072 / 408 / 812
MultiEURLEX	English	en	55,000 / 5,000 / 5,000
MultiEURLEX	German	de	55,000 / 5,000 / 5,000
MultiEURLEX	French	fr	55,000 / 5,000 / 5,000
MultiEURLEX	Italian	it	55,000 / 5,000 / 5,000
MultiEURLEX	Spanish	es	52,785 / 5,000 / 5,000
MultiEURLEX	Polish	pl	23,197 / 5,000 / 5,000
MultiEURLEX	Romanian	ro	15,921 / 5,000 / 5,000
MultiEURLEX	Dutch	nl	55,000 / 5,000 / 5,000
MultiEURLEX	Greek	el	55,000 / 5,000 / 5,000
MultiEURLEX	Hungarian	hu	22,664 / 5,000 / 5,000
MultiEURLEX	Portuguese	pt	23,188 / 5,000 / 5,000
MultiEURLEX	Czech	cs	23,187 / 5,000 / 5,000
MultiEURLEX	Swedish	sv	42,490 / 5,000 / 5,000
MultiEURLEX	Bulgarian	bg	15,986 / 5,000 / 5,000
MultiEURLEX	Danish	da	55,000 / 5,000 / 5,000
MultiEURLEX	Finnish	fi	42,497 / 5,000 / 5,000
MultiEURLEX	Slovak	sk	15,986 / 5,000 / 5,000
MultiEURLEX	Lithuanian	lt	23,188 / 5,000 / 5,000
MultiEURLEX	Croatian	hr	7,944 / 2,500 / 5,000
MultiEURLEX	Slovene	sl	23,184 / 5,000 / 5,000
MultiEURLEX	Estonian	et	23,126 / 5,000 / 5,000
MultiEURLEX	Latvian	lv	23,188 / 5,000 / 5,000
MultiEURLEX	Maltese	mt	17,521 / 5,000 / 5,000
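
German-LER is listed with a train split only, so anyone who needs dev and test data has to carve it out themselves. The following is a minimal sketch of a reproducible index split in plain Python. The helper name split_indices, the 10 % fractions and the seed are arbitrary assumptions and do not reproduce any split used in the thesis or the source papers.

```python
import random


def split_indices(num_examples, dev_fraction=0.1, test_fraction=0.1, seed=42):
    """Shuffle the indices 0..num_examples-1 and cut them into train/dev/test."""
    indices = list(range(num_examples))
    # A dedicated Random instance keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(indices)
    num_dev = int(num_examples * dev_fraction)
    num_test = int(num_examples * test_fraction)
    dev = indices[:num_dev]
    test = indices[num_dev:num_dev + num_test]
    train = indices[num_dev + num_test:]
    return train, dev, test


# 66,723 is the German-LER train size from the table above.
train_idx, dev_idx, test_idx = split_indices(66723)
print(len(train_idx), len(dev_idx), len(test_idx))
```

With the datasets library, the resulting index lists could then be passed to Dataset.select, for example dataset['train'].select(dev_idx), to materialise each subset.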

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

[More Information Needed]

Citation Information

[More Information Needed]

Contributions

[More Information Needed]

Author:

jfrenz

Dataset size:

15.61 MB