数据集:
joelito/german_argument_mining
This dataset consists of 200 randomly chosen judgments. In these judgments a legal expert annotated the components conclusion, definition and subsumption of the German legal writing style Urteilsstil.
"Overall 25,075 sentences are annotated. 5% (1,202) of these sentences are marked as conclusion, 21% (5,328) as definition, 53% (13,322) are marked as subsumption and the remaining 21% (6,481) as other. The length of judgments in sentences ranges from 38 to 862 sentences. The median of judgments have 97 sentences, the length of most judgments is on the shorter side." (Urchs. et al., 2021)
"Judgments from 22 of the 131 courts are selected for the corpus. Most judgments originate from the VG Augsburg (59 / 30%) followed by the VG Ansbach (39 / 20%) and LSG Munich (33 / 17%)." (Urchs. et al., 2021)
"29% (58) of all selected judgments are issued in the year 2016, followed by 22% (44) from the year 2017 and 21% (41) issued in the year 2015. [...] The percentages of selected judgments and decisions issued in 2018 and 2019 are roughly the same. No judgments from 2020 are selected." (Urchs. et al., 2021)
The dataset can be used for multi-class text classification tasks, more specifically, for argument mining.
The language in the dataset is German as it is used in Bavarian courts in Germany.
Each sentence is saved as a json object on a line in one of the three files train.jsonl , validation.jsonl or test.jsonl . The file meta.jsonl contains meta information for each court. The file_number is present in all files for identification. Each sentence of the court decision was categorized according to its function.
The file meta.jsonl contains for each row the following fields:
The files train.jsonl , validation.jsonl and test.jsonl contain the following fields:
No split provided in the original release.
Splits created by Joel Niklaus. We randomly split the dataset into 80% (160 decisions, 19271 sentences) train, 10% validation (20 decisions, 2726 sentences) and 10% test (20 decisions, 3078 sentences). We made sure, that a decision only occurs in one split and is not dispersed over multiple splits.
Label Distribution
label | train | validation | test |
---|---|---|---|
conclusion | 975 | 115 | 112 |
definition | 4105 | 614 | 609 |
subsumption | 10034 | 1486 | 1802 |
other | 4157 | 511 | 555 |
total | 19271 | 2726 | 3078 |
Creating a publicly available German legal text corpus consisting of judgments that have been annotated by a legal expert. The annotated components consist of conclusion , definition and subsumption of the German legal writing style Urteilsstil .
“The decision corpus is a collection of the decisions published on the website www.gesetze-bayern.de . At the time of the crawling the website offered 32,748 decisions of 131 Bavarian courts, dating back to 2015. The decisions are provided from the Bavarian state after the courts agreed to a publication. All decisions are processed by the publisher C.H.BECK, commissioned by the Bavarian state. This processing includes anonymisation, key-wording, and adding of editorial guidelines to the decisions.” (Urchs. et al., 2021)
Who are the source language producers?German courts from Bavaria
“As stated above, the judgment corpus consist of 200 randomly chosen judgments that are annotated by a legal expert, who holds a first legal state exam. Due to financial, staff and time reasons the presented iteration of the corpus was only annotated by a single expert. In a future version several other experts will annotate the corpus and the inter-annotator agreement will be calculated.” (Urchs. et al., 2021)
Who are the annotators?A legal expert, who holds a first legal state exam.
"All decisions are processed by the publisher C.H.BECK, commissioned by the Bavarian state. This processing includes ** anonymisation , key-wording, and adding of editorial guidelines to the decisions.” (Urchs. et al., 2021)
[More Information Needed]
[More Information Needed]
The SoMaJo Sentence Splitter has been used. Upon manual inspection of the dataset, we could see that the sentence splitter had poor accuracy in some cases (see analyze_dataset() in convert_to_hf_dataset.py ). When creating the splits, we thought about merging small sentences with their neighbors or removing them all together. However, since we could not find an straightforward way to do this, we decided to leave the dataset content untouched.
Note that the information given in this dataset card refer to the dataset version as provided by Joel Niklaus and Veton Matoshi. The dataset at hand is intended to be part of a bigger benchmark dataset. Creating a benchmark dataset consisting of several other datasets from different sources requires postprocessing. Therefore, the structure of the dataset at hand, including the folder structure, may differ considerably from the original dataset. In addition to that, differences with regard to dataset statistics as give in the respective papers can be expected. The reader is advised to have a look at the conversion script convert_to_hf_dataset.py in order to retrace the steps for converting the original dataset into the present jsonl-format. For further information on the original dataset structure, we refer to the bibliographical references and the original Github repositories and/or web pages provided in this dataset card.
The names of the original dataset curators and creators can be found in references given below, in the section Citation Information . Additional changes were made by Joel Niklaus ( Email ; Github ) and Veton Matoshi ( Email ; Github ).
Creative Commons Attribution 4.0 International
@dataset{urchs_stefanie_2020_3936490, author = {Urchs, Stefanie and Mitrović, Jelena}, title = {{German legal jugements annotated with judement style components}}, month = jul, year = 2020, publisher = {Zenodo}, doi = {10.5281/zenodo.3936490}, url = {https://doi.org/10.5281/zenodo.3936490} }
@conference{icaart21, author = {Urchs., Stefanie and Mitrovi{\'{c}}., Jelena and Granitzer., Michael}, booktitle = {Proceedings of the 13th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART,}, doi = {10.5220/0010187305150521}, isbn = {978-989-758-484-8}, issn = {2184-433X}, organization = {INSTICC}, pages = {515--521}, publisher = {SciTePress}, title = {{Design and Implementation of German Legal Decision Corpora}}, year = {2021} }
Thanks to @kapllan and @joelniklaus for adding this dataset.