数据集:

joelito/lextreme

中文

Dataset Card for LEXTREME: A Multilingual Legal Benchmark for Natural Language Understanding

Dataset Summary

The dataset consists of 11 diverse multilingual legal NLU datasets. 6 datasets have one single configuration and 5 datasets have two or three configurations. This leads to a total of 18 tasks (8 single-label text classification tasks, 5 multi-label text classification tasks and 5 token-classification tasks).

Use the dataset like this:

from datasets import load_dataset
dataset = load_dataset("joelito/lextreme", "swiss_judgment_prediction")

Supported Tasks and Leaderboards

The dataset supports the tasks of text classification and token classification. In detail, we support the folliwing tasks and configurations:

task task type configurations link
Brazilian Court Decisions Judgment Prediction (judgment, unanimity) joelito/brazilian_court_decisions
Swiss Judgment Prediction Judgment Prediction default joelito/swiss_judgment_prediction
German Argument Mining Argument Mining default joelito/german_argument_mining
Greek Legal Code Topic Classification (volume, chapter, subject) greek_legal_code
Online Terms of Service Unfairness Classification (unfairness level, clause topic) online_terms_of_service
Covid 19 Emergency Event Event Classification default covid19_emergency_event
MultiEURLEX Topic Classification (level 1, level 2, level 3) multi_eurlex
LeNER BR Named Entity Recognition default lener_br
LegalNERo Named Entity Recognition default legalnero
Greek Legal NER Named Entity Recognition default greek_legal_ner
MAPA Named Entity Recognition (coarse, fine) mapa

Languages

The following languages are supported: bg , cs , da, de, el, en, es, et, fi, fr, ga, hr, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv

Dataset Structure

Data Instances

The file format is jsonl and three data splits are present for each configuration (train, validation and test).

Data Fields

[More Information Needed]

Data Splits

[More Information Needed]

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

How can I contribute a dataset to lextreme? Please follow the following steps:

  • Make sure your dataset is available on the huggingface hub and has a train, validation and test split.
  • Create a pull request to the lextreme repository by adding the following to the lextreme.py file:
    • Create a dict _{YOUR_DATASET_NAME} (similar to _BRAZILIAN_COURT_DECISIONS_JUDGMENT) containing all the necessary information about your dataset (task_type, input_col, label_col, etc.)
    • Add your dataset to the BUILDER_CONFIGS list: LextremeConfig(name="{your_dataset_name}", **_{YOUR_DATASET_NAME})
    • Test that it works correctly by loading your subset with load_dataset("lextreme", "{your_dataset_name}") and inspecting a few examples.
  • Dataset Curators

    [More Information Needed]

    Licensing Information

    [More Information Needed]

    Citation Information

    @misc{niklaus2023lextreme,
        title={LEXTREME: A Multi-Lingual and Multi-Task Benchmark for the Legal Domain},
        author={Joel Niklaus and Veton Matoshi and Pooja Rani and Andrea Galassi and Matthias Stürmer and Ilias Chalkidis},
        year={2023},
        eprint={2301.13126},
        archivePrefix={arXiv},
        primaryClass={cs.CL}
    }
    

    Contributions

    Thanks to @JoelNiklaus for adding this dataset.