Dataset:

rcds/MultiLegalSBD

Dataset Card for MultiLegalSBD

Dataset Summary

This is a multilingual dataset containing ~130k annotated sentence boundaries. It covers laws and court decisions in six languages.

Supported Tasks and Leaderboards

[More Information Needed]

Languages

English, French, Italian, German, Portuguese, Spanish

Dataset Structure

The data files are named according to the pattern {language}_{type}_{shard}.jsonl.xz

{type} is one of the following:

  • laws
  • judgements

Use the dataset like this:

from datasets import load_dataset

# Config name format: {language}_{type}.
# To load all languages and/or all types, use 'all_all'.
config = 'fr_laws'
dataset = load_dataset('rcds/MultiLegalSBD', config)
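
The config naming scheme above can be sketched as a small helper that builds and validates a config name before it is passed to load_dataset. This is only an illustration: the helper config_name is hypothetical, the language codes other than 'fr' are assumed to be the usual two-letter ISO 639-1 codes, and accepting 'all' for just one of the two parts is an assumption based on the comment in the snippet above.

```python
# Assumed language codes; only 'fr' appears in the dataset card itself.
LANGUAGES = {"en", "fr", "it", "de", "pt", "es", "all"}
# Document types listed in the card, plus the 'all' wildcard.
TYPES = {"laws", "judgements", "all"}


def config_name(language="all", doc_type="all"):
    """Build a '{language}_{type}' config name (hypothetical helper)."""
    if language not in LANGUAGES:
        raise ValueError(f"unknown language: {language!r}")
    if doc_type not in TYPES:
        raise ValueError(f"unknown type: {doc_type!r}")
    return f"{language}_{doc_type}"


print(config_name("fr", "laws"))
print(config_name())
```

The returned string can then be used as the second argument of load_dataset.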

Data Instances

[More Information Needed]

Data Fields

  • text: the original text
  • spans:
    • start: offset of the first character
    • end: offset of the last character
    • label: always Sentence (the only label)
    • token_start: id of the first token
    • token_end: id of the last token
  • tokens:
    • text: token text
    • start: offset of the first character
    • end: offset of the last character
    • id: token id
    • ws: whether the token is followed by whitespace
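
As a rough sketch of how these fields fit together, the function below rebuilds each annotated sentence from the tokens its span covers. It assumes that token_start and token_end are inclusive token ids, as the field descriptions suggest, and that ws is a boolean. The function name sentences_from_tokens and the sample record are hypothetical, and the record is trimmed to the fields the function reads.

```python
def sentences_from_tokens(record):
    """Rebuild each annotated sentence from the tokens its span covers."""
    tokens_by_id = {tok["id"]: tok for tok in record["tokens"]}
    sentences = []
    for span in record["spans"]:
        parts = []
        # Assumption: token_end is the id of the last token, inclusive.
        for token_id in range(span["token_start"], span["token_end"] + 1):
            tok = tokens_by_id[token_id]
            # Re-insert a space wherever the token was followed by whitespace.
            parts.append(tok["text"] + (" " if tok["ws"] else ""))
        sentences.append("".join(parts).strip())
    return sentences


# Hypothetical record, not taken from the dataset.
example = {
    "text": "Art. 1 applies. It is binding.",
    "spans": [
        {"label": "Sentence", "token_start": 0, "token_end": 3},
        {"label": "Sentence", "token_start": 4, "token_end": 7},
    ],
    "tokens": [
        {"id": 0, "text": "Art.", "ws": True},
        {"id": 1, "text": "1", "ws": True},
        {"id": 2, "text": "applies", "ws": False},
        {"id": 3, "text": ".", "ws": True},
        {"id": 4, "text": "It", "ws": True},
        {"id": 5, "text": "is", "ws": True},
        {"id": 6, "text": "binding", "ws": False},
        {"id": 7, "text": ".", "ws": False},
    ],
}

for sentence in sentences_from_tokens(example):
    print(sentence)
```

Working from token ids rather than character offsets sidesteps the question of whether the end offset is inclusive, which is worth checking against a real record before slicing text directly.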

Data Splits

Only one split is available.

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

[More Information Needed]

Citation Information

Please cite our arXiv preprint:

@misc{brugger2023multilegalsbd,
      title={MultiLegalSBD: A Multilingual Legal Sentence Boundary Detection Dataset}, 
      author={Tobias Brugger and Matthias Stürmer and Joel Niklaus},
      year={2023},
      eprint={2305.01211},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contributions

[More Information Needed]