This is a multilingual dataset containing ~130k annotated sentence boundaries. It contains laws and court decisions in six languages.
English, French, Italian, German, Portuguese, Spanish
The data files are named using the following format: {language}_{type}_{shard}.jsonl.xz
type is one of the following:
Use the dataset like this:
from datasets import load_dataset
config = 'fr_laws'  # {language}_{type} | to load all languages and/or all types, use 'all_all'
dataset = load_dataset('rdcs/MultiLegalSBD', config)
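If you download individual shard files instead of going through `load_dataset`, the naming scheme above suggests they are xz-compressed JSON Lines files. The following is a minimal sketch under that assumption, using only the Python standard library. The helper names `parse_shard_name` and `read_shard`, the example file name `fr_laws_0.jsonl.xz`, and the assumption that none of the three name parts contains an underscore are illustrative and not taken from the dataset itself.

```python
import json
import lzma
import re
from pathlib import Path

# Assumed pattern, derived from the scheme {language}_{type}_{shard}.jsonl.xz.
# It assumes that no part of the name contains an underscore.
SHARD_NAME = re.compile(
    r"^(?P<language>[^_]+)_(?P<type>[^_]+)_(?P<shard>[^_.]+)\.jsonl\.xz$"
)


def parse_shard_name(filename):
    """Split a shard file name into its language, type, and shard parts."""
    match = SHARD_NAME.match(Path(filename).name)
    if match is None:
        raise ValueError(f"unexpected shard file name: {filename!r}")
    return match.groupdict()


def read_shard(path):
    """Yield one decoded JSON object per non-empty line of an xz-compressed JSONL file."""
    with lzma.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)
```

For example, `parse_shard_name("fr_laws_0.jsonl.xz")` returns the three name parts as a dictionary, and `read_shard(path)` can be iterated lazily so that a large shard never has to be held in memory at once. The fields of each record are not described here, so inspect the first record before relying on any key.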
There is only one split available.
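Because only one split is provided, you may want to carve out a held-out portion yourself before training or evaluating a model. The sketch below shows one reproducible way to do that in plain Python. The function name `holdout_split`, the default fraction of 0.1, and the default seed are assumptions chosen for illustration, not recommendations from the dataset authors.

```python
import random


def holdout_split(records, test_fraction=0.1, seed=42):
    """Shuffle records deterministically and return (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(records)
    # A dedicated Random instance keeps the split reproducible without
    # touching the global random state.
    random.Random(seed).shuffle(shuffled)
    # Reserve at least one record for the test portion when there is any data.
    n_test = min(len(shuffled), max(1, round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]
```

The `datasets` library also provides `Dataset.train_test_split`, which serves the same purpose on a loaded dataset. Note that a purely random split can place sentences from the same document in both portions, so a document-level split may be preferable depending on your evaluation goals.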
Please cite our arXiv preprint:
@misc{brugger2023multilegalsbd,
title={MultiLegalSBD: A Multilingual Legal Sentence Boundary Detection Dataset},
author={Tobias Brugger and Matthias Stürmer and Joel Niklaus},
year={2023},
eprint={2305.01211},
archivePrefix={arXiv},
primaryClass={cs.CL}
}