This is a multilingual dataset containing ~130k annotated sentence boundaries. It contains laws and court decisions in six different languages.
[More Information Needed]
English, French, Italian, German, Portuguese, Spanish
The data files are named using the following format: {language}_{type}_{shard}.jsonl.xz
type is one of the following:
Use the dataset like this:
    from datasets import load_dataset

    # Config names follow {language}_{type}; to load all languages and/or all types, use 'all_all'.
    config = 'fr_laws'
    dataset = load_dataset('rdcs/MultiLegalSBD', config)
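If you have downloaded individual shard files instead of going through `load_dataset`, the naming scheme above can be handled with the standard library alone. The following is a minimal sketch, not part of the dataset's own tooling: the helper names `parse_shard_name`, `config_name`, and `read_shard` are hypothetical, the shard suffix `0` in the usage note is an assumed example, and no record field names are assumed because the card does not list them.

```python
import json
import lzma
import re
from pathlib import Path

# Pattern derived from the documented naming scheme {language}_{type}_{shard}.jsonl.xz.
# Assumption: none of the three parts contains an underscore.
SHARD_NAME = re.compile(
    r"^(?P<language>[^_]+)_(?P<type>[^_]+)_(?P<shard>[^_.]+)\.jsonl\.xz$"
)


def parse_shard_name(filename):
    """Split a shard file name into its language, type, and shard parts."""
    match = SHARD_NAME.match(Path(filename).name)
    if match is None:
        raise ValueError(f"unexpected shard file name: {filename!r}")
    return match.groupdict()


def config_name(language, doc_type):
    """Build a {language}_{type} config string such as 'fr_laws'."""
    return f"{language}_{doc_type}"


def read_shard(path):
    """Yield one decoded JSON object per non-empty line of an xz-compressed JSON Lines file."""
    with lzma.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)
```

For example, `parse_shard_name("fr_laws_0.jsonl.xz")` splits the name into its three parts, and `read_shard` can then be iterated to inspect which fields the records actually contain before writing any code that depends on them.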
[More Information Needed]
There is only one split available.
[More Information Needed]
[More Information Needed]
Who are the source language producers?
[More Information Needed]
[More Information Needed]
Who are the annotators?
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
Please cite our arXiv preprint:
    @misc{brugger2023multilegalsbd,
          title={MultiLegalSBD: A Multilingual Legal Sentence Boundary Detection Dataset},
          author={Tobias Brugger and Matthias Stürmer and Joel Niklaus},
          year={2023},
          eprint={2305.01211},
          archivePrefix={arXiv},
          primaryClass={cs.CL}
    }
[More Information Needed]