Dataset:

rcds/MultiLegalSBD

Task:

Token classification

Languages:

Size:

100K<n<1M

Preprint:

arxiv:2305.01211


Dataset Card for MultiLegalSBD

Dataset Summary

This is a multilingual dataset containing ~130k annotated sentence boundaries. It consists of laws and court decisions in 6 different languages.

Supported Tasks and Leaderboards

[More Information Needed]

Languages

English, French, Italian, German, Portuguese, Spanish

Dataset Structure

The data files are named according to the following pattern: {language}_{type}_{shard}.jsonl.xz

type is one of the following:

laws
judgements

Use the dataset like this:

from datasets import load_dataset
# Config name is {language}_{type}; use 'all_all' to load all languages and all types
config = 'fr_laws'
dataset = load_dataset('rcds/MultiLegalSBD', config)
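The config naming scheme can be sketched as follows. This is only an illustration: the card shows 'fr' and 'all_all', so the remaining two-letter language codes below are assumed ISO 639-1 codes, and the helper name `config_names` is hypothetical. Check the repository's file list before relying on any of them.

```python
from itertools import product

# Assumption: ISO 639-1 codes for the six listed languages.
# Only 'fr' is confirmed by the usage example in this card.
LANGUAGES = ["en", "fr", "it", "de", "pt", "es"]

# Document types as listed in the card.
TYPES = ["laws", "judgements"]


def config_names(languages=LANGUAGES, types=TYPES):
    """Build every '{language}_{type}' config name (hypothetical helper)."""
    return [f"{lang}_{doc_type}" for lang, doc_type in product(languages, types)]


configs = config_names()
print(len(configs))
print(configs[:4])
```

Any of these strings would take the place of 'fr_laws' in the loading call, provided the assumed language codes match the ones the repository actually uses.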

Data Instances

[More Information Needed]

Data Fields

text: the original text
spans:
- start: offset of the first character
- end: offset of the last character
- label: always Sentence (the only label used)
- token_start: id of the first token
- token_end: id of the last token
tokens:
- text: token text
- start: offset of the first character
- end: offset of the last character
- id: token id
- ws: whether the token is followed by whitespace
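To illustrate how these fields fit together, here is a minimal sketch that works on a hand-written record, not one taken from the dataset. It assumes that `end` offsets are exclusive (Python slice style). The card says "offset of the last character", so the real data may use inclusive offsets, which the `end_inclusive` flag covers. It also assumes that `ws` stands for a single space. The function names are hypothetical.

```python
# Hypothetical record following the field layout described above.
record = {
    "text": "Art. 1 applies. See Art. 2.",
    "spans": [
        {"start": 0, "end": 15, "label": "Sentence", "token_start": 0, "token_end": 3},
        {"start": 16, "end": 27, "label": "Sentence", "token_start": 4, "token_end": 7},
    ],
    "tokens": [
        {"text": "Art.", "start": 0, "end": 4, "id": 0, "ws": True},
        {"text": "1", "start": 5, "end": 6, "id": 1, "ws": True},
        {"text": "applies", "start": 7, "end": 14, "id": 2, "ws": False},
        {"text": ".", "start": 14, "end": 15, "id": 3, "ws": True},
        {"text": "See", "start": 16, "end": 19, "id": 4, "ws": True},
        {"text": "Art.", "start": 20, "end": 24, "id": 5, "ws": True},
        {"text": "2", "start": 25, "end": 26, "id": 6, "ws": False},
        {"text": ".", "start": 26, "end": 27, "id": 7, "ws": False},
    ],
}


def sentences_from_spans(rec, end_inclusive=False):
    """Slice the text with each span's character offsets.

    Assumption: 'end' is exclusive unless end_inclusive=True.
    """
    extra = 1 if end_inclusive else 0
    return [rec["text"][s["start"]:s["end"] + extra] for s in rec["spans"]]


def text_from_tokens(tokens):
    """Rebuild text from tokens, assuming 'ws' means one trailing space."""
    return "".join(t["text"] + (" " if t["ws"] else "") for t in tokens)


sentences = sentences_from_spans(record)
rebuilt = text_from_tokens(record["tokens"])
print(sentences)
print(rebuilt == record["text"])
```

On a real record, comparing the rebuilt text with `text` and checking whether the sliced sentences end on punctuation is a quick way to find out which offset convention the data follows.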

Data Splits

There is only one split available.

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

[More Information Needed]

Citation Information

Please cite our arXiv preprint:

@misc{brugger2023multilegalsbd,
      title={MultiLegalSBD: A Multilingual Legal Sentence Boundary Detection Dataset}, 
      author={Tobias Brugger and Matthias Stürmer and Joel Niklaus},
      year={2023},
      eprint={2305.01211},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contributions

[More Information Needed]

Author:

rcds

Dataset size:

26.3 MB