Dataset:

csebuetnlp/BanglaParaphrase

Tasks:

Text-to-text generation

Languages:

Bengali

Multilinguality:

monolingual

Size:

100K<n<1M

Language creators:

found

Annotation creators:

found

Source datasets:

original

Preprint:

arxiv:2210.05109

Other:

conditional-text-generation paraphrase-generation

License:

cc-by-nc-sa-4.0


Dataset Card for "BanglaParaphrase"

Dataset Summary

We present BanglaParaphrase, a high-quality synthetic Bangla paraphrase dataset containing about 466k paraphrase pairs. The paraphrases are of high quality in that they are semantically coherent and syntactically diverse.

Supported Tasks and Leaderboards

More information needed

Languages

Bengali

Loading the dataset

from datasets import load_dataset

ds = load_dataset("csebuetnlp/BanglaParaphrase")
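Once loaded, the object behaves like a mapping from split name to split, so a quick sanity check is to list the splits and their sizes. The following is a minimal sketch: `split_sizes` is a hypothetical helper (not part of the `datasets` library), and the `toy` dictionary is an in-memory stand-in so the snippet runs without downloading anything. With the real data, pass the `ds` object returned above instead.

```python
from typing import Dict, Mapping, Sized


def split_sizes(dataset: Mapping[str, Sized]) -> Dict[str, int]:
    """Return the number of examples in each split of a mapping of splits."""
    return {name: len(split) for name, split in dataset.items()}


# Stand-in data (made up for illustration); replace `toy` with `ds`.
toy = {
    "train": [{"source": "a", "target": "b"}] * 3,
    "validation": [{"source": "c", "target": "d"}],
}

sizes = split_sizes(toy)
for name, n in sizes.items():
    print(f"{name}: {n} examples")
```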

Dataset Structure

Data Instances

One example from the train split of the dataset is given below in JSON format.

{
"source": "বেশিরভাগ সময় প্রকৃতির দয়ার ওপরেই বেঁচে থাকতেন উপজাতিরা।", 
"target": "বেশিরভাগ সময়ই উপজাতিরা প্রকৃতির দয়ার উপর নির্ভরশীল ছিল।"
}

Data Fields

'source': A string representing the source sentence.
'target': A string representing the target sentence.
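As a small illustration of the two fields, the sketch below parses the example instance shown earlier with Python's standard `json` module and reads each field by name. It only assumes that a record is a JSON object with the string keys `source` and `target`, as described above.

```python
import json

# The example instance from this card, as a JSON string.
raw = """
{
"source": "বেশিরভাগ সময় প্রকৃতির দয়ার ওপরেই বেঁচে থাকতেন উপজাতিরা।",
"target": "বেশিরভাগ সময়ই উপজাতিরা প্রকৃতির দয়ার উপর নির্ভরশীল ছিল।"
}
"""

record = json.loads(raw)

# Each record is one paraphrase pair: the original sentence and its paraphrase.
source = record["source"]
target = record["target"]

print("source:", source)
print("target:", target)
```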

Data Splits

The number of examples in the train, validation, and test splits is given below:

Language	ISO 639-1 Code	Train	Validation	Test
Bengali	bn	419,967	23,331	23,332
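The split counts can be cross-checked against the "about 466k" figure in the summary. The sketch below assumes the counts read 419,967 (train), 23,331 (validation), and 23,332 (test); under that reading they sum to 466,630, which agrees with the summary.

```python
# Split sizes, assuming the reading 419,967 / 23,331 / 23,332.
split_counts = {"train": 419_967, "validation": 23_331, "test": 23_332}

total = sum(split_counts.values())
shares = {name: count / total for name, count in split_counts.items()}

# Print each split with its size and its share of the whole dataset.
for name, count in split_counts.items():
    print(f"{name:<10} {count:>7,} {shares[name]:6.2%}")
print(f"{'total':<10} {total:>7,}")
```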

Dataset Creation

Curation Rationale

More information needed

Source Data

Roar Bangla

Initial Data Collection and Normalization

Detailed in the paper

Who are the source language producers?

Detailed in the paper

Annotations

Detailed in the paper

Annotation process

Detailed in the paper

Who are the annotators?

Detailed in the paper

Personal and Sensitive Information

More information needed

Considerations for Using the Data

Additional Information

Dataset Curators

More information needed

Licensing Information

Contents of this repository are restricted to non-commercial research purposes only, under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). Copyright of the dataset contents belongs to the original copyright holders.

Citation Information

@article{akil2022banglaparaphrase,
  title={BanglaParaphrase: A High-Quality Bangla Paraphrase Dataset},
  author={Akil, Ajwad and Sultana, Najrin and Bhattacharjee, Abhik and Shahriyar, Rifat},
  journal={arXiv preprint arXiv:2210.05109},
  year={2022}
}

Contributions

Author:

csebuetnlp

Dataset size:

36.25 MB
