数据集:
csebuetnlp/BanglaParaphrase
任务:
文生文语言:
bn计算机处理:
monolingual大小:
100K<n<1M语言创建人:
found批注创建人:
found源数据集:
original预印本库:
arxiv:2210.05109许可:
cc-by-nc-sa-4.0We present BanglaParaphrase, a high quality synthetic Bangla paraphrase dataset containing about 466k paraphrase pairs. The paraphrases ensures high quality by being semantically coherent and syntactically diverse.
from datasets import load_dataset from datasets import load_dataset ds = load_dataset("csebuetnlp/BanglaParaphrase")
One example from the train part of the dataset is given below in JSON format.
{ "source": "বেশিরভাগ সময় প্রকৃতির দয়ার ওপরেই বেঁচে থাকতেন উপজাতিরা।", "target": "বেশিরভাগ সময়ই উপজাতিরা প্রকৃতির দয়ার উপর নির্ভরশীল ছিল।" }
Dataset with train-dev-test example counts are given below:
Language | ISO 639-1 Code | Train | Validation | Test |
---|---|---|---|---|
Bengali | bn | 419, 967 | 233, 31 | 233, 32 |
Contents of this repository are restricted to only non-commercial research purposes under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0) . Copyright of the dataset contents belongs to the original copyright holders.
@article{akil2022banglaparaphrase, title={BanglaParaphrase: A High-Quality Bangla Paraphrase Dataset}, author={Akil, Ajwad and Sultana, Najrin and Bhattacharjee, Abhik and Shahriyar, Rifat}, journal={arXiv preprint arXiv:2210.05109}, year={2022} }