Dataset: csebuetnlp/xnli_bn
Tasks: Text Classification
Languages: bn
Multilinguality: monolingual
Size: 100K<n<1M
Language creators: found
Annotation creators: machine-generated
Source datasets: extended
License: cc-by-nc-sa-4.0

This is a Natural Language Inference (NLI) dataset for Bengali, curated using the subset of MNLI data used in XNLI and the state-of-the-art English-to-Bengali translation model introduced here.
The dataset can be loaded with the `datasets` library:

    from datasets import load_dataset

    dataset = load_dataset("csebuetnlp/xnli_bn")

One example from the dataset is given below in JSON format.

    {
      "sentence1": "আসলে, আমি এমনকি এই বিষয়ে চিন্তাও করিনি, কিন্তু আমি এত হতাশ হয়ে পড়েছিলাম যে, শেষ পর্যন্ত আমি আবার তার সঙ্গে কথা বলতে শুরু করেছিলাম",
      "sentence2": "আমি তার সাথে আবার কথা বলিনি।",
      "label": "contradiction"
    }

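A record in the JSON form shown above can be sanity-checked before use. The following is a minimal sketch, not part of the dataset tooling: `validate_record`, `REQUIRED_FIELDS`, and `EXPECTED_LABELS` are hypothetical names, and the three-way label set is an assumption based on the usual NLI convention (only `contradiction` appears in the example). When loaded through `datasets`, the label may be encoded as an integer class id rather than a string, in which case the label check would need adjusting.

```python
import json

# Fields visible in the example record.
REQUIRED_FIELDS = ("sentence1", "sentence2", "label")

# Assumption: the usual three-way NLI label set. Only "contradiction"
# is confirmed by the example record.
EXPECTED_LABELS = {"contradiction", "entailment", "neutral"}


def validate_record(record):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        # Every field is expected to be a non-empty string in the JSON form.
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    label = record.get("label")
    if isinstance(label, str) and label not in EXPECTED_LABELS:
        problems.append(f"unexpected label: {label}")
    return problems


raw = '''
{
  "sentence1": "আসলে, আমি এমনকি এই বিষয়ে চিন্তাও করিনি, কিন্তু আমি এত হতাশ হয়ে পড়েছিলাম যে, শেষ পর্যন্ত আমি আবার তার সঙ্গে কথা বলতে শুরু করেছিলাম",
  "sentence2": "আমি তার সাথে আবার কথা বলিনি।",
  "label": "contradiction"
}
'''
record = json.loads(raw)
problems = validate_record(record)
print(problems)
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue with a record in a single pass.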
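The data fields are `sentence1` (the premise), `sentence2` (the hypothesis), and `label` (the inference class, such as `contradiction`). The dataset is split as follows:
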
| split | count |
|---|---|
| train | 381449 |
| validation | 2419 |
| test | 4895 |

The dataset curation procedure was the same as for the XNLI dataset: we translated the MultiNLI training data using the English-to-Bangla translation model introduced here. Because automatic translation can introduce errors, we used the Language-Agnostic BERT Sentence Embeddings (LaBSE) of the translations and the original sentences to compute their similarity. All sentences with a similarity below the threshold of 0.70 were discarded.
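The filtering step described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: it assumes the similarity measure is cosine similarity between embedding vectors, that the embeddings have already been computed, and that a translation is kept when its score is at least 0.70. The names `cosine_similarity` and `filter_translations` are hypothetical, and toy two-dimensional vectors stand in for real LaBSE embeddings.

```python
import math

SIMILARITY_THRESHOLD = 0.70  # threshold stated in the dataset description


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_translations(pairs, source_vecs, target_vecs,
                        threshold=SIMILARITY_THRESHOLD):
    """Keep the (source, translation) pairs whose embeddings are similar enough.

    Assumption: cosine similarity is the measure, and pairs scoring
    below the threshold are discarded.
    """
    kept = []
    for pair, src, tgt in zip(pairs, source_vecs, target_vecs):
        if cosine_similarity(src, tgt) >= threshold:
            kept.append(pair)
    return kept


# Toy vectors standing in for embeddings of the original sentences
# and of their translations.
pairs = [("en_0", "bn_0"), ("en_1", "bn_1")]
source_vecs = [[1.0, 0.0], [1.0, 0.0]]
target_vecs = [[0.9, 0.1], [0.1, 0.9]]

kept = filter_translations(pairs, source_vecs, target_vecs)
print(kept)
```

In this toy input the first pair points in nearly the same direction and is kept, while the second is close to orthogonal and is dropped.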
Contents of this repository are restricted to non-commercial research purposes only, under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). Copyright of the dataset contents belongs to the original copyright holders.
If you use the dataset, please cite the following paper:

    @misc{bhattacharjee2021banglabert,
      title={BanglaBERT: Combating Embedding Barrier in Multilingual Models for Low-Resource Language Understanding},
      author={Abhik Bhattacharjee and Tahmid Hasan and Kazi Samin and Md Saiful Islam and M. Sohel Rahman and Anindya Iqbal and Rifat Shahriyar},
      year={2021},
      eprint={2101.00204},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
    }

Thanks to @abhik1505040 and @Tahmid for adding this dataset.