Dataset:

csebuetnlp/xnli_bn

Tasks:

text-classification

Sub-tasks:

natural-language-inference

Languages:

Bengali

Multilinguality:

monolingual

Size:

100K<n<1M

Language creators:

found

Annotation creators:

machine-generated

Source datasets:

extended

ArXiv:

arxiv:2101.00204 arxiv:2007.01852

License:

cc-by-nc-sa-4.0


Dataset Card for xnli_bn

Dataset Summary

This is a Natural Language Inference (NLI) dataset for Bengali, curated from the subset of MNLI data used in XNLI and translated with a state-of-the-art English-to-Bengali translation model (arXiv:2007.01852).

Supported Tasks and Leaderboards

More information needed

Languages

Bengali

Usage

from datasets import load_dataset
dataset = load_dataset("csebuetnlp/xnli_bn")

Dataset Structure

Data Instances

One example from the dataset is given below in JSON format.

{
  "sentence1": "আসলে, আমি এমনকি এই বিষয়ে চিন্তাও করিনি, কিন্তু আমি এত হতাশ হয়ে পড়েছিলাম যে, শেষ পর্যন্ত আমি আবার তার সঙ্গে কথা বলতে শুরু করেছিলাম",
  "sentence2": "আমি তার সাথে আবার কথা বলিনি।",
  "label": "contradiction"
}

Data Fields

The data fields are as follows:

sentence1: a string feature indicating the premise.
sentence2: a string feature indicating the hypothesis.
label: a classification label, where possible values are contradiction (0), entailment (1), neutral (2).
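The instance shown earlier carries the label as a string, while the field description lists integer ids. The following is a minimal sketch of a helper that accepts either form. The names LABEL_NAMES, label_to_id, and label_to_name are illustrative and not part of the dataset or the datasets library; only the id order (contradiction, entailment, neutral) is taken from the description above.

```python
# Label order taken from the "Data Fields" description:
# contradiction (0), entailment (1), neutral (2).
LABEL_NAMES = ["contradiction", "entailment", "neutral"]
LABEL_TO_ID = {name: idx for idx, name in enumerate(LABEL_NAMES)}


def label_to_id(label):
    """Return the integer id for a label given as an int id or a string name."""
    if isinstance(label, int):
        if 0 <= label < len(LABEL_NAMES):
            return label
        raise ValueError(f"label id out of range: {label}")
    # Raises KeyError for an unknown label name.
    return LABEL_TO_ID[label]


def label_to_name(label):
    """Return the string name for a label given as an int id or a string name."""
    return LABEL_NAMES[label_to_id(label)]
```

Normalizing labels this way keeps downstream code independent of whether a given loader yields string names or integer ids.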

Data Splits

split	count
train	381449
validation	2419
test	4895
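As a quick sanity check, the split sizes above can be totalled and turned into proportions. This is only a sketch of the arithmetic; the SPLIT_COUNTS dictionary simply restates the table, and comparing it against a loaded copy of the dataset is left as a comment because it requires downloading the data.

```python
# Split sizes copied from the table above.
SPLIT_COUNTS = {"train": 381449, "validation": 2419, "test": 4895}

# Total number of examples across all splits.
total = sum(SPLIT_COUNTS.values())

# Fraction of the whole dataset held by each split.
shares = {name: count / total for name, count in SPLIT_COUNTS.items()}

for name, count in SPLIT_COUNTS.items():
    print(f"{name}: {count} examples ({shares[name]:.2%} of {total})")

# With a loaded dataset, the observed sizes could be compared like this:
#   observed = {name: len(split) for name, split in dataset.items()}
#   assert observed == SPLIT_COUNTS
```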

Dataset Creation

The dataset was curated following the same procedure as XNLI: we translated the MultiNLI training data using the English-to-Bangla translation model introduced in arXiv:2007.01852. Because automatic translation can introduce errors, we computed the similarity between each translation and its original sentence using Language-Agnostic BERT Sentence Embeddings (LaBSE). Translations whose similarity fell below a threshold of 0.70 were discarded.
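The filtering step can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the card does not name the similarity measure, so cosine similarity is assumed here, and embed is a hypothetical callable standing in for a LaBSE encoder that maps a sentence to a vector. Only the 0.70 threshold comes from the description above.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_translations(pairs, embed, threshold=0.70):
    """Keep (source, translation) pairs whose similarity is at least `threshold`.

    `embed` is a hypothetical sentence encoder (e.g. a LaBSE wrapper) that
    returns a sequence of floats; pairs scoring below the threshold are dropped.
    """
    kept = []
    for source, translation in pairs:
        score = cosine_similarity(embed(source), embed(translation))
        if score >= threshold:
            kept.append((source, translation))
    return kept


# Toy usage with hand-made 2-d "embeddings" instead of a real encoder.
toy_vectors = {"src": [1.0, 0.0], "good": [0.9, 0.1], "bad": [0.0, 1.0]}
kept_pairs = filter_translations(
    [("src", "good"), ("src", "bad")], toy_vectors.__getitem__
)
```

Passing the encoder in as a parameter keeps the filter testable without downloading a model; a real run would supply a function backed by LaBSE.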

Curation Rationale

More information needed

Source Data

XNLI

Initial Data Collection and Normalization

More information needed

Who are the source language producers?

More information needed

Annotations

More information needed

Annotation process

More information needed

Who are the annotators?

More information needed

Personal and Sensitive Information

More information needed

Considerations for Using the Data

Additional Information

Dataset Curators

More information needed

Licensing Information

Contents of this repository are restricted to non-commercial research purposes only, under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). Copyright of the dataset contents belongs to the original copyright holders.

Citation Information

If you use the dataset, please cite the following paper:

@misc{bhattacharjee2021banglabert,
      title={BanglaBERT: Combating Embedding Barrier in Multilingual Models for Low-Resource Language Understanding},
      author={Abhik Bhattacharjee and Tahmid Hasan and Kazi Samin and Md Saiful Islam and M. Sohel Rahman and Anindya Iqbal and Rifat Shahriyar},
      year={2021},
      eprint={2101.00204},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contributions

Thanks to @abhik1505040 and @Tahmid for adding this dataset.

Author:

csebuetnlp

Dataset size:

20.46 MB
