Dataset:

csebuetnlp/squad_bn

Task:

Question Answering

Subtasks:

open-domain-qa extractive-qa

Language:

Bengali

Multilinguality:

monolingual

Size:

100K<n<1M

Language creators:

found

Annotation creators:

machine-generated

Source datasets:

extended

ArXiv:

arxiv:2101.00204 arxiv:2007.01852 arxiv:1606.05250

License:

cc-by-nc-sa-4.0

Dataset Card for squad_bn

Dataset Summary

This is a Question Answering (QA) dataset for Bengali, curated from the SQuAD 2.0 and TyDi QA datasets using the state-of-the-art English-to-Bengali translation model introduced here.

Supported Tasks and Leaderboards

More information needed

Languages

Bengali

Usage

from datasets import load_dataset

# Fetches the dataset from the Hugging Face Hub.
dataset = load_dataset("csebuetnlp/squad_bn")

Dataset Structure

Data Instances

One example from the dataset is given below in JSON format.

{
  "title": "শেখ মুজিবুর রহমান",
  "paragraphs": [
      {
          "qas": [
              {
                  "answers": [
                      {
                          "answer_start": 19,
                          "text": "১৭ মার্চ ১৯২০"
                      }
                  ],
                  "id": "bengali--981248442377505718-0-2649",
                  "question": "শেখ মুজিবুর রহমান কবে জন্মগ্রহণ করেন ?"
              }
          ],
          "context": "শেখ মুজিবুর রহমান (১৭ মার্চ ১৯২০ - ১৫ আগস্ট ১৯৭৫) বাংলাদেশের প্রথম রাষ্ট্রপতি ও ভারতীয় উপমহাদেশের একজন অন্যতম প্রভাবশালী রাজনৈতিক ব্যক্তিত্ব যিনি বাঙালীর অধিকার রক্ষায় ব্রিটিশ ভারত থেকে ভারত বিভাজন আন্দোলন এবং পরবর্তীতে  পূর্ব পাকিস্তান থেকে বাংলাদেশ প্রতিষ্ঠার সংগ্রামে নেতৃত্ব প্রদান করেন। প্রাচীন বাঙ্গালি সভ্যতার আধুনিক স্থপতি হিসাবে শেখ মুজিবুর রহমানকে বাংলাদেশের জাতির জনক বা জাতির পিতা বলা হয়ে থাকে। তিনি মাওলানা আব্দুল হামিদ খান ভাসানী প্রতিষ্ঠিত আওয়ামী লীগের সভাপতি, বাংলাদেশের প্রথম রাষ্ট্রপতি এবং পরবর্তীতে এদেশের প্রধানমন্ত্রীর দায়িত্ব পালন করেন। জনসাধারণের কাছে তিনি শেখ মুজিব এবং শেখ সাহেব হিসাবে বেশি পরিচিত এবং তার উপাধি বঙ্গবন্ধু। তার কন্যা শেখ হাসিনা বাংলাদেশ আওয়ামী লীগের বর্তমান সভানেত্রী এবং বাংলাদেশের বর্তমান প্রধানমন্ত্রী।"
      }
  ]
}
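In SQuAD-style records such as the one above, answer_start is a character offset into context. The following is a minimal sketch of checking that offset. It assumes offsets count Unicode code points, which is how Python indexes str and which agrees with the value 19 in this example. The helper name answer_span_matches is hypothetical, and the context is truncated to its first clause for brevity.

```python
def answer_span_matches(context: str, answer: dict) -> bool:
    """Return True if the answer text sits at answer_start in context."""
    start = answer["answer_start"]
    text = answer["text"]
    # Python slices str by Unicode code points, not bytes or graphemes.
    return context[start:start + len(text)] == text


# First clause of the context from the example record (truncated for brevity).
context = "শেখ মুজিবুর রহমান (১৭ মার্চ ১৯২০ - ১৫ আগস্ট ১৯৭৫)"
answer = {"answer_start": 19, "text": "১৭ মার্চ ১৯২০"}

print(answer_span_matches(context, answer))
```

A check like this can be useful for machine-translated records, where the translated answer may no longer appear verbatim at the recorded offset.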

Data Fields

The data fields are as follows:

id : a string feature.
title : a string feature.
context : a string feature.
question : a string feature.
answers : a dictionary feature containing:
- text : a string feature.
- answer_start : an int32 feature.
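
The example under Data Instances is nested (title, then paragraphs, then qas), while the fields listed here are flat. The following is a minimal sketch of one way to flatten a nested record into rows with these five fields. It assumes answers is stored column-wise, as parallel lists under text and answer_start, which this card does not state explicitly. flatten_record is a hypothetical helper, not part of the dataset's loader.

```python
from typing import Iterator


def flatten_record(record: dict) -> Iterator[dict]:
    """Yield one flat row per question in a nested SQuAD-style record."""
    for paragraph in record["paragraphs"]:
        for qa in paragraph["qas"]:
            yield {
                "id": qa["id"],
                "title": record["title"],
                "context": paragraph["context"],
                "question": qa["question"],
                # Column-wise layout (assumed): parallel lists, both empty
                # when the question has no answer.
                "answers": {
                    "text": [a["text"] for a in qa["answers"]],
                    "answer_start": [a["answer_start"] for a in qa["answers"]],
                },
            }


# Toy record with placeholder strings, shaped like the nested example.
nested = {
    "title": "T",
    "paragraphs": [
        {
            "context": "The sky is blue.",
            "qas": [
                {"id": "q1", "question": "What colour is the sky?",
                 "answers": [{"answer_start": 11, "text": "blue"}]},
                {"id": "q2", "question": "Who painted it?", "answers": []},
            ],
        }
    ],
}

rows = list(flatten_record(nested))
for row in rows:
    print(row["id"], row["answers"])
```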

Data Splits

split	count
train	127771
validation	2502
test	2504

Dataset Creation

For the training set, we translated the complete SQuAD 2.0 dataset using the English-to-Bangla translation model introduced here. Because automatic translation can introduce errors, we computed the similarity between each translated sentence and its original using Language-Agnostic BERT Sentence Embeddings (LaBSE). A datapoint was accepted only if all of its constituent sentences had a similarity score over 0.7.
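
The acceptance rule above can be sketched as follows. This is an illustration under assumptions, not the authors' code: it takes the similarity to be cosine similarity between sentence embeddings, and it uses tiny hand-made vectors in place of real LaBSE embeddings. The names cosine_similarity and accept_datapoint are hypothetical.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def accept_datapoint(pairs, threshold: float = 0.7) -> bool:
    """Accept only if every (source, translation) pair scores over threshold."""
    return all(cosine_similarity(src, tgt) > threshold for src, tgt in pairs)


# Toy 2-d stand-ins for embeddings; real sentence embeddings are much larger.
good = [([1.0, 0.0], [0.9, 0.1]), ([0.0, 1.0], [0.2, 0.8])]
bad = good + [([1.0, 0.0], [0.0, 1.0])]  # one orthogonal pair, similarity 0

print(accept_datapoint(good), accept_datapoint(bad))
```

Because the rule requires every sentence pair to pass, a single poorly translated sentence is enough to reject the whole datapoint.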

Since the TyDi QA Gold Passage task guarantees that the given context contains the answer, and we want our QA task to be analogous to SQuAD 2.0 (which includes unanswerable questions), we also include examples from the Passage Selection task that have no answer for the given question. We distribute the resulting examples from the TyDi QA training and validation sets (which are publicly available) evenly between our test and validation sets.
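
One way to distribute a pool of examples evenly across two sets is a seeded shuffle followed by a cut at the midpoint. The card does not describe the exact procedure the authors used, so the sketch below is only an illustration; split_evenly and the seed value are hypothetical.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_evenly(examples: Sequence[T], seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of the examples and cut it into two near-equal halves."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # local RNG keeps the split reproducible
    middle = len(shuffled) // 2
    return shuffled[:middle], shuffled[middle:]


pool = [f"example-{i}" for i in range(11)]
validation, test = split_evenly(pool)
print(len(validation), len(test))
```

With an odd pool size, the second half receives the extra example.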

Curation Rationale

More information needed

Source Data

SQuAD 2.0, TyDi QA

Initial Data Collection and Normalization

More information needed

Who are the source language producers?

More information needed

Annotations

More information needed

Annotation process

More information needed

Who are the annotators?

More information needed

Personal and Sensitive Information

More information needed

Considerations for Using the Data

Additional Information

Dataset Curators

More information needed

Licensing Information

Contents of this repository are restricted to non-commercial research purposes only, under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). Copyright of the dataset contents belongs to the original copyright holders.

Citation Information

If you use the dataset, please cite the following paper:

@misc{bhattacharjee2021banglabert,
      title={BanglaBERT: Combating Embedding Barrier in Multilingual Models for Low-Resource Language Understanding},
      author={Abhik Bhattacharjee and Tahmid Hasan and Kazi Samin and Md Saiful Islam and M. Sohel Rahman and Anindya Iqbal and Rifat Shahriyar},
      year={2021},
      eprint={2101.00204},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contributions

Thanks to @abhik1505040 and @Tahmid for adding this dataset.

Author:

csebuetnlp

Dataset size:

8.06 MB
