Dataset:

csebuetnlp/BanglaNMT

Tasks:

Translation

Languages:

bn en

Multilinguality:

translation

Size:

1M<n<10M

Language creators:

found

Annotations creators:

other

Dataset Card for BanglaNMT

Dataset Summary

This is the largest machine translation (MT) dataset for Bengali-English, curated using the novel sentence alignment methods introduced in the paper cited under Citation Information.

Note: This is a filtered version of the original dataset that the authors used for NMT training. For the complete set, refer to the official repository (https://github.com/csebuetnlp/banglanmt).

Supported Tasks and Leaderboards

More information needed

Languages

  • Bengali
  • English

Usage

from datasets import load_dataset
dataset = load_dataset("csebuetnlp/BanglaNMT")

Dataset Structure

Data Instances

One example from the dataset is given below in JSON format.

{
  "bn": "বিমানবন্দরে যুক্তরাজ্যে নিযুক্ত বাংলাদেশ হাইকমিশনার সাঈদা মুনা তাসনীম ও লন্ডনে বাংলাদেশ মিশনের জ্যেষ্ঠ কর্মকর্তারা তাকে বিদায় জানান।",
  "en": "Bangladesh High Commissioner to the United Kingdom Saida Muna Tasneen and senior officials of Bangladesh Mission in London saw him off at the airport."
}

Data Fields

The data fields are as follows:

  • bn : a string feature indicating the Bengali sentence.
  • en : a string feature indicating the English translation.
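Because each record carries both sides of the pair, the translation direction is chosen by the consumer. The sketch below shows one way to turn a record into a (source, target) tuple. `to_pair` is a hypothetical helper written for illustration, not part of the dataset or of the `datasets` library, and the sample record uses placeholder strings rather than real data.

```python
from typing import Mapping, Tuple

# Field names as listed in the dataset card.
FIELDS = ("bn", "en")


def to_pair(record: Mapping[str, str], source: str = "bn") -> Tuple[str, str]:
    """Return (source_text, target_text) for one record.

    Hypothetical helper: `source` selects the translation direction,
    "bn" for Bengali-to-English and "en" for English-to-Bengali.
    """
    if source not in FIELDS:
        raise ValueError(f"source must be one of {FIELDS}, got {source!r}")
    # Reject records that do not match the documented schema.
    for name in FIELDS:
        if not isinstance(record.get(name), str):
            raise TypeError(f"field {name!r} is missing or not a string")
    target = "en" if source == "bn" else "bn"
    return record[source], record[target]


# Placeholder record; real records hold full sentences.
sample = {"bn": "<Bengali sentence>", "en": "<English translation>"}
print(to_pair(sample))               # Bengali as source
print(to_pair(sample, source="en"))  # English as source
```

A function of this shape could be applied to each row when iterating over a loaded split, assuming the rows are exposed as dictionaries with the two fields above.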

Data Splits

split        count
train        2379749
validation   597
test         1000
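The counts above can serve as a quick sanity check after downloading. The following is a minimal sketch, assuming the hosted data still matches this table. `split_mismatches` and `EXPECTED_ROWS` are hypothetical names introduced here for illustration.

```python
from typing import Dict, Mapping, Optional, Tuple

# Split sizes copied from the table in this card.
EXPECTED_ROWS = {"train": 2379749, "validation": 597, "test": 1000}


def split_mismatches(
    actual: Mapping[str, int],
) -> Dict[str, Tuple[int, Optional[int]]]:
    """Return {split: (expected, found)} for every split whose size differs.

    A split that is absent from `actual` is reported with found=None.
    """
    mismatches = {}
    for name, expected in EXPECTED_ROWS.items():
        found = actual.get(name)
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


# Example with hand-written counts; an empty dict means everything matches.
print(split_mismatches({"train": 2379749, "validation": 597, "test": 1000}))
```

With the dataset loaded as shown under Usage, the input mapping could be built as `{name: split.num_rows for name, split in dataset.items()}`.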

Dataset Creation

More information needed

Curation Rationale

More information needed

Source Data

More information needed

Initial Data Collection and Normalization

More information needed

Who are the source language producers?

More information needed

Annotations

More information needed

Annotation process

More information needed

Who are the annotators?

More information needed

Personal and Sensitive Information

More information needed

Considerations for Using the Data

Social Impact of Dataset

More information needed

Discussion of Biases

More information needed

Other Known Limitations

More information needed

Additional Information

Dataset Curators

More information needed

Licensing Information

Contents of this repository are restricted to non-commercial research purposes only, under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). Copyright of the dataset contents belongs to the original copyright holders.

Citation Information

If you use the dataset, please cite the following paper:

@inproceedings{hasan-etal-2020-low,
    title = "Not Low-Resource Anymore: Aligner Ensembling, Batch Filtering, and New Datasets for {B}engali-{E}nglish Machine Translation",
    author = "Hasan, Tahmid  and
      Bhattacharjee, Abhik  and
      Samin, Kazi  and
      Hasan, Masum  and
      Basak, Madhusudan  and
      Rahman, M. Sohel  and
      Shahriyar, Rifat",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.207",
    doi = "10.18653/v1/2020.emnlp-main.207",
    pages = "2612--2623",
    abstract = "Despite being the seventh most widely spoken language in the world, Bengali has received much less attention in machine translation literature due to being low in resources. Most publicly available parallel corpora for Bengali are not large enough; and have rather poor quality, mostly because of incorrect sentence alignments resulting from erroneous sentence segmentation, and also because of a high volume of noise present in them. In this work, we build a customized sentence segmenter for Bengali and propose two novel methods for parallel corpus creation on low-resource setups: aligner ensembling and batch filtering. With the segmenter and the two methods combined, we compile a high-quality Bengali-English parallel corpus comprising of 2.75 million sentence pairs, more than 2 million of which were not available before. Training on neural models, we achieve an improvement of more than 9 BLEU score over previous approaches to Bengali-English machine translation. We also evaluate on a new test set of 1000 pairs made with extensive quality control. We release the segmenter, parallel corpus, and the evaluation set, thus elevating Bengali from its low-resource status. To the best of our knowledge, this is the first ever large scale study on Bengali-English machine translation. We believe our study will pave the way for future research on Bengali-English machine translation as well as other low-resource languages. Our data and code are available at https://github.com/csebuetnlp/banglanmt.",
}

Contributions

Thanks to @abhik1505040 and @Tahmid for adding this dataset.