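Dataset: ai4bharat/samanantar
Tasks: translation
Language creators: found
Annotation creators: no-annotation
Source datasets: original
ArXiv: arxiv:2104.05596
License: Creative Commons Attribution-NonCommercial 4.0 International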
Samanantar is the largest publicly available parallel corpora collection for 11 Indic languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu.
The corpus contains 49.6M sentence pairs between English and Indian languages.
Samanantar contains parallel sentences between English (en) and 11 Indic languages. An example instance from the corpus:
{
'idx': 0,
'src': 'Prime Minister Narendra Modi met Her Majesty Queen Maxima of the Kingdom of the Netherlands today.',
'tgt': 'নতুন দিল্লিতে সোমবার প্রধানমন্ত্রী শ্রী নরেন্দ্র মোদীর সঙ্গে নেদারন্যান্ডসের মহারানী ম্যাক্সিমা সাক্ষাৎ করেন।',
'data_source': 'pmi'
}
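A minimal sketch of how records with this shape might be handled in Python. The field names (`idx`, `src`, `tgt`, `data_source`) come from the instance above; the helper names `to_pair`, `filter_by_source`, and `load_samanantar` are hypothetical, the sample records are made-up placeholders, and the use of a language code such as `"bn"` as the configuration name for the Hugging Face `datasets` library is an assumption that should be checked against the dataset page.

```python
from typing import Any, Dict, Iterable, Iterator, Tuple

Record = Dict[str, Any]


def to_pair(record: Record) -> Tuple[str, str]:
    """Return the (English source, Indic target) sentence pair of one record."""
    return record["src"], record["tgt"]


def filter_by_source(records: Iterable[Record], data_source: str) -> Iterator[Record]:
    """Yield only the records whose 'data_source' field matches."""
    for record in records:
        if record["data_source"] == data_source:
            yield record


def load_samanantar(lang: str = "bn", split: str = "train"):
    """Load one language pair from the Hub.

    Assumption: the configuration name is the Indic language code.
    Requires the `datasets` package and network access; depending on the
    installed version, extra arguments may be needed.
    """
    from datasets import load_dataset  # imported lazily so the helpers above work without it

    return load_dataset("ai4bharat/samanantar", lang, split=split)


# Made-up placeholder records that only mirror the field layout shown above.
sample = [
    {"idx": 0, "src": "Hello.", "tgt": "<target sentence 0>", "data_source": "pmi"},
    {"idx": 1, "src": "Thank you.", "tgt": "<target sentence 1>", "data_source": "other"},
]

pairs = [to_pair(r) for r in filter_by_source(sample, "pmi")]
```

Keeping the helpers independent of the loader means they can be applied to any iterable of records, whether it comes from `load_samanantar` or from a local file.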
Creative Commons Attribution-NonCommercial 4.0 International.
@misc{ramesh2021samanantar,
title={Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages},
author={Gowtham Ramesh and Sumanth Doddapaneni and Aravinth Bheemaraj and Mayank Jobanputra and Raghavan AK and Ajitesh Sharma and Sujit Sahoo and Harshita Diddee and Mahalakshmi J and Divyanshu Kakwani and Navneet Kumar and Aswin Pradeep and Srihari Nagaraj and Kumar Deepak and Vivek Raghavan and Anoop Kunchukuttan and Pratyush Kumar and Mitesh Shantadevi Khapra},
year={2021},
eprint={2104.05596},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Thanks to @albertvillanova for adding this dataset.