Dataset: ai4bharat/samanantar
Tasks: translation
Language creators: found
Annotations creators: no-annotation
Source datasets: original
ArXiv: arxiv:2104.05596
License: cc-by-nc-4.0

Samanantar is the largest publicly available collection of parallel corpora for 11 Indic languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu.
The corpus contains 49.6M sentence pairs between English and these Indian languages.
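Samanantar contains parallel sentences between English (en) and 11 Indic languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu.
An example instance with a Bengali target sentence: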
{
  'idx': 0,
  'src': 'Prime Minister Narendra Modi met Her Majesty Queen Maxima of the Kingdom of the Netherlands today.',
  'tgt': 'নতুন দিল্লিতে সোমবার প্রধানমন্ত্রী শ্রী নরেন্দ্র মোদীর সঙ্গে নেদারন্যান্ডসের মহারানী ম্যাক্সিমা সাক্ষাৎ করেন।',
  'data_source': 'pmi'
}
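The instance above has four fields: `idx`, `src` (the English sentence), `tgt` (the Indic-language sentence), and `data_source`. The following is a minimal sketch of how such records might be handled once loaded. The helper names `to_translation_pair` and `count_by_source` and the `SamanantarRecord` type are hypothetical and not part of the dataset or any library, and the `bn` language code for Bengali is an assumption based on ISO 639-1.

```python
from collections import Counter
from typing import Dict, Iterable, TypedDict


class SamanantarRecord(TypedDict):
    """Shape of one instance, as shown in the example above."""
    idx: int
    src: str          # English sentence
    tgt: str          # Indic-language sentence
    data_source: str  # origin tag of the pair, e.g. 'pmi'


def to_translation_pair(record: SamanantarRecord, tgt_lang: str) -> Dict[str, str]:
    """Map a record to a {language code: sentence} dict.

    `tgt_lang` is supplied by the caller (hypothetical helper); the record
    itself does not carry the target language code.
    """
    return {"en": record["src"].strip(), tgt_lang: record["tgt"].strip()}


def count_by_source(records: Iterable[SamanantarRecord]) -> Dict[str, int]:
    """Count how many sentence pairs come from each `data_source`."""
    return dict(Counter(r["data_source"] for r in records))


# The sample instance from the card, used here as in-memory test data.
sample: SamanantarRecord = {
    "idx": 0,
    "src": "Prime Minister Narendra Modi met Her Majesty Queen Maxima "
           "of the Kingdom of the Netherlands today.",
    "tgt": "নতুন দিল্লিতে সোমবার প্রধানমন্ত্রী শ্রী নরেন্দ্র মোদীর সঙ্গে "
           "নেদারন্যান্ডসের মহারানী ম্যাক্সিমা সাক্ষাৎ করেন।",
    "data_source": "pmi",
}

pair = to_translation_pair(sample, "bn")  # 'bn' = Bengali (assumed code)
source_counts = count_by_source([sample])
```

Keeping the language code outside the record mirrors the instance layout shown above, where only `src` and `tgt` are stored and the language pair is implied by which subset the record came from.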
Samanantar is released under the Creative Commons Attribution-NonCommercial 4.0 International license.
@misc{ramesh2021samanantar,
  title={Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages},
  author={Gowtham Ramesh and Sumanth Doddapaneni and Aravinth Bheemaraj and Mayank Jobanputra and Raghavan AK and Ajitesh Sharma and Sujit Sahoo and Harshita Diddee and Mahalakshmi J and Divyanshu Kakwani and Navneet Kumar and Aswin Pradeep and Srihari Nagaraj and Kumar Deepak and Vivek Raghavan and Anoop Kunchukuttan and Pratyush Kumar and Mitesh Shantadevi Khapra},
  year={2021},
  eprint={2104.05596},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
Thanks to @albertvillanova for adding this dataset.