Dataset:

ai4bharat/samanantar

Tasks:

Text generation

Translation

Languages:

Multilinguality:

translation

Size:

size_categories:unknown

Language creators:

found

Annotation creators:

no-annotation

Source datasets:

original

Preprint:

arxiv:2104.05596

Other:

conditional-text-generation

License:

cc-by-nc-4.0

Dataset Card for Samanantar

Dataset Summary

Samanantar is the largest publicly available collection of parallel corpora for 11 Indic languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi, Tamil, and Telugu.

The corpus contains 49.6M sentence pairs between English and Indian languages.

Supported Tasks and Leaderboards

[More Information Needed]

Languages

Samanantar contains parallel sentences between English (en) and 11 Indic languages:

Assamese (as),
Bengali (bn),
Gujarati (gu),
Hindi (hi),
Kannada (kn),
Malayalam (ml),
Marathi (mr),
Odia (or),
Punjabi (pa),
Tamil (ta) and
Telugu (te).
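
When working with several language pairs, the codes listed above can be kept in a small lookup table. The following is a minimal sketch that uses only the codes from this card; the names `INDIC_LANGUAGES`, `SOURCE_LANGUAGE`, and `pair_name` are hypothetical helpers for illustration and are not part of the dataset or of any library.

```python
# Language codes for the 11 Indic languages listed in this card.
# The dictionary and helper names are illustrative assumptions.
INDIC_LANGUAGES = {
    "as": "Assamese",
    "bn": "Bengali",
    "gu": "Gujarati",
    "hi": "Hindi",
    "kn": "Kannada",
    "ml": "Malayalam",
    "mr": "Marathi",
    "or": "Odia",
    "pa": "Punjabi",
    "ta": "Tamil",
    "te": "Telugu",
}

# Every pair in the corpus has English on the source side.
SOURCE_LANGUAGE = "en"


def pair_name(target_code: str) -> str:
    """Return a label such as 'en-bn' for an English-to-Indic pair."""
    if target_code not in INDIC_LANGUAGES:
        raise ValueError(f"unknown Indic language code: {target_code!r}")
    return f"{SOURCE_LANGUAGE}-{target_code}"


print(len(INDIC_LANGUAGES))  # 11
print(pair_name("bn"))       # en-bn
```

Rejecting unknown codes early keeps a typo such as `"od"` from silently producing a pair label that the corpus does not contain.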

Dataset Structure

Data Instances

{
  'idx': 0,
  'src': 'Prime Minister Narendra Modi met Her Majesty Queen Maxima of the Kingdom of the Netherlands today.',
  'tgt': 'নতুন দিল্লিতে সোমবার প্রধানমন্ত্রী শ্রী নরেন্দ্র মোদীর সঙ্গে নেদারন্যান্ডসের মহারানী ম্যাক্সিমা সাক্ষাৎ করেন।',
  'data_source': 'pmi'
}
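
The instance above is a flat mapping with four keys. As a rough sketch of how such a record might be typed and sanity-checked in plain Python, the snippet below defines a `TypedDict` and a small validator; `SamanantarRecord` and `is_valid_record` are hypothetical names introduced here, not something the dataset or a library provides.

```python
from typing import TypedDict


class SamanantarRecord(TypedDict):
    """Shape of one record, mirroring the instance shown in this card."""
    idx: int
    src: str
    tgt: str
    data_source: str


def is_valid_record(record: dict) -> bool:
    """Check that a record has exactly the four expected keys and types."""
    expected = {"idx": int, "src": str, "tgt": str, "data_source": str}
    if set(record) != set(expected):
        return False
    return all(isinstance(record[name], kind) for name, kind in expected.items())


# The example instance from this card, copied verbatim.
example: SamanantarRecord = {
    "idx": 0,
    "src": "Prime Minister Narendra Modi met Her Majesty Queen Maxima of the Kingdom of the Netherlands today.",
    "tgt": "নতুন দিল্লিতে সোমবার প্রধানমন্ত্রী শ্রী নরেন্দ্র মোদীর সঙ্গে নেদারন্যান্ডসের মহারানী ম্যাক্সিমা সাক্ষাৎ করেন।",
    "data_source": "pmi",
}

print(is_valid_record(example))  # True
```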

Data Fields

idx (int): ID.
src (string): Sentence in source language (English).
tgt (string): Sentence in the target language (one of the 11 Indic languages).
data_source (string): Source of the data. For created data sources, depending on the target language, it may be one of:
- anuvaad_catchnews
- anuvaad_DD_National
- anuvaad_DD_sports
- anuvaad_drivespark
- anuvaad_dw
- anuvaad_financialexpress
- anuvaad-general_corpus
- anuvaad_goodreturns
- anuvaad_indianexpress
- anuvaad_mykhel
- anuvaad_nativeplanet
- anuvaad_newsonair
- anuvaad_nouns_dictionary
- anuvaad_ocr
- anuvaad_oneindia
- anuvaad_pib
- anuvaad_pib_archives
- anuvaad_prothomalo
- anuvaad_timesofindia
- asianetnews
- betterindia
- bridge
- business_standard
- catchnews
- coursera
- dd_national
- dd_sports
- dwnews
- drivespark
- fin_express
- goodreturns
- gu_govt
- jagran-business
- jagran-education
- jagran-sports
- ie_business
- ie_education
- ie_entertainment
- ie_general
- ie_lifestyle
- ie_news
- ie_sports
- ie_tech
- indiccorp
- jagran-entertainment
- jagran-lifestyle
- jagran-news
- jagran-tech
- khan_academy
- Kurzgesagt
- marketfeed
- mykhel
- nativeplanet
- nptel
- ocr
- oneindia
- pa_govt
- pmi
- pranabmukherjee
- sakshi
- sentinel
- thewire
- toi
- tribune
- vsauce
- wikipedia
- zeebiz
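
Because every record carries a `data_source` value, records can be counted or filtered by origin. The sketch below shows one way to do this on plain Python dictionaries. The rows are made-up placeholders (only the `data_source` values come from the list above), and `count_by_source` and `filter_by_source` are hypothetical helper names.

```python
from collections import Counter
from typing import Iterable

# Placeholder rows for illustration only; the sentences are NOT real corpus data.
rows = [
    {"idx": 0, "src": "placeholder sentence 0", "tgt": "placeholder target 0", "data_source": "pmi"},
    {"idx": 1, "src": "placeholder sentence 1", "tgt": "placeholder target 1", "data_source": "wikipedia"},
    {"idx": 2, "src": "placeholder sentence 2", "tgt": "placeholder target 2", "data_source": "pmi"},
    {"idx": 3, "src": "placeholder sentence 3", "tgt": "placeholder target 3", "data_source": "nptel"},
]


def count_by_source(records: Iterable[dict]) -> Counter:
    """Count how many records come from each data_source value."""
    return Counter(record["data_source"] for record in records)


def filter_by_source(records: Iterable[dict], allowed: Iterable[str]) -> list:
    """Keep only the records whose data_source is in the allowed set."""
    allowed_set = set(allowed)
    return [record for record in records if record["data_source"] in allowed_set]


print(count_by_source(rows)["pmi"])                               # 2
print([r["idx"] for r in filter_by_source(rows, ["pmi", "nptel"])])  # [0, 2, 3]
```

Filtering by source can be useful when a downstream experiment needs only one domain, for example government releases (`pmi`) or lecture material (`nptel`).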

Data Splits

[More Information Needed]

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

Creative Commons Attribution-NonCommercial 4.0 International.

Citation Information

@misc{ramesh2021samanantar,
      title={Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages},
      author={Gowtham Ramesh and Sumanth Doddapaneni and Aravinth Bheemaraj and Mayank Jobanputra and Raghavan AK and Ajitesh Sharma and Sujit Sahoo and Harshita Diddee and Mahalakshmi J and Divyanshu Kakwani and Navneet Kumar and Aswin Pradeep and Srihari Nagaraj and Kumar Deepak and Vivek Raghavan and Anoop Kunchukuttan and Pratyush Kumar and Mitesh Shantadevi Khapra},
      year={2021},
      eprint={2104.05596},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contributions

Thanks to @albertvillanova for adding this dataset.

Author:

ai4bharat

Dataset size:

10.79 KB