Dataset: ai4bharat/IndicWikiBio
Multilinguality: multilingual
Language creators: found
Annotation creators: no-annotation
Source datasets: none. Originally generated from www.wikimedia.org.
ArXiv: arxiv:2203.05437
License: cc-by-nc-4.0

This is the WikiBio dataset released as part of the IndicNLG Suite. Each example has four fields: id, infobox, serialized_infobox and summary. The dataset covers nine languages: as, bn, hi, kn, ml, or, pa, ta and te. It contains 57,426 examples in total.
Tasks: WikiBio
Leaderboards: There is currently no leaderboard for this dataset.
One random example from the hi dataset is given below in JSON format.

    {
      "id": 26,
      "infobox": "name_1:सी॰\tname_2:एल॰\tname_3:रुआला\toffice_1:सांसद\toffice_2:-\toffice_3:मिजोरम\toffice_4:लोक\toffice_5:सभा\toffice_6:निर्वाचन\toffice_7:क्षेत्र\toffice_8:।\toffice_9:मिजोरम\tterm_1:2014\tterm_2:से\tterm_3:2019\tnationality_1:भारतीय",
      "serialized_infobox": "<TAG> name </TAG> सी॰ एल॰ रुआला <TAG> office </TAG> सांसद - मिजोरम लोक सभा निर्वाचन क्षेत्र । मिजोरम <TAG> term </TAG> 2014 से 2019 <TAG> nationality </TAG> भारतीय",
      "summary": "सी॰ एल॰ रुआला भारत की सोलहवीं लोक सभा के सांसद हैं।"
    }
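Judging only from the example record, the `infobox` field appears to be a tab-separated list of `field_position:token` items, and `serialized_infobox` appears to join the tokens of each field after a `<TAG> field </TAG>` marker. The following is a minimal sketch under that assumption; the helper names `parse_infobox` and `serialize_infobox` are hypothetical and are not part of the dataset or of the IndicNLG code.

```python
from itertools import groupby


def parse_infobox(infobox: str) -> list[tuple[str, str]]:
    """Group tab-separated `field_position:token` items into (field, value) pairs."""
    pairs = []
    for item in infobox.split("\t"):
        key, _, token = item.partition(":")      # split on the first colon only
        field, sep, index = key.rpartition("_")  # strip the trailing "_<n>" position
        if not (sep and index.isdigit()):
            field = key                          # key had no numeric suffix
        pairs.append((field, token))
    # Consecutive tokens of the same field form one space-joined value.
    return [
        (field, " ".join(token for _, token in group))
        for field, group in groupby(pairs, key=lambda pair: pair[0])
    ]


def serialize_infobox(fields: list[tuple[str, str]]) -> str:
    """Render (field, value) pairs in the `<TAG> field </TAG> value` layout."""
    return " ".join(f"<TAG> {field} </TAG> {value}" for field, value in fields)


# Abbreviated from the Hindi example (the `office` field is omitted for brevity).
infobox = (
    "name_1:सी॰\tname_2:एल॰\tname_3:रुआला"
    "\tterm_1:2014\tterm_2:से\tterm_3:2019"
    "\tnationality_1:भारतीय"
)
fields = parse_infobox(infobox)
print(serialize_infobox(fields))
```

Whether every record in every language follows this layout is not stated in the card, so treat the parser as a starting point and check it against real records before relying on it.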
Here is the number of samples in each split for all the languages.
| Language | ISO 639-1 Code | Train | Test | Val |
| ---------- | ---------- | ---------- | ---------- | ---------- |
| Assamese | as | 1,300 | 391 | 381 |
| Bengali | bn | 4,615 | 1,521 | 1,567 |
| Hindi | hi | 5,684 | 1,919 | 1,853 |
| Kannada | kn | 1,188 | 389 | 383 |
| Malayalam | ml | 5,620 | 1,835 | 1,896 |
| Oriya | or | 1,687 | 558 | 515 |
| Punjabi | pa | 3,796 | 1,227 | 1,331 |
| Tamil | ta | 8,169 | 2,701 | 2,632 |
| Telugu | te | 2,594 | 854 | 820 |
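As a quick consistency check, the per-split counts in the table can be summed and compared with the stated total of 57,426 examples. The sketch below hard-codes the numbers from the table; the name `SPLIT_SIZES` is illustrative only.

```python
# Per-language split sizes copied from the table: (train, test, val).
SPLIT_SIZES = {
    "as": (1300, 391, 381),
    "bn": (4615, 1521, 1567),
    "hi": (5684, 1919, 1853),
    "kn": (1188, 389, 383),
    "ml": (5620, 1835, 1896),
    "or": (1687, 558, 515),
    "pa": (3796, 1227, 1331),
    "ta": (8169, 2701, 2632),
    "te": (2594, 854, 820),
}

# Sum each split across languages, then add the three splits together.
train_total = sum(train for train, _, _ in SPLIT_SIZES.values())
test_total = sum(test for _, test, _ in SPLIT_SIZES.values())
val_total = sum(val for _, _, val in SPLIT_SIZES.values())
grand_total = train_total + test_total + val_total

print(f"train={train_total:,} test={test_total:,} val={val_total:,} total={grand_total:,}")
```

The three splits add up to 34,653 + 11,395 + 11,378 = 57,426, which agrees with the dataset size given in the description.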
Initial Data Collection and Normalization
Who are the source language producers? [More information needed]
Annotation process: [More information needed]
Who are the annotators? [More information needed]
The contents of this repository are restricted to non-commercial research purposes only, under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). Copyright of the dataset contents belongs to the original copyright holders.
If you use any of the datasets, models or code modules, please cite the following paper:
    @inproceedings{Kumar2022IndicNLGSM,
      title={IndicNLG Suite: Multilingual Datasets for Diverse NLG Tasks in Indic Languages},
      author={Aman Kumar and Himani Shrotriya and Prachi Sahu and Raj Dabre and Ratish Puduppully and Anoop Kunchukuttan and Amogh Mishra and Mitesh M. Khapra and Pratyush Kumar},
      year={2022},
      url = "https://arxiv.org/abs/2203.05437",
    }