Dataset:
ai4bharat/Bhasha-Abhijnaanam
Bhasha-Abhijnaanam is a language identification test set covering both native-script and romanized text in 22 Indic languages.
Assamese (asm) | Hindi (hin) | Maithili (mai) | Nepali (nep) | Sanskrit (san) | Tamil (tam) |
Bengali (ben) | Kannada (kan) | Malayalam (mal) | Oriya (ori) | Santali (sat) | Telugu (tel) |
Bodo (brx) | Kashmiri (kas) | Manipuri (mni) | Punjabi (pan) | Sindhi (snd) | Urdu (urd) |
Gujarati (guj) | Konkani (kok) | Marathi (mar) |
A random sample from the Hindi (hin) test set:

    {
      "unique_identifier": "hin1",
      "native sentence": "",
      "romanized sentence": "",
      "language": "Hindi",
      "script": "Devanagari",
      "source": "Dakshina"
    }
unique_identifier (string): 3-letter language code followed by a number that is unique within the test set.
native sentence (string): A sentence in an Indic language, written in its native script.
romanized sentence (string): Transliteration of the native sentence into the Latin script (romanized sentence).
language (string): Language of the native sentence.
script (string): Script in which the native sentence is written.
source (string): Source of the data.
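As an illustration of the fields described above, the following minimal sketch checks that a record carries the six documented fields and splits `unique_identifier` into its language code and number. The helper names (`parse_unique_identifier`, `missing_fields`) are hypothetical, and the identifier pattern (three lowercase letters followed by digits) is an assumption based on the `hin1` sample; identifiers for languages with more than one script may follow a different shape.

```python
import re
from typing import List, NamedTuple

# Field names as documented in this card (note the spaces in two of them).
EXPECTED_FIELDS = (
    "unique_identifier",
    "native sentence",
    "romanized sentence",
    "language",
    "script",
    "source",
)

# Assumption: identifiers look like "hin1" (3 lowercase letters, then digits).
_ID_PATTERN = re.compile(r"([a-z]{3})(\d+)")


class ParsedId(NamedTuple):
    lang_code: str
    number: int


def parse_unique_identifier(identifier: str) -> ParsedId:
    """Split an identifier such as "hin1" into ("hin", 1)."""
    match = _ID_PATTERN.fullmatch(identifier)
    if match is None:
        raise ValueError(f"unexpected identifier format: {identifier!r}")
    return ParsedId(match.group(1), int(match.group(2)))


def missing_fields(record: dict) -> List[str]:
    """Return the documented fields that are absent from a record."""
    return [name for name in EXPECTED_FIELDS if name not in record]


# Record mirroring the Hindi sample in this card (sentences left empty).
sample = {
    "unique_identifier": "hin1",
    "native sentence": "",
    "romanized sentence": "",
    "language": "Hindi",
    "script": "Devanagari",
    "source": "Dakshina",
}

parsed = parse_unique_identifier(sample["unique_identifier"])
print(parsed.lang_code, parsed.number, missing_fields(sample))
```

Raising on an unexpected identifier, rather than returning a default, makes format deviations visible early when iterating over the full test set.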
For created data sources, depending on the destination/sampling method of a pair in a language, it will be one of:
Subset | asm | ben | brx | guj | hin | kan | kas (Perso-Arabic) | kas (Devanagari) | kok | mai | mal | mni (Bengali) | mni (Meetei Mayek) | mar | nep | ori | pan | san | snd | tam | tel | urd |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Native | 1012 | 5606 | 1500 | 5797 | 5617 | 5859 | 2511 | 1012 | 1500 | 2512 | 5628 | 1012 | 1500 | 5611 | 2512 | 1012 | 5776 | 2510 | 2512 | 5893 | 5779 | 5751 |
Romanized | 512 | 4595 | 433 | 4785 | 4606 | 4848 | 450 | 0 | 444 | 439 | 4617 | 0 | 442 | 4603 | 423 | 512 | 4765 | 448 | 0 | 4881 | 4767 | 4741 |
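The counts in the table can be summarized programmatically. The sketch below copies the table values into two dictionaries, totals each subset, and lists the columns that have no romanized sentences. The dictionary and function names are illustrative, and the Sindhi column is keyed by the language code `snd` used in the language list.

```python
from typing import Dict, List

# Sentence counts copied from the statistics table, keyed by column label.
NATIVE: Dict[str, int] = {
    "asm": 1012, "ben": 5606, "brx": 1500, "guj": 5797, "hin": 5617,
    "kan": 5859, "kas (Perso-Arabic)": 2511, "kas (Devanagari)": 1012,
    "kok": 1500, "mai": 2512, "mal": 5628, "mni (Bengali)": 1012,
    "mni (Meetei Mayek)": 1500, "mar": 5611, "nep": 2512, "ori": 1012,
    "pan": 5776, "san": 2510, "snd": 2512, "tam": 5893, "tel": 5779,
    "urd": 5751,
}
ROMANIZED: Dict[str, int] = {
    "asm": 512, "ben": 4595, "brx": 433, "guj": 4785, "hin": 4606,
    "kan": 4848, "kas (Perso-Arabic)": 450, "kas (Devanagari)": 0,
    "kok": 444, "mai": 439, "mal": 4617, "mni (Bengali)": 0,
    "mni (Meetei Mayek)": 442, "mar": 4603, "nep": 423, "ori": 512,
    "pan": 4765, "san": 448, "snd": 0, "tam": 4881, "tel": 4767,
    "urd": 4741,
}


def columns_without_romanized(romanized: Dict[str, int]) -> List[str]:
    """Return the column labels whose romanized subset is empty."""
    return [label for label, count in romanized.items() if count == 0]


total_native = sum(NATIVE.values())
total_romanized = sum(ROMANIZED.values())

print("native sentences:", total_native)
print("romanized sentences:", total_romanized)
print("no romanized data:", columns_without_romanized(ROMANIZED))
```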
See the paper for details: Bhasha-Abhijnaanam: Native-script and romanized Language Identification for 22 Indic languages
Who are the source language producers? [More Information Needed]
Who are the annotators? See the paper: Bhasha-Abhijnaanam: Native-script and romanized Language Identification for 22 Indic languages
This data is released under the following licensing scheme:
CC0 License Statement
@misc{madhani2023bhashaabhijnaanam,
  title={Bhasha-Abhijnaanam: Native-script and romanized Language Identification for 22 Indic languages},
  author={Yash Madhani and Mitesh M. Khapra and Anoop Kunchukuttan},
  year={2023},
  eprint={2305.15814},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}