Dataset:
ai4bharat/Aksharantar
Aksharantar is the largest publicly available transliteration dataset for 20 Indic languages. The corpus has 26M Indic language-English transliteration pairs.
Assamese (asm) | Hindi (hin) | Maithili (mai) | Marathi (mar) | Punjabi (pan) | Tamil (tam) |
Bengali (ben) | Kannada (kan) | Malayalam (mal) | Nepali (nep) | Sanskrit (san) | Telugu (tel) |
Bodo (brx) | Kashmiri (kas) | Manipuri (mni) | Oriya (ori) | Sindhi (snd) | Urdu (urd) |
Gujarati (guj) | Konkani (kok) |
A random sample from the Hindi (hin) training set: { 'unique_identifier': 'hin1241393', 'native word': 'स्वाभिमानिक', 'english word': 'swabhimanik', 'source': 'IndicCorp', 'score': -0.1028788579 }
unique_identifier (string): 3-letter language code followed by a unique number in each set (Train, Test, Val).
native word (string): A word in an Indic language.
english word (string): Transliteration of native word in English (Romanised word).
source (string): Source of the data.
score (num): Character-level log probability of the Indic word given the Roman word, as computed by the IndicXlit model. Only pairs that meet an average threshold of 0.35 are included.
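The field descriptions above can be turned into a small typed record. The following is a minimal sketch, not part of the dataset's own tooling: the class name `TransliterationPair` and the helper `parse_record` are hypothetical, and it assumes that `score` may be absent for some pairs and that identifiers use a lowercase 3-letter code followed by digits, as in the sample.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed identifier shape: 3 lowercase letters followed by digits, e.g. "hin1241393".
_ID_PATTERN = re.compile(r"[a-z]{3}\d+")


@dataclass(frozen=True)
class TransliterationPair:
    """One record, using the field names shown in the sample (hypothetical class)."""
    unique_identifier: str
    native_word: str
    english_word: str
    source: str
    score: Optional[float]  # assumption: may be missing for some pairs

    @property
    def language_code(self) -> str:
        # The first three characters of the identifier are the language code.
        return self.unique_identifier[:3]


def parse_record(record: dict) -> TransliterationPair:
    """Validate the identifier and map the raw keys onto the typed record."""
    uid = record["unique_identifier"]
    if not _ID_PATTERN.fullmatch(uid):
        raise ValueError(f"unexpected identifier format: {uid!r}")
    raw_score = record.get("score")
    return TransliterationPair(
        unique_identifier=uid,
        native_word=record["native word"],
        english_word=record["english word"],
        source=record["source"],
        score=None if raw_score is None else float(raw_score),
    )


sample = {
    "unique_identifier": "hin1241393",
    "native word": "स्वाभिमानिक",
    "english word": "swabhimanik",
    "source": "IndicCorp",
    "score": -0.1028788579,
}
pair = parse_record(sample)
print(pair.language_code, pair.english_word, pair.score)
```

Note that the raw keys `native word` and `english word` contain spaces, so they are accessed by string key rather than as attributes.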
For created data sources, depending on the destination/sampling method of a pair in a language, it will be one of:
Subset | asm-en | ben-en | brx-en | guj-en | hin-en | kan-en | kas-en | kok-en | mai-en | mal-en | mni-en | mar-en | nep-en | ori-en | pan-en | san-en | sid-en | tam-en | tel-en | urd-en |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Training | 179K | 1231K | 36K | 1143K | 1299K | 2907K | 47K | 613K | 283K | 4101K | 10K | 1453K | 2397K | 346K | 515K | 1813K | 60K | 3231K | 2430K | 699K |
Validation | 4K | 11K | 3K | 12K | 6K | 7K | 4K | 4K | 4K | 8K | 3K | 8K | 3K | 3K | 9K | 3K | 8K | 9K | 8K | 12K |
Test | 5531 | 5009 | 4136 | 7768 | 5693 | 6396 | 7707 | 5093 | 5512 | 6911 | 4925 | 6573 | 4133 | 4256 | 4316 | 5334 | - | 4682 | 4567 | 4463 |
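As a quick sanity check on the split sizes, the Training row of the table above can be summed per language. This is only an illustrative sketch: the dictionary is copied by hand from the table, the keys are the subset prefixes from its header row, and the counts are rounded to thousands, so the total is approximate.

```python
# Training-pair counts in thousands, copied from the "Training" row of the table.
TRAIN_PAIRS_K = {
    "asm": 179, "ben": 1231, "brx": 36, "guj": 1143, "hin": 1299,
    "kan": 2907, "kas": 47, "kok": 613, "mai": 283, "mal": 4101,
    "mni": 10, "mar": 1453, "nep": 2397, "ori": 346, "pan": 515,
    "san": 1813, "sid": 60, "tam": 3231, "tel": 2430, "urd": 699,
}

total_k = sum(TRAIN_PAIRS_K.values())
largest = max(TRAIN_PAIRS_K, key=TRAIN_PAIRS_K.get)
smallest = min(TRAIN_PAIRS_K, key=TRAIN_PAIRS_K.get)

print(f"languages: {len(TRAIN_PAIRS_K)}")
print(f"total training pairs: about {total_k / 1000:.1f}M")
print(f"largest subset: {largest} ({TRAIN_PAIRS_K[largest]}K)")
print(f"smallest subset: {smallest} ({TRAIN_PAIRS_K[smallest]}K)")
```

The spread is wide: the largest training subset is several hundred times the size of the smallest, which is worth keeping in mind when sampling batches across languages.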
Information in the paper. Aksharantar: Towards building open transliteration tools for the next billion users
Who are the source language producers? [More Information Needed]
Annotation process: Information in the paper. Aksharantar: Towards building open transliteration tools for the next billion users
Who are the annotators? Information in the paper. Aksharantar: Towards building open transliteration tools for the next billion users
This data is released under the following licensing scheme:
CC-BY License
CC0 License Statement
@misc{madhani2022aksharantar,
  title={Aksharantar: Towards Building Open Transliteration Tools for the Next Billion Users},
  author={Yash Madhani and Sushane Parthan and Priyanka Bedekar and Ruchi Khapra and Anoop Kunchukuttan and Pratyush Kumar and Mitesh Shantadevi Khapra},
  year={2022},
  eprint={},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}