Dataset:

ai4bharat/Aksharantar


Dataset Card for Aksharantar

Dataset Summary

Aksharantar is the largest publicly available transliteration dataset for 20 Indic languages. The corpus contains 26 million Indic-language-to-English transliteration pairs.

Supported Tasks and Leaderboards

[More Information Needed]

Languages

Assamese (asm), Bengali (ben), Bodo (brx), Gujarati (guj), Hindi (hin),
Kannada (kan), Kashmiri (kas), Konkani (kok), Maithili (mai), Malayalam (mal),
Manipuri (mni), Marathi (mar), Nepali (nep), Oriya (ori), Punjabi (pan),
Sanskrit (san), Sindhi (snd), Tamil (tam), Telugu (tel), Urdu (urd)

Dataset Structure

Data Instances

A random sample from the Hindi (hin) training set:

{
  'unique_identifier': 'hin1241393',
  'native word': 'स्वाभिमानिक',
  'english word': 'swabhimanik',
  'source': 'IndicCorp',
  'score': -0.1028788579
}
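
The dataset can be loaded with the Hugging Face datasets library. The following is a minimal sketch, assuming the per-language configuration names match the 3-letter codes above (e.g. "hin") and the splits are named as in this card; confirm the exact names on the dataset's Hub page.

from datasets import load_dataset

# Configuration name "hin" and split name "train" are assumptions based on the
# language codes and split names listed in this card; confirm them on the Hub page.
dataset = load_dataset("ai4bharat/Aksharantar", "hin")

# Field names follow the card; note that several of them contain spaces.
sample = dataset["train"][0]
print(sample["unique_identifier"])
print(sample["native word"], "->", sample["english word"])
print("source:", sample["source"], "score:", sample["score"])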

Data Fields

  • unique_identifier (string): The 3-letter language code followed by a number that is unique within each split (Train, Test, Val).

  • native word (string): A word in the Indic language, written in its native script.

  • english word (string): The Romanised (English-script) transliteration of the native word.

  • source (string): Source of the data.

  • score (num): Character-level log probability of the Indic word given the Roman word, as assigned by the IndicXlit model. Only pairs whose average score meets the 0.35 threshold are included; a filtering sketch is given after the list below.

    For created data sources, the source field takes one of the following values, depending on the sub-dataset or sampling method a pair comes from:

    • Dakshina Dataset
    • IndicCorp
    • Samanantar
    • Wikidata
    • Existing sources
    • Named Entities Indian (AK-NEI)
    • Named Entities Foreign (AK-NEF)
    • Data from the Uniform Sampling method (Ak-Uni)
    • Data from the Most Frequent Words sampling method (Ak-Freq)
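
As referenced in the score field above, a score-based filtering sketch follows. The keep_high_confidence helper and its default cutoff are illustrative assumptions, not the dataset's official filtering procedure; the card quotes a 0.35 threshold, and the example score above is negative, as expected for a log probability, so treat the cutoff as a tunable knob.

from datasets import load_dataset

# keep_high_confidence is a hypothetical helper, not part of the dataset or of the
# datasets library. The -0.35 default cutoff is an assumption; tune it for your use case.
def keep_high_confidence(split, min_score=-0.35):
    return split.filter(
        lambda ex: ex["score"] is not None and ex["score"] >= min_score
    )

hin_train = load_dataset("ai4bharat/Aksharantar", "hin")["train"]
filtered = keep_high_confidence(hin_train)
print(f"kept {len(filtered)} of {len(hin_train)} pairs")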

Data Splits

Subset    Training    Validation    Test
asm-en    179K        4K            5531
ben-en    1231K       11K           5009
brx-en    36K         3K            4136
guj-en    1143K       12K           7768
hin-en    1299K       6K            5693
kan-en    2907K       7K            6396
kas-en    47K         4K            7707
kok-en    613K        4K            5093
mai-en    283K        4K            5512
mal-en    4101K       8K            6911
mni-en    10K         3K            4925
mar-en    1453K       8K            6573
nep-en    2397K       3K            4133
ori-en    346K        3K            4256
pan-en    515K        9K            4316
san-en    1813K       3K            5334
sid-en    60K         8K            -
tam-en    3231K       9K            4682
tel-en    2430K       8K            4567
urd-en    699K        12K           4463
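
For multilingual experiments, several subsets can be combined into one training set. This is a sketch assuming the configuration names follow the 3-letter codes above and that the subsets share the same schema, as the field descriptions in this card suggest.

from datasets import load_dataset, concatenate_datasets

# Configuration names are assumed to match the 3-letter codes above; the subsets
# must share the same schema for concatenate_datasets to work.
codes = ["hin", "mar", "guj"]
train_sets = [load_dataset("ai4bharat/Aksharantar", code)["train"] for code in codes]
multilingual_train = concatenate_datasets(train_sets)
print(len(multilingual_train), "training pairs from", codes)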

Dataset Creation

Details are provided in the paper: Aksharantar: Towards Building Open Transliteration Tools for the Next Billion Users.

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

Details are provided in the paper: Aksharantar: Towards Building Open Transliteration Tools for the Next Billion Users.

Who are the source language producers?

[More Information Needed]

Annotations

Details are provided in the paper: Aksharantar: Towards Building Open Transliteration Tools for the Next Billion Users.

Annotation process

Details are provided in the paper: Aksharantar: Towards Building Open Transliteration Tools for the Next Billion Users.

Who are the annotators?

Details are provided in the paper: Aksharantar: Towards Building Open Transliteration Tools for the Next Billion Users.

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

This data is released under the following licensing scheme:

  • Manually collected data: Released under CC-BY license.
  • Mined dataset (from Samanantar and IndicCorp): Released under CC0 license.
  • Existing sources: Released under CC0 license.

CC-BY License

CC0 License Statement

  • We do not own any of the text from which this data has been extracted.
  • We license the actual packaging of the mined data under the Creative Commons CC0 license (“no rights reserved”).
  • To the extent possible under law, AI4Bharat has waived all copyright and related or neighboring rights to Aksharantar manually collected data and existing sources.
  • This work is published from: India.

Citation Information

@misc{madhani2022aksharantar,
      title={Aksharantar: Towards Building Open Transliteration Tools for the Next Billion Users}, 
      author={Yash Madhani and Sushane Parthan and Priyanka Bedekar and Ruchi Khapra and Anoop Kunchukuttan and Pratyush Kumar and Mitesh Shantadevi Khapra},
      year={2022},
      eprint={},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contributions