Dataset:

ai4bharat/Bhasha-Abhijnaanam

Tasks:

Languages:

Multilinguality:

multilingual

Language creators:

crowdsourced, expert-generated, machine-generated

Source datasets:

original

Preprint:

arxiv:2305.15814

License:

cc0-1.0

Dataset Card for Bhasha-Abhijnaanam

Dataset Summary

Bhasha-Abhijnaanam is a language identification test set covering both native-script and romanized text in 22 Indic languages.
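Because this is a test set for language identification, a typical use is scoring a classifier per language. The following is a minimal sketch, not the authors' evaluation code: `per_language_accuracy` and the `predict` callable are hypothetical names, the field names follow the Data Fields section of this card, and the toy records are made-up placeholders rather than real dataset rows.

```python
from collections import defaultdict


def per_language_accuracy(records, predict, text_field="native sentence"):
    """Return {language: accuracy} for a predictor over test records.

    `predict` is any callable mapping a sentence to a language name
    (hypothetical; plug in your own language identification model).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        gold = rec["language"]
        total[gold] += 1
        if predict(rec[text_field]) == gold:
            correct[gold] += 1
    return {lang: correct[lang] / total[lang] for lang in total}


# Made-up placeholder records, for illustration only.
toy_records = [
    {"native sentence": "placeholder one", "language": "Hindi"},
    {"native sentence": "placeholder two", "language": "Hindi"},
    {"native sentence": "placeholder three", "language": "Tamil"},
]


def always_hindi(sentence):
    """Trivial baseline that ignores its input."""
    return "Hindi"


scores = per_language_accuracy(toy_records, always_hindi)
```

Passing `text_field="romanized sentence"` scores the same predictor on the romanized side of each record.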

Supported Tasks and Leaderboards

[More Information Needed]

Languages

Assamese (asm)	Hindi (hin)	Maithili (mai)	Nepali (nep)	Sanskrit (san)	Tamil (tam)
Bengali (ben)	Kannada (kan)	Malayalam (mal)	Oriya (ori)	Santali (sat)	Telugu (tel)
Bodo (brx)	Kashmiri (kas)	Manipuri (mni)	Punjabi (pan)	Sindhi (snd)	Urdu (urd)
Gujarati (guj)	Konkani (kok)	Marathi (mar)

Dataset Structure

Data Instances

A random sample from the Hindi (hin) test set:
{
    "unique_identifier": "hin1",
    "native sentence": "",
    "romanized sentence": "",
    "language": "Hindi",
    "script": "Devanagari",
    "source": "Dakshina"
}

Data Fields

unique_identifier (string): A 3-letter language code followed by a number that is unique within the test set.
native sentence (string): A sentence in an Indic language, written in its native script.
romanized sentence (string): The transliteration of the native sentence into Latin script (romanized sentence).
language (string): The language of the native sentence.
script (string): The script in which the native sentence is written.
source (string): The source of the data.

The source depends on how a sentence pair was collected or sampled for a given language, and is one of:
- Dakshina Dataset
- Flores-200
- Manually Romanized
- Manually generated

Data Splits

Subset	asm	ben	brx	guj	hin	kan	kas (Perso-Arabic)	kas (Devanagari)	kok	mai	mal	mni (Bengali)	mni (Meetei Mayek)	mar	nep	ori	pan	san	snd	tam	tel	urd
Native	1012	5606	1500	5797	5617	5859	2511	1012	1500	2512	5628	1012	1500	5611	2512	1012	5776	2510	2512	5893	5779	5751
Romanized	512	4595	433	4785	4606	4848	450	0	444	439	4617	0	442	4603	423	512	4765	448	0	4881	4767	4741
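To work with these counts, the table can be transcribed into a small mapping. This is a sketch with the numbers copied from the table above. The names `SPLITS`, `native_total`, `romanized_total`, and `native_only` are illustrative, and the Sindhi column is keyed by its language code `snd`.

```python
# Counts copied from the Data Splits table: subset -> (native, romanized).
SPLITS = {
    "asm": (1012, 512),
    "ben": (5606, 4595),
    "brx": (1500, 433),
    "guj": (5797, 4785),
    "hin": (5617, 4606),
    "kan": (5859, 4848),
    "kas (Perso-Arabic)": (2511, 450),
    "kas (Devanagari)": (1012, 0),
    "kok": (1500, 444),
    "mai": (2512, 439),
    "mal": (5628, 4617),
    "mni (Bengali)": (1012, 0),
    "mni (Meetei Mayek)": (1500, 442),
    "mar": (5611, 4603),
    "nep": (2512, 423),
    "ori": (1012, 512),
    "pan": (5776, 4765),
    "san": (2510, 448),
    "snd": (2512, 0),  # Sindhi
    "tam": (5893, 4881),
    "tel": (5779, 4767),
    "urd": (5751, 4741),
}

# Totals across all subsets.
native_total = sum(native for native, _ in SPLITS.values())
romanized_total = sum(romanized for _, romanized in SPLITS.values())

# Subsets that have no romanized sentences.
native_only = [name for name, (_, romanized) in SPLITS.items() if romanized == 0]
```

Subsets listed in `native_only` have a romanized count of 0 in the table, so a romanized evaluation has nothing to score for them.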

Dataset Creation

Details are given in the paper "Bhasha-Abhijnaanam: Native-script and romanized Language Identification for 22 Indic languages" (arXiv:2305.15814).

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

Details are given in the paper "Bhasha-Abhijnaanam: Native-script and romanized Language Identification for 22 Indic languages" (arXiv:2305.15814).

Who are the source language producers?

[More Information Needed]

Annotations

Details are given in the paper "Bhasha-Abhijnaanam: Native-script and romanized Language Identification for 22 Indic languages" (arXiv:2305.15814).

Who are the annotators?

Details are given in the paper "Bhasha-Abhijnaanam: Native-script and romanized Language Identification for 22 Indic languages" (arXiv:2305.15814).

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

This data is released under the following licensing scheme:

Manually collected data: Released under CC0 license.

CC0 License Statement

We do not own any of the text from which this data has been extracted.
We license the actual packaging of manually collected data under the Creative Commons CC0 license (“no rights reserved”).
To the extent possible under law, AI4Bharat has waived all copyright and related or neighboring rights to Bhasha-Abhijnaanam manually collected data and existing sources.
This work is published from: India.

Citation Information

@misc{madhani2023bhashaabhijnaanam,
      title={Bhasha-Abhijnaanam: Native-script and romanized Language Identification for 22 Indic languages}, 
      author={Yash Madhani and Mitesh M. Khapra and Anoop Kunchukuttan},
      year={2023},
      eprint={2305.15814},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contributions

Author:

ai4bharat

Dataset size:

10.22 MB