Dataset: ai4bharat/naamapadam
Task: Token Classification
Multilinguality: multilingual
Size: 1M<n<10M
Language creators: machine-generated
Annotation creators: machine-generated
Source datasets: original
Paper: arxiv:2212.10168
License: cc0-1.0

Naamapadam is the largest publicly available named entity annotated dataset for 11 Indic languages. The corpus was created by projecting named entities from the English side to the Indic language side of English-Indic parallel corpora. The dataset additionally contains manually labelled test sets for 8 Indic languages, each containing 500-1000 sentences.
Tasks: NER on Indian languages.
Leaderboards: There is currently no leaderboard for this dataset.
A sample data instance:

```python
{
  'words': ['उन्हेनें', 'शिकांगों', 'में', 'बोरोडिन', 'की', 'पत्नी', 'को', 'तथा', 'वाशिंगटन', 'में', 'रूसी', 'व्यापार', 'संघ', 'को', 'पैसे', 'भेजे', '।'],
  'ner': [0, 3, 0, 1, 0, 0, 0, 0, 3, 0, 5, 6, 6, 0, 0, 0, 0]
}
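The integer tags encode BIO labels over PER, LOC, and ORG entities. The index-to-label order in the sketch below is inferred from this instance and is an assumption; the authoritative mapping is the ClassLabel feature shipped with the dataset (see the loading example further down). A minimal sketch for reading the instance above:

```python
# Hypothetical index-to-label mapping, inferred from the instance above;
# verify against dataset.features['ner'].feature.names after loading.
NER_LABELS = ['O', 'B-PER', 'I-PER', 'B-LOC', 'I-LOC', 'B-ORG', 'I-ORG']

words = ['उन्हेनें', 'शिकांगों', 'में', 'बोरोडिन', 'की', 'पत्नी', 'को', 'तथा',
         'वाशिंगटन', 'में', 'रूसी', 'व्यापार', 'संघ', 'को', 'पैसे', 'भेजे', '।']
ner = [0, 3, 0, 1, 0, 0, 0, 0, 3, 0, 5, 6, 6, 0, 0, 0, 0]

# Print each word next to its (assumed) BIO tag.
for word, tag in zip(words, ner):
    print(f'{word}\t{NER_LABELS[tag]}')
```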
The split statistics below are to be updated; see the paper for the correct numbers.
Language | Train | Validation | Test |
---|---|---|---|
as | 10266 | 52 | 51 |
bn | 961679 | 4859 | 607 |
gu | 472845 | 2389 | 50 |
hi | 985787 | 13460 | 437 |
kn | 471763 | 2381 | 1019 |
ml | 716652 | 3618 | 974 |
mr | 455248 | 2300 | 1080 |
or | 196793 | 993 | 994 |
pa | 463534 | 2340 | 2342 |
ta | 497882 | 2795 | 49 |
te | 507741 | 2700 | 53 |
You should have the 'datasets' package installed to be able to use the :rocket: HuggingFace datasets repository. Please install it via pip:

```
pip install datasets
```

To load the dataset, use:

```python
from datasets import load_dataset

naamapadam = load_dataset('ai4bharat/naamapadam')
```
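To work with a single language, you can request it by config name. The snippet below is a sketch under two assumptions: that per-language configs such as 'hi' are available, and that the 'ner' column is a Sequence of ClassLabel exposing the string tag names.

```python
from datasets import load_dataset

# Load the Hindi portion (config name 'hi' is an assumption; check the Hub
# page for the exact list of available configs).
hi_data = load_dataset('ai4bharat/naamapadam', 'hi')

# Recover the string tag names from the ClassLabel feature and decode one sample.
label_names = hi_data['train'].features['ner'].feature.names
sample = hi_data['train'][0]
print(list(zip(sample['words'], (label_names[t] for t in sample['ner']))))
```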
We use the parallel corpus from the Samanantar dataset between English and the 11 major Indian languages to create the NER dataset. We annotate the English portion of the parallel corpus with an existing state-of-the-art NER model. We then use word-level alignments learned from the parallel corpus to project the entity labels from English to the Indian language side.
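For intuition, here is a simplified sketch of the projection step: given BIO tags predicted on the English tokens and word-level alignment pairs, each entity tag is copied onto the aligned Indic token. The function name, the toy sentence, and the alignments are illustrative only; the actual pipeline described in the paper handles alignment noise and entity-span consistency more carefully.

```python
def project_ner_tags(src_tags, alignments, tgt_len):
    """Copy BIO tags from English tokens onto aligned Indic tokens.

    src_tags   -- BIO tags predicted for the English tokens
    alignments -- (english_index, indic_index) word-alignment pairs
    tgt_len    -- number of tokens on the Indic side
    """
    tgt_tags = ['O'] * tgt_len
    for src_idx, tgt_idx in alignments:
        if src_tags[src_idx] != 'O':
            tgt_tags[tgt_idx] = src_tags[src_idx]
    return tgt_tags


# Toy example (sentence and alignments invented for illustration only).
english_tags = ['O', 'O', 'O', 'B-LOC']            # "He went to Washington"
toy_alignments = [(0, 0), (3, 1), (2, 2), (1, 3)]  # English idx -> Indic idx
print(project_ner_tags(english_tags, toy_alignments, tgt_len=4))
# -> ['O', 'B-LOC', 'O', 'O']
```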
Naamapadam was built from the Samanantar dataset for the task of Named Entity Recognition in Indic languages. It was created to provide new resources for Indic languages, which are under-served in Natural Language Processing.
[Needs More Information]
Who are the source language producers?
[Needs More Information]
NER annotations were done following the CoNLL-2003 guidelines.
Who are the annotators?
The annotations for the test set were done by volunteers who are proficient in the respective languages. We would like to thank all the volunteers:
[Needs More Information]
The purpose of this dataset is to provide a large-scale Named Entity Recognition dataset for Indic languages. Since the information (data points) has been obtained from public resources, we do not think there is a negative social impact in releasing this data.
[Needs More Information]
[Needs More Information]
[Needs More Information]
If you are using the Naamapadam corpus, please cite the following article:
```bibtex
@misc{mhaske2022naamapadam,
  doi       = {10.48550/ARXIV.2212.10168},
  url       = {https://arxiv.org/abs/2212.10168},
  author    = {Mhaske, Arnav and Kedia, Harshit and Doddapaneni, Sumanth and Khapra, Mitesh M. and Kumar, Pratyush and Murthy, Rudra and Kunchukuttan, Anoop},
  title     = {Naamapadam: A Large-Scale Named Entity Annotated Data for Indic Languages},
  publisher = {arXiv},
  year      = {2022},
}
```
This work is the outcome of a volunteer effort as part of the AI4Bharat initiative.