Dataset:

ai4bharat/naamapadam

Multilinguality:

multilingual

Size:

1M<n<10M

Language creators:

machine-generated

Annotation creators:

machine-generated

Source datasets:

original

Preprint:

arxiv:2212.10168

License:

cc0-1.0

Dataset Card for naamapadam

Dataset Summary

Naamapadam is the largest publicly available named-entity-annotated dataset for 11 Indic languages. The corpus was created by projecting named entities from the English side to the Indic-language side of an English-Indic parallel corpus. The dataset additionally contains a manually labelled test set for 8 Indic languages containing 500-1000 sentences.

Supported Tasks and Leaderboards

Tasks: NER on Indian languages.

Leaderboards: Currently there is no Leaderboard for this dataset.

Languages

  • Assamese (as)
  • Bengali (bn)
  • Gujarati (gu)
  • Kannada (kn)
  • Hindi (hi)
  • Malayalam (ml)
  • Marathi (mr)
  • Oriya (or)
  • Punjabi (pa)
  • Tamil (ta)
  • Telugu (te)

Dataset Structure

Data Instances

    {
        'words': ['उन्हेनें', 'शिकांगों', 'में', 'बोरोडिन', 'की', 'पत्नी', 'को', 'तथा', 'वाशिंगटन', 'में', 'रूसी', 'व्यापार', 'संघ', 'को', 'पैसे', 'भेजे', '।'],
        'ner': [0, 3, 0, 1, 0, 0, 0, 0, 3, 0, 5, 6, 6, 0, 0, 0, 0],
    }

Data Fields

  • words : the raw tokens of the sentence.
  • ner : the NER tags, given as one integer id per token.
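The integer ids in `ner` can be turned back into tag strings. The sketch below assumes the usual CoNLL-style ordering `O, B-PER, I-PER, B-LOC, I-LOC, B-ORG, I-ORG`. That ordering is not stated in this card; it is only an assumption that happens to agree with the example instance above (the two city names carry id 3, the person name id 1, and the three-token organisation ids 5, 6, 6). The helper name `decode` is hypothetical. The authoritative mapping should be read from the dataset's own feature metadata.

```python
# Assumed id-to-tag mapping (NOT stated in this card); check it against the
# dataset's feature metadata before relying on it.
ASSUMED_TAGS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]


def decode(words, ner_ids, tags=ASSUMED_TAGS):
    """Pair each token with the tag string for its integer id."""
    if len(words) != len(ner_ids):
        raise ValueError("words and ner must have the same length")
    return [(word, tags[tag_id]) for word, tag_id in zip(words, ner_ids)]


# The example instance from this card.
example = {
    "words": ['उन्हेनें', 'शिकांगों', 'में', 'बोरोडिन', 'की', 'पत्नी', 'को', 'तथा',
              'वाशिंगटन', 'में', 'रूसी', 'व्यापार', 'संघ', 'को', 'पैसे', 'भेजे', '।'],
    "ner": [0, 3, 0, 1, 0, 0, 0, 0, 3, 0, 5, 6, 6, 0, 0, 0, 0],
}

# Keep only the tokens that are part of an entity.
entities = [(w, t) for w, t in decode(example["words"], example["ner"]) if t != "O"]
for word, tag in entities:
    print(word, tag)
```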

Data Splits

(to be updated, see paper for correct numbers)

Language   Train    Validation   Test
as         10266    52           51
bn         961679   4859         607
gu         472845   2389         50
hi         985787   13460        437
kn         471763   2381         1019
ml         716652   3618         974
mr         455248   2300         1080
or         196793   993          994
pa         463534   2340         2342
ta         497882   2795         49
te         507741   2700         53

Usage

You need the 'datasets' package installed to use the HuggingFace datasets repository. Install it via pip:

    pip install datasets

To use the dataset, please use:

    from datasets import load_dataset
    hiner = load_dataset('ai4bharat/naamapadam')
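To load a single language, a small wrapper along these lines may help. It is a sketch under one assumption that this card does not confirm: that each language is exposed as a separate configuration named by its language code from the Languages list. The helper `load_language` and the constant `LANGUAGE_CODES` are hypothetical names introduced for illustration.

```python
# Language codes taken from the "Languages" section of this card.
LANGUAGE_CODES = ["as", "bn", "gu", "kn", "hi", "ml", "mr", "or", "pa", "ta", "te"]


def load_language(code, split=None):
    """Load one language of naamapadam.

    Assumes (not confirmed by this card) that the language code is a valid
    configuration name for the dataset.
    """
    if code not in LANGUAGE_CODES:
        raise ValueError(f"unknown language code: {code!r}")
    # Imported lazily so the validation above works without the package.
    from datasets import load_dataset
    return load_dataset("ai4bharat/naamapadam", code, split=split)


# Example usage (downloads data, so it is left commented out):
# hindi_test = load_language("hi", split="test")
# print(hindi_test[0]["words"], hindi_test[0]["ner"])
```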

Dataset Creation

We use the parallel corpus from the Samanantar dataset between English and the 11 major Indian languages to create the NER dataset. We annotate the English portion of the parallel corpus with an existing state-of-the-art NER model. We then use word-level alignments learned from the parallel corpus to project the entity labels from English to the Indian language.
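The projection step can be illustrated with a small sketch. This is a simplified illustration, not the authors' pipeline: the function names `extract_spans` and `project_tags` are hypothetical, the alignment is assumed to be given as (English index, target index) pairs, and each projected entity is assumed to cover the contiguous range between its first and last aligned target token.

```python
def extract_spans(tags):
    """Return (start, end_exclusive, entity_type) for each entity in a BIO sequence."""
    spans = []
    start, etype = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    return spans


def project_tags(src_tags, alignment, tgt_len):
    """Project BIO tags from the source sentence onto a target sentence.

    alignment: iterable of (source_index, target_index) word-alignment pairs.
    Entities with no aligned target token are dropped.
    """
    alignment = list(alignment)
    tgt_tags = ["O"] * tgt_len
    for start, end, etype in extract_spans(src_tags):
        aligned = sorted({t for s, t in alignment if start <= s < end})
        if not aligned:
            continue
        first, last = aligned[0], aligned[-1]
        for j in range(first, last + 1):
            tgt_tags[j] = ("B-" if j == first else "I-") + etype
    return tgt_tags


# Toy example: a three-token English sentence whose first two tokens form an
# organisation, aligned to target positions 1 and 2.
src_tags = ["B-ORG", "I-ORG", "O"]
alignment = [(0, 1), (1, 2), (2, 0)]
print(project_tags(src_tags, alignment, tgt_len=3))
```

Filling the whole range between the first and last aligned token keeps each projected entity contiguous even when the alignment skips a word; a real pipeline would likely add filtering for noisy alignments.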

Curation Rationale

Naamapadam was built from the Samanantar dataset for the task of Named Entity Recognition in Indic languages. It was introduced to provide new resources for Indic languages, which have been under-served in Natural Language Processing.

Source Data

Samanantar dataset

Initial Data Collection and Normalization

[Needs More Information]

Who are the source language producers?

[Needs More Information]

Annotations

Annotation process

NER annotations were done following the CoNLL-2003 guidelines.

Who are the annotators?

The annotations for the test set were done by volunteers who are proficient in the respective languages. We would like to thank all the volunteers:

  • Anil Mhaske
  • Anoop Kunchukuttan
  • Archana Mhaske
  • Arnav Mhaske
  • Gowtham Ramesh
  • Harshit Kedia
  • Nitin Kedia
  • Rudramurthy V
  • Sangeeta Rajagopal
  • Sumanth Doddapaneni
  • Vindhya DS
  • Yash Madhani
  • Kabir Ahuja
  • Shallu Rani
  • Armin Virk

Personal and Sensitive Information

[Needs More Information]

Considerations for Using the Data

Social Impact of Dataset

The purpose of this dataset is to provide a large-scale Named Entity Recognition dataset for Indic languages. Since the information (data points) has been obtained from public resources, we do not think there is a negative social impact in releasing this data.

Discussion of Biases

[Needs More Information]

Other Known Limitations

[Needs More Information]

Additional Information

Dataset Curators

[Needs More Information]

Licensing Information

CC0 License Statement

  • We do not own any of the text from which this data has been extracted.
  • We license the actual packaging of the mined data under the Creative Commons CC0 license (“no rights reserved”).
  • To the extent possible under law, AI4Bharat has waived all copyright and related or neighboring rights to Naamapadam manually collected data and existing sources.
  • This work is published from: India.

Citation Information

If you are using the Naamapadam corpus, please cite the following article:

@misc{mhaske2022naamapadam,
  doi = {10.48550/ARXIV.2212.10168},
  url = {https://arxiv.org/abs/2212.10168},
  author = {Mhaske, Arnav and Kedia, Harshit and Doddapaneni, Sumanth and Khapra, Mitesh M. and Kumar, Pratyush and Murthy, Rudra and Kunchukuttan, Anoop},
  title = {Naamapadam: A Large-Scale Named Entity Annotated Data for Indic Languages},
  publisher = {arXiv},
  year = {2022},
}

Contributors

This work is the outcome of a volunteer effort as part of the AI4Bharat initiative.

Contact

  • Anoop Kunchukuttan (anoop.kunchukuttan@gmail.com)
  • Rudra Murthy V (rmurthyv@in.ibm.com)