Datasets:

cfilt/HiNER-collapsed

Languages:

hi

Multilinguality:

monolingual

Size Categories:

100K<n<1M

Language Creators:

expert-generated

Annotations Creators:

expert-generated

Source Datasets:

original

ArXiv:

arxiv:2204.13743

Dataset Card for HiNER-original

Dataset Summary

This dataset was created for the fundamental NLP task of Named Entity Recognition for the Hindi language at CFILT Lab, IIT Bombay. We gathered the dataset from various government information webpages and manually annotated these sentences as a part of our data collection strategy.

Note: The dataset contains sentences from ILCI and other sources. The ILCI dataset requires a license from the Indian Language Consortium, so we do not distribute the ILCI portion of the data. Please email us with proof of ILCI data acquisition to obtain the full dataset.

Supported Tasks and Leaderboards

Named Entity Recognition

Languages

Hindi

Dataset Structure

Data Instances

{'id': '0', 'tokens': ['प्राचीन', 'समय', 'में', 'उड़ीसा', 'को', 'कलिंग', 'के', 'नाम', 'से', 'जाना', 'जाता', 'था', '।'], 'ner_tags': [0, 0, 0, 3, 0, 3, 0, 0, 0, 0, 0, 0, 0]}

Data Fields

  • id : The ID of the data point.
  • tokens : The tokens of the sentence, as a list of strings.
  • ner_tags : The NER tag of each token, as a list of integer tag IDs aligned with tokens.
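
Since tokens and ner_tags are parallel lists, a record can be read by pairing them position by position. The snippet below is a minimal sketch using the instance shown under Data Instances. The helper name tagged_tokens is hypothetical, and treating tag ID 0 as "not part of any entity" is an assumption, since this card does not list the tag names.

```python
# The instance shown under "Data Instances", copied as-is.
example = {
    "id": "0",
    "tokens": ["प्राचीन", "समय", "में", "उड़ीसा", "को", "कलिंग", "के",
               "नाम", "से", "जाना", "जाता", "था", "।"],
    "ner_tags": [0, 0, 0, 3, 0, 3, 0, 0, 0, 0, 0, 0, 0],
}


def tagged_tokens(record, outside_id=0):
    """Return (token, tag_id) pairs whose tag differs from outside_id.

    outside_id=0 is an assumption: the card does not say which ID
    marks tokens that lie outside any named entity.
    """
    tokens, tags = record["tokens"], record["ner_tags"]
    if len(tokens) != len(tags):
        raise ValueError("tokens and ner_tags must have the same length")
    return [(tok, tag) for tok, tag in zip(tokens, tags) if tag != outside_id]


# Prints the two tokens that carry a non-zero tag in the sample record.
print(tagged_tokens(example))
```

Once the dataset is loaded, the tag names that correspond to these integer IDs should be looked up in the dataset's features rather than guessed.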

Data Splits

            Train   Valid   Test
original    76025   10861   21722
collapsed   76025   10861   21722
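
As a quick check on the numbers above, the share of each split can be worked out directly from the sentence counts. This is a small sketch; the dictionary keys simply mirror the column names in the table.

```python
# Sentence counts per split, taken from the table above
# (identical for the original and collapsed variants).
split_sizes = {"train": 76025, "valid": 10861, "test": 21722}

total = sum(split_sizes.values())
shares = {name: 100 * n / total for name, n in split_sizes.items()}

for name, n in split_sizes.items():
    print(f"{name:<6} {n:>6}  {shares[name]:5.1f}%")
print(f"{'total':<6} {total:>6}")
```

The counts sum to 108608 sentences, which works out to roughly a 70/10/20 train/valid/test split.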

About

This repository contains the Hindi Named Entity Recognition dataset (HiNER), published at the Language Resources and Evaluation Conference (LREC) in 2022. A pre-print is available on arXiv (arXiv:2204.13743).

Recent Updates

  • Version 0.0.5: HiNER initial release

Usage

You need the 'datasets' package installed to use the HuggingFace datasets repository. Install it via pip:

    pip install datasets

To use the original dataset with all the tags, please use:

    from datasets import load_dataset
    hiner = load_dataset('cfilt/HiNER-original')

To use the collapsed dataset with only PER, LOC, and ORG tags, please use:

    from datasets import load_dataset
    hiner = load_dataset('cfilt/HiNER-collapsed')

The dataset files in CoNLL format can also be found in this Git repository, under the data folder.
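
For readers working from the CoNLL files instead of the 'datasets' package, a reader along the following lines may be enough. It is a sketch under stated assumptions: one token per line with the tag in the last whitespace-separated column, and a blank line between sentences. The card does not spell out the exact column layout, so check it against the files first. The function name read_conll, the tag strings, and the two short fragments in sample are illustrative only.

```python
def read_conll(lines):
    """Group 'token ... tag' lines into sentences.

    Assumed layout (not confirmed by the card): the first column is the
    token, the last column is the tag, and a blank line ends a sentence.
    """
    sentences, tokens, tags = [], [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence, if any.
            if tokens:
                sentences.append({"tokens": tokens, "ner_tags": tags})
                tokens, tags = [], []
            continue
        parts = line.split()
        tokens.append(parts[0])
        tags.append(parts[-1])
    # Flush the last sentence when the input lacks a trailing blank line.
    if tokens:
        sentences.append({"tokens": tokens, "ner_tags": tags})
    return sentences


# Illustrative input only: made-up fragments with hypothetical tag names.
sample = """\
उड़ीसा B-LOC
को O

कलिंग B-LOC
"""

sentences = read_conll(sample.splitlines())
print(len(sentences), sentences[0]["ner_tags"])
```

To read an actual file, pass an open file handle (opened with encoding="utf-8") to read_conll in place of sample.splitlines(). Unlike the 'datasets' version, the tags here stay as strings and are not mapped to integer IDs.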

Model(s)

Our best performing models are hosted on the HuggingFace models repository:

  • HiNER-Collapsed-XLM-R
  • HiNER-Original-XLM-R

Dataset Creation

    Curation Rationale

    HiNER was built from data extracted from various Government of India websites that provide information in Hindi. The dataset was built for the task of Named Entity Recognition, to provide new resources for Hindi, a language that has been under-served in Natural Language Processing.

    Source Data

    Initial Data Collection and Normalization

    HiNER was built from data extracted from various Government of India websites that provide information in Hindi.

    Who are the source language producers?

    Various Government of India webpages

    Annotations

    Annotation process

    This dataset was manually annotated by a single annotator over a long span of time.

    Who are the annotators?

    Pallab Bhattacharjee

    Personal and Sensitive Information

    We ensured that there was no sensitive information present in the dataset. All the data points are curated from publicly available information.

    Considerations for Using the Data

    Social Impact of Dataset

    The purpose of this dataset is to provide a large Hindi Named Entity Recognition dataset. Since the information (data points) has been obtained from public resources, we do not think there is a negative social impact in releasing this data.

    Discussion of Biases

    Any biases contained in the data released by the Indian government are bound to be present in our data.

    Other Known Limitations

    [Needs More Information]

    Additional Information

    Dataset Curators

    Pallab Bhattacharjee

    Licensing Information

    CC-BY-SA 4.0

    Citation Information

    @misc{https://doi.org/10.48550/arxiv.2204.13743,
      doi = {10.48550/ARXIV.2204.13743},
      url = {https://arxiv.org/abs/2204.13743},
      author = {Murthy, Rudra and Bhattacharjee, Pallab and Sharnagat, Rahul and Khatri, Jyotsana and Kanojia, Diptesh and Bhattacharyya, Pushpak},
      keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
      title = {HiNER: A Large Hindi Named Entity Recognition Dataset},
      publisher = {arXiv},
      year = {2022},
      copyright = {Creative Commons Attribution 4.0 International}
    }