This dataset was created for the fundamental NLP task of Named Entity Recognition for the Hindi language at CFILT Lab, IIT Bombay. We gathered the sentences from various government information webpages and manually annotated them as part of our data collection strategy.
Note: The dataset contains sentences from ILCI and other sources. The ILCI dataset requires a license from the Indian Language Consortium, so we do not distribute the ILCI portion of the data. To obtain the full dataset, please send us an email with proof of ILCI data acquisition.
Named Entity Recognition
{'id': '0', 'tokens': ['प्राचीन', 'समय', 'में', 'उड़ीसा', 'को', 'कलिंग', 'के', 'नाम', 'से', 'जाना', 'जाता', 'था', '।'], 'ner_tags': [0, 0, 0, 3, 0, 3, 0, 0, 0, 0, 0, 0, 0]}
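As a rough illustration of how such an instance can be read, the sketch below pairs each token with its tag id and keeps only the tagged tokens. It assumes that tag id 0 marks tokens outside any entity, which is consistent with the example above but is not stated in this card; `entity_tokens` is a hypothetical helper name, not part of the dataset or of any library.

```python
# The example instance shown above, copied as-is.
example = {
    "id": "0",
    "tokens": ["प्राचीन", "समय", "में", "उड़ीसा", "को", "कलिंग", "के",
               "नाम", "से", "जाना", "जाता", "था", "।"],
    "ner_tags": [0, 0, 0, 3, 0, 3, 0, 0, 0, 0, 0, 0, 0],
}


def entity_tokens(instance, outside_id=0):
    """Return (token, tag_id) pairs whose tag differs from `outside_id`.

    Assumption: `outside_id` (0 by default) is the non-entity tag.
    """
    tokens, tags = instance["tokens"], instance["ner_tags"]
    if len(tokens) != len(tags):
        raise ValueError("tokens and ner_tags must have the same length")
    return [(tok, tag) for tok, tag in zip(tokens, tags) if tag != outside_id]


tagged = entity_tokens(example)
print(tagged)
```

If the tag column is stored as a sequence of `ClassLabel` values after loading with the `datasets` library, the integer ids can usually be mapped to label names through `features["ner_tags"].feature.names`; whether that applies here should be checked against the loaded dataset.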
Dataset | Train | Valid | Test |
original | 76025 | 10861 | 21722 |
collapsed | 76025 | 10861 | 21722 |
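The split proportions are not stated explicitly, but they can be worked out from the counts in the table. The short sketch below does that arithmetic; the dictionary keys are just illustrative names for the three columns.

```python
# Counts copied from the table above (identical for both variants).
split_sizes = {"train": 76025, "valid": 10861, "test": 21722}

total = sum(split_sizes.values())
shares = {name: size / total for name, size in split_sizes.items()}

print(f"total: {total}")
for name, share in shares.items():
    # Print each split's share of the whole as a percentage.
    print(f"{name}: {share:.1%}")
```

The counts work out to roughly a 70/10/20 train/valid/test split.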
This repository contains the Hindi Named Entity Recognition dataset (HiNER) published at the Language Resources and Evaluation Conference (LREC) in 2022. A pre-print is available on arXiv at https://arxiv.org/abs/2204.13743.
You need the 'datasets' package installed to use the HuggingFace datasets repository. Install it via pip with the following command:
pip install datasets
To use the original dataset with all the tags, please use:
from datasets import load_dataset
hiner = load_dataset('cfilt/HiNER-original')
To use the collapsed dataset with only PER, LOC, and ORG tags, please use:
from datasets import load_dataset
hiner = load_dataset('cfilt/HiNER-collapsed')
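For scripts that switch between the two variants, a small wrapper can keep the repository ids in one place. This is only a sketch: `HINER_REPOS`, `hiner_repo_id`, and `load_hiner` are hypothetical names introduced here, and the only library call used is `datasets.load_dataset` with the two repository ids given above.

```python
# Repository ids taken from the loading commands above.
HINER_REPOS = {
    "original": "cfilt/HiNER-original",    # all tags
    "collapsed": "cfilt/HiNER-collapsed",  # PER, LOC, and ORG tags only
}


def hiner_repo_id(variant):
    """Map a variant name ('original' or 'collapsed') to its repository id."""
    try:
        return HINER_REPOS[variant]
    except KeyError:
        raise ValueError(
            f"unknown variant {variant!r}; expected one of {sorted(HINER_REPOS)}"
        ) from None


def load_hiner(variant="collapsed"):
    """Load the chosen HiNER variant (requires network access)."""
    # Imported lazily so the helpers above work without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(hiner_repo_id(variant))
```

Calling `load_hiner("original")` would then be equivalent to the first command above, and `load_hiner("collapsed")` to the second.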
The dataset files in CoNLL format are also available in this Git repository under the data folder.
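The exact column layout of the CoNLL files is not described here, so the following is a minimal sketch under common CoNLL conventions: one token per line with whitespace-separated columns, the token in the first column, the tag in the last column, and a blank line between sentences. The function name `read_conll` and the tag strings in the sample are illustrative assumptions, not taken from the data folder.

```python
def read_conll(lines):
    """Parse CoNLL-style lines into a list of {'tokens', 'tags'} dicts.

    Assumed layout: first column is the token, last column is the tag,
    and blank lines separate sentences.
    """
    sentences = []
    tokens, tags = [], []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current sentence, if there is one.
            if tokens:
                sentences.append({"tokens": tokens, "tags": tags})
                tokens, tags = [], []
            continue
        parts = line.split()
        tokens.append(parts[0])
        tags.append(parts[-1])
    if tokens:
        # Flush the last sentence when the input lacks a trailing blank line.
        sentences.append({"tokens": tokens, "tags": tags})
    return sentences


# Tiny in-memory sample; the tag strings are hypothetical placeholders.
sample = """\
उड़ीसा B-LOC
को O

कलिंग B-LOC
"""

parsed = read_conll(sample.splitlines())
print(parsed)
```

To read an actual file, the same function can be given an open file handle, for example `read_conll(open(path, encoding="utf-8"))`, where `path` points to one of the files in the data folder.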
Our best performing models are hosted on the HuggingFace models repository:
HiNER was built from data extracted from various websites run by the Government of India that provide information in Hindi. The dataset was built for the task of Named Entity Recognition and was released to add new resources for Hindi, a language that has been under-served in Natural Language Processing.
Who are the source language producers? Various Government of India webpages.
This dataset was manually annotated by a single annotator over a long span of time.
Who are the annotators? Pallab Bhattacharjee
We ensured that there was no sensitive information present in the dataset. All the data points are curated from publicly available information.
The purpose of this dataset is to provide a large Hindi Named Entity Recognition dataset. Since the information (data points) has been obtained from public resources, we do not think there is a negative social impact in releasing this data.
Any biases contained in the data released by the Indian government are bound to be present in our data.
Pallab Bhattacharjee
CC-BY-SA 4.0
@misc{https://doi.org/10.48550/arxiv.2204.13743,
  doi = {10.48550/ARXIV.2204.13743},
  url = {https://arxiv.org/abs/2204.13743},
  author = {Murthy, Rudra and Bhattacharjee, Pallab and Sharnagat, Rahul and Khatri, Jyotsana and Kanojia, Diptesh and Bhattacharyya, Pushpak},
  keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences},
  title = {HiNER: A Large Hindi Named Entity Recognition Dataset},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}