Dataset: hausa_voa_ner
Tasks: token-classification
Languages: ha
Multilinguality: monolingual
Size: 1K<n<10K
Language creators: expert-generated
Annotation creators: expert-generated
Source datasets: original
License: cc-by-4.0

The Hausa VOA NER is a named entity recognition (NER) dataset for the Hausa language, based on the VOA Hausa news corpus.
[More Information Needed]
The language supported is Hausa.
In the raw data, sentences are separated by an empty line, and each line holds a tab-separated token and tag. A data point is one sentence:
{'id': '0', 'ner_tags': ['B-PER', 'O', 'O', 'B-LOC', 'O'], 'tokens': ['Trump', 'ya', 'ce', 'Rasha', 'ma']}
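The raw layout described above can be read with a few lines of Python. This is a minimal sketch, not the dataset's official loader: `parse_conll` is a hypothetical helper, and the column order (token first, tag second) is an assumption based on the description.

```python
def parse_conll(text):
    """Parse blank-line-separated sentences of tab-separated token/tag lines.

    Assumes each non-empty line is "<token>\t<tag>" (column order is an assumption).
    """
    instances = []
    tokens, tags = [], []

    def flush():
        # Store the sentence collected so far, using a running string id.
        if tokens:
            instances.append(
                {"id": str(len(instances)), "tokens": list(tokens), "ner_tags": list(tags)}
            )
            tokens.clear()
            tags.clear()

    for line in text.splitlines():
        line = line.strip()
        if not line:
            flush()  # an empty line ends the current sentence
            continue
        token, tag = line.split("\t")
        tokens.append(token)
        tags.append(tag)
    flush()  # the last sentence may lack a trailing empty line
    return instances


sample = "Trump\tB-PER\nya\tO\nce\tO\nRasha\tB-LOC\nma\tO\n"
parsed = parse_conll(sample)
print(parsed[0])
```

Each returned dictionary has the same three keys as the data point shown above.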
The NER tags correspond to this list:
"O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-DATE", "I-DATE"
The NER tags have the same format as in the CoNLL shared task: a B denotes the first item of a phrase and an I any non-initial word. There are four types of phrases: person names (PER), organizations (ORG), locations (LOC) and dates & times (DATE). (O) is used for tokens not considered part of any named entity.
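To illustrate the B/I/O scheme, the sketch below groups tagged tokens into entity spans. `extract_entities` is a hypothetical helper written for this card, not part of any library, and the second example sentence is made up for illustration.

```python
LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-DATE", "I-DATE"]


def extract_entities(tokens, tags):
    """Group tokens into (type, text) spans following the B-/I-/O convention."""
    entities = []
    current_type, current_tokens = None, []

    def close():
        # Emit the open entity, if any.
        if current_tokens:
            entities.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            close()  # a B- tag always starts a new entity
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and tag[2:] == current_type:
            current_tokens.append(token)  # continue the open entity
        else:
            # "O", or an I- tag with no matching open entity (treated as outside here)
            close()
            current_type, current_tokens = None, []
    close()
    return entities


print(extract_entities(["Trump", "ya", "ce", "Rasha", "ma"],
                       ["B-PER", "O", "O", "B-LOC", "O"]))
# Illustrative multi-token entity (made-up sentence fragment):
print(extract_entities(["Donald", "Trump", "ya", "ce"],
                       ["B-PER", "I-PER", "O", "O"]))
```

Dropping a stray I- tag is one design choice among several; some evaluation scripts instead treat it as the start of a new entity.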
The dataset is split into training (1,014 sentences), validation (145 sentences), and test (291 sentences) sets.
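As a quick check of the split sizes, the following sketch computes the total and each split's share, using only the sentence counts stated above.

```python
# Sentence counts as stated in this card.
split_sizes = {"train": 1014, "validation": 145, "test": 291}

total = sum(split_sizes.values())
shares = {name: size / total for name, size in split_sizes.items()}

print(total)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

This works out to roughly a 70/10/20 split over 1,450 sentences.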
The data was created to help introduce NER resources for a new language, Hausa.
[More Information Needed]
The dataset is based on the news domain and was crawled from VOA Hausa news.
[More Information Needed]
Who are the source language producers?
The dataset was collected from VOA Hausa news. Most of the texts used in creating the Hausa VOA NER are news stories from Nigeria, Niger Republic, United States, and other parts of the world.
[More Information Needed]
Named entity recognition annotation
Annotation process
[More Information Needed]
Who are the annotators?
The data was annotated by Jesujoba Alabi and David Adelani for the paper: Transfer Learning and Distant Supervision for Multilingual Transformer Models: A Study on African Languages.
The annotated datasets were developed by students of Saarland University, Saarbrücken, Germany.
The data is licensed under the Creative Commons Attribution 4.0 license.
@inproceedings{hedderich-etal-2020-transfer,
    title = "Transfer Learning and Distant Supervision for Multilingual Transformer Models: A Study on {A}frican Languages",
    author = "Hedderich, Michael A. and Adelani, David and Zhu, Dawei and Alabi, Jesujoba and Markus, Udia and Klakow, Dietrich",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.204",
    doi = "10.18653/v1/2020.emnlp-main.204",
    pages = "2580--2591",
}
Thanks to @dadelani for adding this dataset.