Dataset: yoruba_gv_ner
License: cc-by-3.0
Source datasets: original
Annotations creators: expert-generated
Language creators: expert-generated
Size: 1K<n<10K
Multilinguality: monolingual
Language: yo
Task: token-classification

The Yoruba GV NER is a named entity recognition (NER) dataset for the Yorùbá language, based on the Global Voices news corpus. Global Voices (GV) is a multilingual news platform covering over 50 languages, with articles contributed by journalists, translators, bloggers, and human rights activists from around the world. Most of the texts used in creating the Yoruba GV NER are translations from other languages into Yorùbá.
[More Information Needed]
The language supported is Yorùbá.
The raw data consists of sentences separated by an empty line, with one tab-separated token and tag per line. A data point looks like this:
{'id': '0', 'ner_tags': ['B-LOC', 'O', 'O', 'O', 'O'], 'tokens': ['Tanzania', 'fi', 'Ajìjàgbara', 'Ọmọ', 'Orílẹ̀-èdèe']}
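As an illustration of the file layout described above, the following is a minimal sketch of a reader for tab-separated token/tag lines with blank lines between sentences. The function name `parse_conll` and the in-memory `sample` string are assumptions made for this example; they are not part of the dataset's own loading script.

```python
from typing import Dict, List


def parse_conll(text: str) -> List[Dict[str, object]]:
    """Parse tab-separated token/tag lines; a blank line ends a sentence."""
    examples: List[Dict[str, object]] = []
    tokens: List[str] = []
    tags: List[str] = []

    def flush() -> None:
        # Store the sentence collected so far, using a running index as the id.
        if tokens:
            examples.append(
                {"id": str(len(examples)), "tokens": list(tokens), "ner_tags": list(tags)}
            )
            tokens.clear()
            tags.clear()

    for line in text.splitlines():
        if not line.strip():
            flush()
            continue
        token, tag = line.split("\t")
        tokens.append(token)
        tags.append(tag)
    flush()  # the last sentence may not be followed by a blank line
    return examples


# Hypothetical file content built from the example data point above.
sample = "Tanzania\tB-LOC\nfi\tO\nAjìjàgbara\tO\nỌmọ\tO\nOrílẹ̀-èdèe\tO\n\n"
examples = parse_conll(sample)
print(examples[0])
```

Each returned dictionary mirrors the `id`, `tokens`, and `ner_tags` fields of the data point shown above, with tags kept as strings.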
The NER tags correspond to this list:
"O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-DATE", "I-DATE"
The NER tags have the same format as in the CoNLL shared task: a B denotes the first item of a phrase and an I any non-initial word. There are four types of phrases: person names (PER), organizations (ORG), locations (LOC) and dates & times (DATE). (O) is used for tokens not considered part of any named entity.
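To make the B/I convention concrete, here is a small sketch that groups a tag sequence into entity spans. The helper `bio_to_spans` and the sample tag sequence are hypothetical, and treating a stray `I-` tag as the start of a span is a lenient assumption of this sketch, not something the dataset specifies.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Return (entity_type, start, end) spans; end is exclusive."""
    spans: List[Tuple[str, int, int]] = []
    start, current = None, None
    for i, tag in enumerate(tags):
        # The open span continues only on an "I-" tag of the same type.
        continues = tag.startswith("I-") and current == tag[2:]
        if current is not None and not continues:
            spans.append((current, start, i))
            start, current = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and current is None):
            # Assumption: an "I-" tag with no open span starts a new span.
            start, current = i, tag[2:]
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


# Hypothetical tag sequence using the tag set listed above.
tags = ["B-PER", "I-PER", "O", "B-LOC", "B-DATE", "I-DATE"]
spans = bio_to_spans(tags)
print(spans)
```

Applied to the example data point, whose tags are one `B-LOC` followed by four `O` tags, this yields a single LOC span covering the first token.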
The data is divided into training (19,421 tokens), validation (2,695 tokens), and test (5,235 tokens) splits.
The data was created to help introduce resources for a new language, Yorùbá.
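For a quick sense of the split proportions, the token counts above can be turned into percentages. This is only arithmetic on the reported figures; the variable names are chosen for this sketch.

```python
# Token counts per split, as reported above.
splits = {"train": 19421, "validation": 2695, "test": 5235}

total = sum(splits.values())
# Share of each split as a percentage of all tokens, rounded to one decimal place.
shares = {name: round(100 * count / total, 1) for name, count in splits.items()}

print(total)
print(shares)
```

The three splits add up to 27,351 tokens, with roughly 71% of them in the training split.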
[More Information Needed]
The dataset is based on the news domain and was crawled from Global Voices Yorùbá news.
[More Information Needed]
Who are the source language producers?
The texts were contributed by journalists, translators, bloggers, and human rights activists from around the world. Most of the texts used in creating the Yoruba GV NER are translations from other languages into Yorùbá.
[More Information Needed]
Who are the annotators?
The data was annotated by Jesujoba Alabi and David Adelani for the paper "Massive vs. Curated Embeddings for Low-Resourced Languages: the Case of Yorùbá and Twi".
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
The annotated datasets were developed by students of Saarland University, Saarbrücken, Germany.
The data is licensed under the Creative Commons Attribution 3.0 license.
@inproceedings{alabi-etal-2020-massive,
    title = "Massive vs. Curated Embeddings for Low-Resourced Languages: the Case of {Y}or{\`u}b{\'a} and {T}wi",
    author = "Alabi, Jesujoba and Amponsah-Kaakyire, Kwabena and Adelani, David and Espa{\~n}a-Bonet, Cristina",
    booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://www.aclweb.org/anthology/2020.lrec-1.335",
    pages = "2754--2762",
    language = "English",
    ISBN = "979-10-95546-34-4",
}
Thanks to @dadelani for adding this dataset.