数据集:
masakhane/masakhaner2
任务:
标记分类计算机处理:
multilingual大小:
1K<n<10K语言创建人:
expert-generated批注创建人:
expert-generated源数据集:
original许可:
afl-3.0MasakhaNER 2.0 is the largest publicly available high-quality dataset for named entity recognition (NER) in 20 African languages created by the Masakhane community.
Named entities are phrases that contain the names of persons, organizations, locations, times and quantities. Example:
[PER Wolff] , currently a journalist in [LOC Argentina] , played with [PER Del Bosque] in the final years of the seventies in [ORG Real Madrid] .
MasakhaNER 2.0 is a named entity dataset consisting of PER, ORG, LOC, and DATE entities annotated by Masakhane for 20 African languages
The train/validation/test sets are available for all the 20 languages.
For more details see https://arxiv.org/abs/2210.12391
[More Information Needed]
There are 20 languages available :
The examples look like this for Yorùbá:
from datasets import load_dataset data = load_dataset('masakhane/masakhaner2', 'yor') # Please, specify the language code # A data point consists of sentences seperated by empty line and tab-seperated tokens and tags. {'id': '0', 'ner_tags': [B-DATE, I-DATE, 0, 0, 0, 0, 0, B-PER, I-PER, I-PER, O, O, O, O], 'tokens': ['Wákàtí', 'méje', 'ti', 'ré', 'kọjá', 'lọ', 'tí', 'Luis', 'Carlos', 'Díaz', 'ti', 'di', 'awati', '.'] }
The NER tags correspond to this list:
"O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-DATE", "I-DATE",
In the NER tags, a B denotes the first item of a phrase and an I any non-initial word. There are four types of phrases: person names (PER), organizations (ORG), locations (LOC) and dates & time (DATE).
It is assumed that named entities are non-recursive and non-overlapping. In case a named entity is embedded in another named entity usually, only the top level entity is marked.
For all languages, there are three splits.
The original splits were named train , dev and test and they correspond to the train , validation and test splits.
The splits have the following sizes :
Language | train | validation | test |
---|---|---|---|
Bambara | 4463 | 638 | 1274 |
Ghomala | 3384 | 483 | 966 |
Ewe | 3505 | 501 | 1001 |
Fon. | 4343 | 621 | 1240 |
Hausa | 5716 | 816 | 1633 |
Igbo | 7634 | 1090 | 2181 |
Kinyarwanda | 7825 | 1118 | 2235 |
Luganda | 4942 | 706 | 1412 |
Luo | 5161 | 737 | 1474 |
Mossi | 4532 | 648 | 1613 |
Nigerian-Pidgin | 5646 | 806 | 1294 |
Chichewa | 6250 | 893 | 1785 |
chiShona | 6207 | 887 | 1773 |
Kiswahili | 6593 | 942 | 1883 |
Setswana | 3289 | 499 | 996 |
Akan/Twi | 4240 | 605 | 1211 |
Wolof | 4593 | 656 | 1312 |
isiXhosa | 5718 | 817 | 1633 |
Yoruba | 6877 | 983 | 1964 |
isiZulu | 5848 | 836 | 1670 |
The dataset was introduced to introduce new resources to 20 languages that were under-served for natural language processing.
[More Information Needed]
The source of the data is from the news domain, details can be found here https://arxiv.org/abs/2210.12391
Initial Data Collection and NormalizationThe articles were word-tokenized, information on the exact pre-processing pipeline is unavailable.
Who are the source language producers?The source language was produced by journalists and writers employed by the news agency and newspaper mentioned above.
Details can be found here https://arxiv.org/abs/2103.11811
Who are the annotators?Annotators were recruited from Masakhane
The data is sourced from newspaper source and only contains mentions of public figures or individuals
[More Information Needed]
[More Information Needed]
Users should keep in mind that the dataset only contains news text, which might limit the applicability of the developed systems to other domains.
The licensing status of the data is CC 4.0 Non-Commercial
Provide the BibTex -formatted reference for the dataset. For example:
@article{Adelani2022MasakhaNER2A, title={MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition}, author={David Ifeoluwa Adelani and Graham Neubig and Sebastian Ruder and Shruti Rijhwani and Michael Beukman and Chester Palen-Michel and Constantine Lignos and Jesujoba Oluwadara Alabi and Shamsuddeen Hassan Muhammad and Peter Nabende and Cheikh M. Bamba Dione and Andiswa Bukula and Rooweither Mabuya and Bonaventure F. P. Dossou and Blessing K. Sibanda and Happy Buzaaba and Jonathan Mukiibi and Godson Kalipe and Derguene Mbaye and Amelia Taylor and Fatoumata Kabore and Chris C. Emezue and Anuoluwapo Aremu and Perez Ogayo and Catherine W. Gitau and Edwin Munkoh-Buabeng and Victoire Memdjokam Koagne and Allahsera Auguste Tapo and Tebogo Macucwa and Vukosi Marivate and Elvis Mboning and Tajuddeen R. Gwadabe and Tosin P. Adewumi and Orevaoghene Ahia and Joyce Nakatumba-Nabende and Neo L. Mokono and Ignatius M Ezeani and Chiamaka Ijeoma Chukwuneke and Mofetoluwa Adeyemi and Gilles Hacheme and Idris Abdulmumin and Odunayo Ogundepo and Oreen Yousuf and Tatiana Moteu Ngoli and Dietrich Klakow}, journal={ArXiv}, year={2022}, volume={abs/2210.12391} }
Thanks to @dadelani for adding this dataset.