Dataset: masakhane/masakhanews
Tasks: text-classification
Sub-tasks: topic-classification
Multilinguality: multilingual
Size: 1K<n<10K
Language creators: expert-generated
Annotation creators: expert-generated
Source datasets: original
License: afl-3.0

MasakhaNEWS is the largest publicly available dataset for news topic classification in 16 languages widely spoken in Africa.
The train/validation/test sets are available for all the 16 languages.
[More Information Needed]
There are 16 languages available; they are listed in the split-size table below.
The examples look like this for Yorùbá:

    from datasets import load_dataset

    # Please specify the language code
    data = load_dataset('masakhane/masakhanews', 'yor')

    # A data point example is below:
    {
        'label': 0,
        'headline': "'The barriers to entry have gone - go for it now'",
        'text': "j Lalvani, CEO of Vitabiotics and former Dragons' Den star, shares his business advice for our CEO Secrets series.\nProduced, filmed and edited by Dougal Shaw",
        'headline_text': "'The barriers to entry have gone - go for it now' j Lalvani, CEO of Vitabiotics and former Dragons' Den star, shares his business advice for our CEO Secrets series.\nProduced, filmed and edited by Dougal Shaw",
        'url': '/news/business-61880859'
    }

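
The fields of the data point above can be read as follows. This is a minimal sketch rather than official tooling: it assumes that the integer `label` indexes into the topic list given in this card in the order the topics are listed, and that `headline_text` is simply the headline and the text joined by a single space, as the example suggests. The names `TOPICS`, `topic_name` and `join_headline_and_text` are hypothetical helpers introduced for illustration.

```python
# Topic names copied from this card.
# Assumption: label ids follow the order in which the topics are listed.
TOPICS = [
    "business", "entertainment", "health", "politics",
    "religion", "sports", "technology",
]


def topic_name(label: int) -> str:
    """Map an integer label to its topic name (assumed ordering)."""
    return TOPICS[label]


def join_headline_and_text(example: dict) -> str:
    """Rebuild the combined field as headline + single space + text."""
    return f"{example['headline']} {example['text']}"


# The data point shown in this card, copied verbatim.
example = {
    "label": 0,
    "headline": "'The barriers to entry have gone - go for it now'",
    "text": "j Lalvani, CEO of Vitabiotics and former Dragons' Den star, shares his business advice for our CEO Secrets series.\nProduced, filmed and edited by Dougal Shaw",
    "headline_text": "'The barriers to entry have gone - go for it now' j Lalvani, CEO of Vitabiotics and former Dragons' Den star, shares his business advice for our CEO Secrets series.\nProduced, filmed and edited by Dougal Shaw",
    "url": "/news/business-61880859",
}

print(topic_name(example["label"]))
print(join_headline_and_text(example) == example["headline_text"])
```

When working with a loaded dataset, it is safer to read the label names from the dataset's own feature metadata than to rely on a hard-coded list like the one in this sketch.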
The news topics correspond to this list:
"business", "entertainment", "health", "politics", "religion", "sports", "technology"
For all languages, there are three splits.
The original splits were named `train`, `dev` and `test`, and they correspond to the `train`, `validation` and `test` splits.
The splits have the following sizes:
Language | train | validation | test |
---|---|---|---|
Amharic | 1311 | 188 | 376 |
English | 3309 | 472 | 948 |
French | 1476 | 211 | 422 |
Hausa | 2219 | 317 | 637 |
Igbo | 1356 | 194 | 390 |
Lingala | 608 | 87 | 175 |
Luganda | 771 | 110 | 223 |
Oromo | 1015 | 145 | 292 |
Nigerian-Pidgin | 1060 | 152 | 305 |
Rundi | 1117 | 159 | 322 |
chiShona | 1288 | 185 | 369 |
Somali | 1021 | 148 | 294 |
Kiswahili | 1658 | 237 | 476 |
Tigrinya | 947 | 137 | 272 |
isiXhosa | 1032 | 147 | 297 |
Yoruba | 1433 | 206 | 411 |
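
To see how each language is divided across the three splits, the sizes in the table can be turned into fractions. The sketch below uses three rows copied from the table; `SPLIT_SIZES` and `split_fractions` are hypothetical names introduced for illustration, and the other rows can be added in the same way. For these rows the fractions come out close to 0.7 / 0.1 / 0.2.

```python
# (train, validation, test) sizes copied from three rows of the table above.
SPLIT_SIZES = {
    "Amharic": (1311, 188, 376),
    "English": (3309, 472, 948),
    "Yoruba": (1433, 206, 411),
}


def split_fractions(sizes):
    """Return each split's share of the language's total, rounded to 3 places."""
    total = sum(sizes)
    return tuple(round(n / total, 3) for n in sizes)


for language, sizes in SPLIT_SIZES.items():
    train, validation, test = split_fractions(sizes)
    print(
        f"{language}: total={sum(sizes)} "
        f"train={train} validation={validation} test={test}"
    )
```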
The dataset was created to provide new resources for 16 languages that are under-served in natural language processing.
[More Information Needed]
The data comes from the news domain; details can be found here.
Initial Data Collection and Normalization
The articles were word-tokenized; information on the exact pre-processing pipeline is unavailable.
Who are the source language producers?
The source language was produced by journalists and writers employed by the news agency and newspaper mentioned above.
Details can be found here.
Who are the annotators?
Annotators were recruited from Masakhane.
The data is sourced from newspapers and only contains mentions of public figures or individuals.
[More Information Needed]
[More Information Needed]
Users should keep in mind that the dataset only contains news text, which might limit the applicability of the developed systems to other domains.
The licensing status of the data is CC 4.0 Non-Commercial.
BibTeX-formatted reference for the dataset:

    @article{Adelani2023MasakhaNEWS,
      title={MasakhaNEWS: News Topic Classification for African languages},
      author={David Ifeoluwa Adelani and Marek Masiak and Israel Abebe Azime and Jesujoba Oluwadara Alabi and Atnafu Lambebo Tonja and Christine Mwase and Odunayo Ogundepo and Bonaventure F. P. Dossou and Akintunde Oladipo and Doreen Nixdorf and Chris Chinenye Emezue and Sana Sabah al-azzawi and Blessing K. Sibanda and Davis David and Lolwethu Ndolela and Jonathan Mukiibi and Tunde Oluwaseyi Ajayi and Tatiana Moteu Ngoli and Brian Odhiambo and Abraham Toluwase Owodunni and Nnaemeka C. Obiefuna and Shamsuddeen Hassan Muhammad and Saheed Salahudeen Abdullahi and Mesay Gemeda Yigezu and Tajuddeen Gwadabe and Idris Abdulmumin and Mahlet Taye Bame and Oluwabusayo Olufunke Awoyomi and Iyanuoluwa Shode and Tolulope Anu Adelani and Habiba Abdulganiy Kailani and Abdul-Hakeem Omotayo and Adetola Adeeko and Afolabi Abeeb and Anuoluwapo Aremu and Olanrewaju Samuel and Clemencia Siro and Wangari Kimotho and Onyekachi Raphael Ogbu and Chinedu E. Mbonu and Chiamaka I. Chukwuneke and Samuel Fanijo and Jessica Ojo and Oyinkansola F. Awosan and Tadesse Kebede Guge and Sakayo Toadoum Sari and Pamela Nyatsine and Freedmore Sidume and Oreen Yousuf and Mardiyyah Oduwole and Ussen Kimanuka and Kanda Patrick Tshinu and Thina Diko and Siyanda Nxakama and Abdulmejid Tuni Johar and Sinodos Gebre and Muhidin Mohamed and Shafie Abdi Mohamed and Fuad Mire Hassan and Moges Ahmed Mehamed and Evrard Ngabire and Pontus Stenetorp},
      journal={ArXiv},
      year={2023},
      volume={}
    }

Thanks to @dadelani for adding this dataset.