Dataset:

masakhane/masakhanews

Dataset Card for MasakhaNEWS

Dataset Summary

MasakhaNEWS is the largest publicly available dataset for news topic classification in 16 languages widely spoken in Africa.

Train, validation, and test sets are available for all 16 languages.

Supported Tasks and Leaderboards

  • news topic classification : categorize news articles into news topics, e.g. business, sports, or politics.

Languages

There are 16 languages available:

  • Amharic (amh)
  • English (eng)
  • French (fra)
  • Hausa (hau)
  • Igbo (ibo)
  • Lingala (lin)
  • Luganda (lug)
  • Oromo (orm)
  • Nigerian Pidgin (pcm)
  • Rundi (run)
  • chiShona (sna)
  • Somali (som)
  • Kiswahili (swa)
  • Tigrinya (tir)
  • isiXhosa (xho)
  • Yorùbá (yor)

Dataset Structure

Data Instances

The examples look like this for Yorùbá:

from datasets import load_dataset

# Specify the language code as the second argument (here 'yor' for Yorùbá)
data = load_dataset('masakhane/masakhanews', 'yor')

# A data point example is below:

{
'label': 0, 
'headline': "'The barriers to entry have gone - go for it now'", 
'text': "j Lalvani, CEO of Vitabiotics and former Dragons' Den star, shares his business advice for our CEO Secrets series.\nProduced, filmed and edited by Dougal Shaw", 
'headline_text': "'The barriers to entry have gone - go for it now' j Lalvani, CEO of Vitabiotics and former Dragons' Den star, shares his business advice for our CEO Secrets series.\nProduced, filmed and edited by Dougal Shaw", 
'url': '/news/business-61880859'
}

Data Fields

  • label : news topic id
  • headline : news title/headline
  • text : news body
  • headline_text : concatenation of headline and news body
  • url : website address
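Judging from the data point shown above, headline_text appears to be the headline and the news body joined by a single space. The following is a minimal Python sketch under that assumption; the helper name build_headline_text is hypothetical and is not part of the dataset or of the datasets library.

```python
def build_headline_text(headline: str, text: str) -> str:
    """Join headline and body with one space (separator assumed from the example)."""
    return f"{headline} {text}"


# Fields copied from the data point shown above.
record = {
    "label": 0,
    "headline": "'The barriers to entry have gone - go for it now'",
    "text": (
        "j Lalvani, CEO of Vitabiotics and former Dragons' Den star, "
        "shares his business advice for our CEO Secrets series.\n"
        "Produced, filmed and edited by Dougal Shaw"
    ),
    "url": "/news/business-61880859",
}

# Rebuild the combined field from its two parts.
rebuilt = build_headline_text(record["headline"], record["text"])
print(rebuilt)
```

A classifier that should see both the title and the body can use headline_text directly; the sketch only illustrates how the three text fields relate.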

The news topics correspond to this list:

"business", "entertainment", "health", "politics", "religion", "sports", "technology"
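A small Python sketch of how a label id could be turned back into a topic name. It assumes the ids are zero-based indices into the list above, in the order shown, which is consistent with the example data point (label 0, business URL) but is not stated explicitly in this card. The names TOPICS, label_to_topic and topic_to_label are illustrative only.

```python
# Topic names in the order listed in this card (assumed to match label ids 0..6).
TOPICS = [
    "business",
    "entertainment",
    "health",
    "politics",
    "religion",
    "sports",
    "technology",
]


def label_to_topic(label: int) -> str:
    """Map a label id to its topic name, rejecting out-of-range ids."""
    if not 0 <= label < len(TOPICS):
        raise ValueError(f"unknown label id: {label}")
    return TOPICS[label]


def topic_to_label(topic: str) -> int:
    """Map a topic name back to its label id."""
    return TOPICS.index(topic)


print(label_to_topic(0))
print(topic_to_label("technology"))
```

If the loaded dataset exposes the label column as a ClassLabel feature, the names stored on that feature would be the authoritative mapping and should take precedence over this hard-coded list.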

Data Splits

For all languages, there are three splits.

The original splits were named train, dev and test; they correspond to the train, validation and test splits, respectively.

The splits have the following sizes:

Language          train   validation   test
Amharic            1311          188    376
English            3309          472    948
French             1476          211    422
Hausa              2219          317    637
Igbo               1356          194    390
Lingala             608           87    175
Luganda             771          110    223
Oromo              1015          145    292
Nigerian Pidgin    1060          152    305
Rundi              1117          159    322
chiShona           1288          185    369
Somali             1021          148    294
Kiswahili          1658          237    476
Tigrinya            947          137    272
isiXhosa           1032          147    297
Yorùbá             1433          206    411
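As a quick sanity check on the split sizes, the sketch below computes the total number of articles per language and the share of each split. The numbers are copied from the table above and keyed by the language codes from the Languages section; the names SPLIT_SIZES and split_fractions are illustrative and not part of the dataset.

```python
# (train, validation, test) sizes per language code, copied from the table above.
SPLIT_SIZES = {
    "amh": (1311, 188, 376),
    "eng": (3309, 472, 948),
    "fra": (1476, 211, 422),
    "hau": (2219, 317, 637),
    "ibo": (1356, 194, 390),
    "lin": (608, 87, 175),
    "lug": (771, 110, 223),
    "orm": (1015, 145, 292),
    "pcm": (1060, 152, 305),
    "run": (1117, 159, 322),
    "sna": (1288, 185, 369),
    "som": (1021, 148, 294),
    "swa": (1658, 237, 476),
    "tir": (947, 137, 272),
    "xho": (1032, 147, 297),
    "yor": (1433, 206, 411),
}


def split_fractions(sizes):
    """Return each split's share of the language's total, rounded to 3 places."""
    total = sum(sizes)
    return tuple(round(n / total, 3) for n in sizes)


# Total number of articles per language across all three splits.
totals = {code: sum(sizes) for code, sizes in SPLIT_SIZES.items()}

for code, sizes in SPLIT_SIZES.items():
    print(code, totals[code], split_fractions(sizes))
```

For Amharic, for instance, this gives 1875 articles in total, split at roughly 70% train, 10% validation and 20% test.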

Dataset Creation

Curation Rationale

The dataset was created to provide new resources for 16 languages that are under-served in natural language processing.

Source Data

The data comes from the news domain. Details can be found here ****

Initial Data Collection and Normalization

The articles were word-tokenized; information on the exact pre-processing pipeline is unavailable.

Who are the source language producers?

The source language was produced by journalists and writers employed by the news agency and newspaper mentioned above.

Annotations

Annotation process

Details can be found here **

Who are the annotators?

Annotators were recruited from Masakhane.

Personal and Sensitive Information

The data is sourced from newspapers and only contains mentions of public figures or individuals.

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

Users should keep in mind that the dataset only contains news text, which might limit the applicability of the developed systems to other domains.

Additional Information

Dataset Curators

Licensing Information

The data is licensed under CC 4.0 Non-Commercial.

Citation Information

The BibTeX-formatted reference for the dataset is:

@article{Adelani2023MasakhaNEWS,
  title={MasakhaNEWS: News Topic Classification for African languages},
  author={David Ifeoluwa Adelani and Marek Masiak and Israel Abebe Azime and Jesujoba Oluwadara Alabi and Atnafu Lambebo Tonja and Christine Mwase and Odunayo Ogundepo and Bonaventure F. P. Dossou and Akintunde Oladipo and Doreen Nixdorf and Chris Chinenye Emezue and Sana Sabah al-azzawi and Blessing K. Sibanda and Davis David and Lolwethu Ndolela and Jonathan Mukiibi and Tunde Oluwaseyi Ajayi and Tatiana Moteu Ngoli and Brian Odhiambo and Abraham Toluwase Owodunni and Nnaemeka C. Obiefuna and Shamsuddeen Hassan Muhammad and Saheed Salahudeen Abdullahi and Mesay Gemeda Yigezu and Tajuddeen Gwadabe and Idris Abdulmumin and Mahlet Taye Bame and Oluwabusayo Olufunke Awoyomi and Iyanuoluwa Shode and Tolulope Anu Adelani and Habiba Abdulganiy Kailani and Abdul-Hakeem Omotayo and Adetola Adeeko and Afolabi Abeeb and Anuoluwapo Aremu and Olanrewaju Samuel and Clemencia Siro and Wangari Kimotho and Onyekachi Raphael Ogbu and Chinedu E. Mbonu and Chiamaka I. Chukwuneke and Samuel Fanijo and Jessica Ojo and Oyinkansola F. Awosan and Tadesse Kebede Guge and Sakayo Toadoum Sari and Pamela Nyatsine and Freedmore Sidume and Oreen Yousuf and Mardiyyah Oduwole and Ussen Kimanuka and Kanda Patrick Tshinu and Thina Diko and Siyanda Nxakama and Abdulmejid Tuni Johar and Sinodos Gebre and Muhidin Mohamed and Shafie Abdi Mohamed and Fuad Mire Hassan and Moges Ahmed Mehamed and Evrard Ngabire and Pontus Stenetorp},
  journal={ArXiv},
  year={2023},
  volume={}
}

Contributions

Thanks to @dadelani for adding this dataset.