Dataset:

masakhane/masakhanews

Tasks:

text-classification

Sub-tasks:

topic-classification

Languages:

Multilinguality:

multilingual

Size:

1K<n<10K

Language creators:

expert-generated

Annotation creators:

expert-generated

Source datasets:

original

Other:

news-topic masakhanews masakhane

License:

afl-3.0

Dataset Card for MasakhaNEWS

Dataset Summary

MasakhaNEWS is the largest publicly available dataset for news topic classification in 16 languages widely spoken in Africa.

Train, validation, and test sets are available for all 16 languages.

Supported Tasks and Leaderboards

News topic classification: categorize news articles into news topics, e.g. business, sports, or politics.

Languages

There are 16 languages available:

Amharic (amh)
English (eng)
French (fra)
Hausa (hau)
Igbo (ibo)
Lingala (lin)
Luganda (lug)
Oromo (orm)
Nigerian Pidgin (pcm)
Rundi (run)
chiShona (sna)
Somali (som)
Kiswahili (swa)
Tigrinya (tir)
isiXhosa (xho)
Yorùbá (yor)
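
Since each language is selected by its code when loading, it can help to validate the code first. The following is a minimal sketch: the mapping is copied from the list above, and `check_language_code` is a hypothetical helper, not part of the dataset or of any library.

```python
# Language codes and names copied from the list above.
LANGUAGE_CODES = {
    "amh": "Amharic",
    "eng": "English",
    "fra": "French",
    "hau": "Hausa",
    "ibo": "Igbo",
    "lin": "Lingala",
    "lug": "Luganda",
    "orm": "Oromo",
    "pcm": "Nigerian Pidgin",
    "run": "Rundi",
    "sna": "chiShona",
    "som": "Somali",
    "swa": "Kiswahili",
    "tir": "Tigrinya",
    "xho": "isiXhosa",
    "yor": "Yorùbá",
}


def check_language_code(code: str) -> str:
    """Return the normalized code, or raise ValueError if it is not listed.

    Hypothetical helper for illustration only.
    """
    normalized = code.strip().lower()
    if normalized not in LANGUAGE_CODES:
        known = ", ".join(sorted(LANGUAGE_CODES))
        raise ValueError(f"Unknown language code {code!r}; expected one of: {known}")
    return normalized
```

The returned code can then be passed as the second argument to `load_dataset`.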

Dataset Structure

Data Instances

Each language is loaded by its code, for example Yorùbá (yor):

from datasets import load_dataset
# Specify the language code as the second argument
data = load_dataset('masakhane/masakhanews', 'yor')

# A data point example is below:

{
'label': 0, 
'headline': "'The barriers to entry have gone - go for it now'", 
'text': "j Lalvani, CEO of Vitabiotics and former Dragons' Den star, shares his business advice for our CEO Secrets series.\nProduced, filmed and edited by Dougal Shaw", 
'headline_text': "'The barriers to entry have gone - go for it now' j Lalvani, CEO of Vitabiotics and former Dragons' Den star, shares his business advice for our CEO Secrets series.\nProduced, filmed and edited by Dougal Shaw", 
'url': '/news/business-61880859'
}

Data Fields

label : news topic id
headline : news title/headline
text : news body
headline_text : concatenation of headline and news body
url : website address
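
In the data point shown above, `headline_text` equals the headline and the body joined by a single space. The sketch below reproduces that for the sample record; `build_headline_text` is a hypothetical helper, and whether every record in the dataset uses exactly this separator is an assumption.

```python
# Sample record copied from the data point above.
record = {
    "label": 0,
    "headline": "'The barriers to entry have gone - go for it now'",
    "text": "j Lalvani, CEO of Vitabiotics and former Dragons' Den star, shares his business advice for our CEO Secrets series.\nProduced, filmed and edited by Dougal Shaw",
    "headline_text": "'The barriers to entry have gone - go for it now' j Lalvani, CEO of Vitabiotics and former Dragons' Den star, shares his business advice for our CEO Secrets series.\nProduced, filmed and edited by Dougal Shaw",
    "url": "/news/business-61880859",
}


def build_headline_text(headline: str, text: str) -> str:
    """Join headline and body with one space (assumed separator)."""
    return f"{headline} {text}"


rebuilt = build_headline_text(record["headline"], record["text"])
print(rebuilt == record["headline_text"])
```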

The news topics correspond to this list:

"business", "entertainment", "health", "politics", "religion", "sports", "technology"
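
A small sketch of converting between label ids and topic names. It assumes the ids are zero-based positions in the list above, which is consistent with the sample record (label 0, a business URL) but is not stated explicitly in this card. `label_to_topic` and `topic_to_label` are hypothetical helper names.

```python
# Topic names in the order given above; the id is assumed to be the list index.
TOPICS = [
    "business",
    "entertainment",
    "health",
    "politics",
    "religion",
    "sports",
    "technology",
]


def label_to_topic(label: int) -> str:
    """Map a label id to its topic name (assumes zero-based ids)."""
    if not 0 <= label < len(TOPICS):
        raise ValueError(f"label must be in 0..{len(TOPICS) - 1}, got {label}")
    return TOPICS[label]


def topic_to_label(topic: str) -> int:
    """Map a topic name back to its label id."""
    return TOPICS.index(topic.strip().lower())


print(label_to_topic(0))
print(topic_to_label("sports"))
```

If the label column is stored as a `ClassLabel` feature, the names exposed by the loaded dataset's features are the authoritative source and should be preferred over a hard-coded list.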

Data Splits

For all languages, there are three splits.

The original splits were named train, dev, and test; they correspond to the train, validation, and test splits.

The splits have the following sizes:

Language	train	validation	test
Amharic	1311	188	376
English	3309	472	948
French	1476	211	422
Hausa	2219	317	637
Igbo	1356	194	390
Lingala	608	87	175
Luganda	771	110	223
Oromo	1015	145	292
Nigerian Pidgin	1060	152	305
Rundi	1117	159	322
chiShona	1288	185	369
Somali	1021	148	294
Kiswahili	1658	237	476
Tigrinya	947	137	272
isiXhosa	1032	147	297
Yorùbá	1433	206	411
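
The table can be summarized with a few lines of code. This sketch hard-codes the sizes from the table above (it does not download the dataset) and computes per-split totals and each language's test share; `split_totals` is a hypothetical helper.

```python
# (train, validation, test) sizes copied from the table above.
SPLIT_SIZES = {
    "Amharic": (1311, 188, 376),
    "English": (3309, 472, 948),
    "French": (1476, 211, 422),
    "Hausa": (2219, 317, 637),
    "Igbo": (1356, 194, 390),
    "Lingala": (608, 87, 175),
    "Luganda": (771, 110, 223),
    "Oromo": (1015, 145, 292),
    "Nigerian Pidgin": (1060, 152, 305),
    "Rundi": (1117, 159, 322),
    "chiShona": (1288, 185, 369),
    "Somali": (1021, 148, 294),
    "Kiswahili": (1658, 237, 476),
    "Tigrinya": (947, 137, 272),
    "isiXhosa": (1032, 147, 297),
    "Yorùbá": (1433, 206, 411),
}


def split_totals(sizes):
    """Sum each split across all languages."""
    train, validation, test = (sum(column) for column in zip(*sizes.values()))
    return {"train": train, "validation": validation, "test": test}


totals = split_totals(SPLIT_SIZES)
print(totals, sum(totals.values()))

# Share of each language's examples that is held out for testing.
for language, (train, validation, test) in SPLIT_SIZES.items():
    share = test / (train + validation + test)
    print(f"{language}: {share:.1%} test")
```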

Dataset Creation

Curation Rationale

The dataset was created to provide new resources for 16 languages that are under-served in natural language processing.

Source Data

The data comes from the news domain; details can be found here ****

Initial Data Collection and Normalization

The articles were word-tokenized; information on the exact pre-processing pipeline is unavailable.

Who are the source language producers?

The source language was produced by journalists and writers employed by the news agency and newspaper mentioned above.

Annotations

Annotation process

Details can be found here **

Who are the annotators?

Annotators were recruited from Masakhane.

Personal and Sensitive Information

The data is sourced from newspapers and only contains mentions of public figures or individuals.

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

Users should keep in mind that the dataset only contains news text, which might limit the applicability of the developed systems to other domains.

Additional Information

Dataset Curators

Licensing Information

The licensing status of the data is CC 4.0 Non-Commercial

Citation Information

BibTeX-formatted reference for the dataset:

@article{Adelani2023MasakhaNEWS,
  title={MasakhaNEWS: News Topic Classification for African languages},
  author={David Ifeoluwa Adelani and Marek Masiak and Israel Abebe Azime and Jesujoba Oluwadara Alabi and Atnafu Lambebo Tonja and Christine Mwase and Odunayo Ogundepo and Bonaventure F. P. Dossou and Akintunde Oladipo and Doreen Nixdorf and Chris Chinenye Emezue and Sana Sabah al-azzawi and Blessing K. Sibanda and Davis David and Lolwethu Ndolela and Jonathan Mukiibi and Tunde Oluwaseyi Ajayi and Tatiana Moteu Ngoli and Brian Odhiambo and Abraham Toluwase Owodunni and Nnaemeka C. Obiefuna and Shamsuddeen Hassan Muhammad and Saheed Salahudeen Abdullahi and Mesay Gemeda Yigezu and Tajuddeen Gwadabe and Idris Abdulmumin and Mahlet Taye Bame and Oluwabusayo Olufunke Awoyomi and Iyanuoluwa Shode and Tolulope Anu Adelani and Habiba Abdulganiy Kailani and Abdul-Hakeem Omotayo and Adetola Adeeko and Afolabi Abeeb and Anuoluwapo Aremu and Olanrewaju Samuel and Clemencia Siro and Wangari Kimotho and Onyekachi Raphael Ogbu and Chinedu E. Mbonu and Chiamaka I. Chukwuneke and Samuel Fanijo and Jessica Ojo and Oyinkansola F. Awosan and Tadesse Kebede Guge and Sakayo Toadoum Sari and Pamela Nyatsine and Freedmore Sidume and Oreen Yousuf and Mardiyyah Oduwole and Ussen Kimanuka and Kanda Patrick Tshinu and Thina Diko and Siyanda Nxakama and Abdulmejid Tuni Johar and Sinodos Gebre and Muhidin Mohamed and Shafie Abdi Mohamed and Fuad Mire Hassan and Moges Ahmed Mehamed and Evrard Ngabire and Pontus Stenetorp},
  journal={ArXiv},
  year={2023},
  volume={}
}

Contributions

Thanks to @dadelani for adding this dataset.

Author:

masakhane

Dataset size:

17.35 KB