Datasets:

kinnews_kirnews

Languages:

rn rw

Multilinguality:

monolingual

Language Creators:

found

Annotations Creators:

expert-generated

Source Datasets:

original

ArXiv:

arxiv:2010.12174

License:

mit

Dataset Card for kinnews_kirnews

Dataset Summary

KINNEWS and KIRNEWS are news classification datasets for Kinyarwanda and Kirundi, respectively. Both were collected from Rwandan and Burundian news websites and newspapers for low-resource monolingual and cross-lingual multiclass classification tasks.

Supported Tasks and Leaderboards

This dataset can be used for text classification of news articles in the Kinyarwanda and Kirundi languages. Each news article is assigned to one of 14 possible classes. The classes include:

  • politics
  • sport
  • economy
  • health
  • entertainment
  • history
  • technology
  • culture
  • religion
  • environment
  • education
  • relationship

Languages

Kinyarwanda and Kirundi

Dataset Structure

Data Instances

Here is an example from the dataset:

Field                Value
label                1
kin_label/kir_label  'inkino'
url                  ' https://nawe.bi/Primus-Ligue-Imirwi-igiye-guhura-gute-ku-ndwi-ya-6-y-ihiganwa.html'
title                'Primus Ligue\xa0: Imirwi igiye guhura gute ku ndwi ya 6 y’ihiganwa\xa0?'
content              ' Inkino zitegekanijwe kuruno wa gatandatu igenekerezo rya 14 Nyakanga umwaka wa 2019...'
en_label             'sport'
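
The example shows that raw strings can carry leading spaces and non-breaking spaces (\xa0). The following is a minimal sketch of tidying such a record before use, assuming plain whitespace cleanup is all that is wanted. The helper name normalize_text is hypothetical and is not part of the dataset's own tooling.

```python
def normalize_text(text: str) -> str:
    """Replace non-breaking spaces with regular spaces and trim both ends."""
    return text.replace("\xa0", " ").strip()


# Values copied from the example record; the truncated content field is omitted.
example = {
    "label": 1,
    "url": " https://nawe.bi/Primus-Ligue-Imirwi-igiye-guhura-gute-ku-ndwi-ya-6-y-ihiganwa.html",
    "title": "Primus Ligue\xa0: Imirwi igiye guhura gute ku ndwi ya 6 y’ihiganwa\xa0?",
    "en_label": "sport",
}

# Only string fields are normalized; the integer label is left untouched.
cleaned = {
    key: normalize_text(value) if isinstance(value, str) else value
    for key, value in example.items()
}

print(repr(cleaned["title"]))
```

Whether this normalization is appropriate depends on the downstream tokenizer; some pipelines may prefer to keep the original characters.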

Data Fields

The raw version of the Kinyarwanda data consists of the following fields:

  • label: The category of the news article
  • kin_label/kir_label: The associated label in Kinyarwanda/Kirundi language
  • en_label: The associated label in English
  • url: The URL of the news article
  • title: The title of the news article
  • content: The content of the news article

The cleaned version contains only the label, title, and content fields.
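
The difference in schema between the raw and cleaned versions can be sketched as a simple field projection. This is only an illustration of which fields survive: the helper to_clean_schema and the placeholder record are assumptions, and the sketch does not reproduce any text preprocessing that the cleaned version may also have undergone.

```python
# Fields that the cleaned version keeps, per the description above.
CLEAN_FIELDS = ("label", "title", "content")


def to_clean_schema(raw_record: dict) -> dict:
    """Keep only the fields that the cleaned version exposes."""
    return {field: raw_record[field] for field in CLEAN_FIELDS}


# Placeholder record with the raw field layout (values are illustrative).
raw_record = {
    "label": 1,
    "kin_label": "inkino",
    "en_label": "sport",
    "url": "https://example.org/article",  # placeholder URL, not from the dataset
    "title": "placeholder title",
    "content": "placeholder content",
}

clean_record = to_clean_schema(raw_record)
print(sorted(clean_record))
```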

Data Splits

Language     Version  Train  Test
Kinyarwanda  Raw      17014  4254
Kinyarwanda  Clean    17014  4254
Kirundi      Raw      3689   923
Kirundi      Clean    3689   923
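
As a quick check on the counts above, the short sketch below computes the total number of articles per language and the fraction held out for testing. The counts are taken from the table; the helper name held_out_fraction is hypothetical. Both languages work out to roughly an 80/20 train/test split.

```python
# Split sizes from the table (raw and clean versions share the same counts).
SPLITS = {
    "Kinyarwanda": {"train": 17014, "test": 4254},
    "Kirundi": {"train": 3689, "test": 923},
}


def held_out_fraction(counts: dict) -> float:
    """Return the share of articles that fall in the test split."""
    return counts["test"] / (counts["train"] + counts["test"])


totals = {lang: c["train"] + c["test"] for lang, c in SPLITS.items()}
fractions = {lang: held_out_fraction(c) for lang, c in SPLITS.items()}

for lang in SPLITS:
    print(f"{lang}: {totals[lang]} articles, {fractions[lang]:.1%} in the test split")
```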

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

[More Information Needed]

Citation Information

@article{niyongabo2020kinnews,
  title={KINNEWS and KIRNEWS: Benchmarking Cross-Lingual Text Classification for Kinyarwanda and Kirundi},
  author={Niyongabo, Rubungo Andre and Qu, Hong and Kreutzer, Julia and Huang, Li},
  journal={arXiv preprint arXiv:2010.12174},
  year={2020}
}

Contributions

Thanks to @saradhix for adding this dataset.