The News Dataset is an English-language dataset containing just over 4k unique news articles scraped from Arise TV, one of the most popular television news channels in Nigeria.
It supports news article classification into different categories.
English
''' {'Title': 'Nigeria: APC Yet to Zone Party Positions Ahead of Convention', 'Excerpt': 'The leadership of the All Progressives Congress (APC), has denied reports that it had zoned some party positions ahead of', 'Category': 'politics', 'labels': 2} '''
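As a minimal sketch of how such an instance might be fed to a text classifier, the snippet below writes the record above as a Python dict and turns it into a `(text, label)` pair. Joining `Title` and `Excerpt` into one string is an assumption made for illustration, not something the dataset prescribes, and `to_model_input` is a hypothetical helper name.

```python
from typing import Dict, Tuple, Union

Record = Dict[str, Union[str, int]]

# The instance shown above, written as a Python dict.
example: Record = {
    "Title": "Nigeria: APC Yet to Zone Party Positions Ahead of Convention",
    "Excerpt": (
        "The leadership of the All Progressives Congress (APC), has denied "
        "reports that it had zoned some party positions ahead of"
    ),
    "Category": "politics",
    "labels": 2,
}


def to_model_input(record: Record) -> Tuple[str, int]:
    """Hypothetical helper: join title and excerpt, return them with the integer label."""
    # Assumption: title and excerpt are concatenated into a single input string.
    text = f"{record['Title']}. {record['Excerpt']}".strip()
    return text, int(record["labels"])


text, label = to_model_input(example)
```

Here `Category` holds the human-readable class name and `labels` its integer encoding; only the pair `politics` and `2` is confirmed by the instance above, so the rest of the mapping should be read from the data rather than assumed.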
Dataset Split | Number of instances in split |
---|---|
Train | 4,594 |
Paragraph | 811 |
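
For orientation, the share of instances in each split can be worked out from the counts in the table. This is a small arithmetic sketch; the split names are copied as they appear in the table, and the variable names are illustrative.

```python
# Instance counts copied from the table above (row names as listed there).
split_sizes = {"Train": 4594, "Paragraph": 811}

total = sum(split_sizes.values())

# Fraction of all instances that falls in each split.
fractions = {name: count / total for name, count in split_sizes.items()}

for name, fraction in fractions.items():
    print(f"{name}: {split_sizes[name]} instances ({fraction:.1%})")
```

The two rows sum to 5,405 instances, which is roughly an 85/15 division.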
The code used to create the dataset is available at https://github.com/chimaobi-okite/NLP-Projects-Competitions/blob/main/NewsCategorization/Data/NewsDataScraping.ipynb . The examples were scraped from https://www.arise.tv/
The annotation is based on the news category assigned to each article on the Arise TV website.
Who are the annotators? Journalists at Arise TV.
The purpose of this dataset is to help develop models that can classify news articles into categories.
This task is useful for efficiently organizing and presenting a large quantity of text. It should be made clear that any category labels produced by models trained on this dataset reflect the categories used on the source website, but are in fact automatically generated.
This data is biased towards news events in Nigeria, but a model built using it can also classify news from other parts of the world, with a slight degradation in performance.
The articles were written by staff at Arise TV; the dataset was scraped by @github-chimaobi-okite.