Dataset: kannada_news
Tasks: text classification
Sub-tasks: topic-classification
Languages: kn
Multilinguality: monolingual
Size: 1K<n<10K
Language creators: other
Annotations creators: other
Source datasets: original
License: cc-by-sa-4.0

The Kannada news dataset contains only the headlines of news articles in three categories: Entertainment, Tech, and Sports.
The dataset contains around 6300 news article headlines collected from Kannada news websites. It has been cleaned and split into train and validation sets, which can be used to benchmark topic classification models in Kannada.
[More Information Needed]
Kannada (kn)
The data comes in two files, train.csv and valid.csv. An example row of the dataset is shown below:
{ 'headline': 'ಫಿಫಾ ವಿಶ್ವಕಪ್ ಫೈನಲ್: ಅತಿರೇಕಕ್ಕೇರಿದ ಸಂಭ್ರಮಾಚರಣೆ; ಅಭಿಮಾನಿಗಳ ಹುಚ್ಚು ವರ್ತನೆಗೆ ವ್ಯಾಪಕ ಖಂಡನೆ', 'label':'sports' }
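A minimal sketch of reading rows in this shape with Python's standard `csv` module. It assumes the CSV files carry a header row with the column names `headline` and `label`, as the example row suggests; the actual file layout is not confirmed here. The helper name `read_headlines` and the in-memory sample are illustrative only.

```python
import csv
import io


def read_headlines(csv_file):
    """Return a list of {'headline': ..., 'label': ...} dicts from an open CSV file.

    Assumes a header row with 'headline' and 'label' columns (an assumption
    based on the example row, not a confirmed file layout).
    """
    reader = csv.DictReader(csv_file)
    return [{"headline": row["headline"], "label": row["label"]} for row in reader]


# Tiny in-memory stand-in for train.csv, built from the example row above.
sample_csv = io.StringIO(
    "headline,label\n"
    '"ಫಿಫಾ ವಿಶ್ವಕಪ್ ಫೈನಲ್: ಅತಿರೇಕಕ್ಕೇರಿದ ಸಂಭ್ರಮಾಚರಣೆ; ಅಭಿಮಾನಿಗಳ ಹುಚ್ಚು ವರ್ತನೆಗೆ ವ್ಯಾಪಕ ಖಂಡನೆ",sports\n'
)

rows = read_headlines(sample_csv)
print(len(rows), rows[0]["label"])
```

With the real files, the same helper could be called on `open("train.csv", encoding="utf-8", newline="")`; UTF-8 is assumed because the headlines are in Kannada script.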
NOTE: The data has very few examples on the technology (class label: 'tech') topic.
[More Information Needed]
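Because the 'tech' class is scarce, a classifier trained on this data may benefit from class weighting. The sketch below computes inverse-frequency weights, one common heuristic and not something the dataset authors prescribe. The label counts are made up for illustration and do not reflect the real distribution; the label string 'entertainment' is also an assumption, since only 'sports' and 'tech' appear in this card.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by total / (n_classes * class_count).

    Rare classes get weights above 1, frequent classes below 1.
    """
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    return {label: total / (n_classes * count) for label, count in counts.items()}


# Made-up labels for illustration only; NOT the real class distribution.
toy_labels = ["sports"] * 5 + ["entertainment"] * 4 + ["tech"] * 1

weights = inverse_frequency_weights(toy_labels)
for label, weight in sorted(weights.items()):
    print(f"{label}: {weight:.3f}")
```

Weights of this kind can be passed to most training losses that accept per-class weights, so that errors on the rare class count for more.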
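The data has two fields: headline (the text of the news headline) and label (its topic category, for example 'sports' or 'tech').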
The dataset is divided into two splits. All the headlines are scraped from news websites on the internet.
| | train | validation |
|---|---|---|
| Input Sentences | 5167 | 1293 |
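As a quick check on the split sizes in the table, the short sketch below computes the total and each split's share. The dictionary name `split_sizes` is just illustrative; the two counts are taken from the table.

```python
# Split sizes as listed in the table.
split_sizes = {"train": 5167, "validation": 1293}

total = sum(split_sizes.values())
fractions = {name: size / total for name, size in split_sizes.items()}

print(f"total: {total}")
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.1%}")
```

The two splits sum to 6460 headlines, in roughly an 80/20 ratio.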
[More Information Needed]
[More Information Needed]
Who are the source language producers?
[More Information Needed]
[More Information Needed]
Who are the annotators?
[More Information Needed]
[More Information Needed]
There is strikingly little data for South Indian languages, especially Kannada, available in digital formats that can be used for NLP purposes. Despite having roughly 38 million native speakers, Kannada is an under-represented language and will benefit from active contribution from the community.
This dataset can at least help expose people to Kannada and encourage further active participation, enabling continued progress and development.
[More Information Needed]
[More Information Needed]
[Gaurav Arora](https://github.com/goru001/nlp-for-kannada). The repository also provides some starter models and embeddings to help get started.
cc-by-sa-4.0
https://www.kaggle.com/disisbig/kannada-news-dataset
Thanks to @vrindaprabhu for adding this dataset.