Dataset:

kannada_news

Task:

text-classification

Subtask:

topic-classification

Languages:

kn

Multilinguality:

monolingual

Size:

1K<n<10K

Language creators:

other

Annotation creators:

other

Source datasets:

original

License:

cc-by-sa-4.0

Dataset Card for kannada_news dataset

Dataset Summary

The Kannada news dataset contains only the headlines of news articles in three categories: Entertainment, Tech, and Sports.

The dataset contains around 6300 news article headlines collected from Kannada news websites. It has been cleaned and split into a train set and a validation set, which can be used to benchmark topic classification models in Kannada.

Supported Tasks and Leaderboards

[More Information Needed]

Languages

Kannada (kn)

Dataset Structure

Data Instances

The data consists of two files, train.csv and valid.csv. An example row of the dataset is shown below:

{
  'headline': 'ಫಿಫಾ ವಿಶ್ವಕಪ್ ಫೈನಲ್: ಅತಿರೇಕಕ್ಕೇರಿದ ಸಂಭ್ರಮಾಚರಣೆ; ಅಭಿಮಾನಿಗಳ ಹುಚ್ಚು ವರ್ತನೆಗೆ ವ್ಯಾಪಕ ಖಂಡನೆ',
  'label': 'sports'
}
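A minimal sketch of reading such rows with Python's standard csv module. It assumes each CSV file starts with a header row naming the columns headline and label, which this card does not state explicitly. The helper name read_rows is hypothetical, and the in-memory sample stands in for train.csv.

```python
import csv
import io


def read_rows(fileobj):
    """Return each CSV row as a dict keyed by the header-row column names."""
    return list(csv.DictReader(fileobj))


# In-memory stand-in for train.csv (assumed header: headline,label).
sample = io.StringIO(
    "headline,label\n"
    '"ಫಿಫಾ ವಿಶ್ವಕಪ್ ಫೈನಲ್: ಅತಿರೇಕಕ್ಕೇರಿದ ಸಂಭ್ರಮಾಚರಣೆ; ಅಭಿಮಾನಿಗಳ ಹುಚ್ಚು ವರ್ತನೆಗೆ ವ್ಯಾಪಕ ಖಂಡನೆ",sports\n'
)
rows = read_rows(sample)
print(rows[0]["label"])
```

To read a real file, open it with encoding="utf-8" and newline="" and pass the file object to read_rows. The explicit encoding matters because the headlines are in Kannada script.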

NOTE: The data has very few examples on the technology (class label: 'tech') topic. [More Information Needed]
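Because of this imbalance, it may be worth checking the label distribution before training. The sketch below computes each label's share of the examples. The helper name label_shares is hypothetical, the toy labels are made up for illustration and do not reflect the real distribution, and the label string 'entertainment' is an assumed spelling (only 'sports' and 'tech' appear in this card).

```python
from collections import Counter


def label_shares(labels):
    """Map each label to its fraction of all examples."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


# Made-up labels for illustration only; not the real distribution.
toy_labels = ["sports", "sports", "entertainment", "tech"]
shares = label_shares(toy_labels)
print(shares)
```

A very small share for 'tech' would suggest reporting per-class metrics rather than accuracy alone.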

Data Fields

Each row has two fields:

headline: the headline text, in Kannada (string)
label: the class label the headline belongs to, in English (string)
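Since the label is stored as a string, most classifiers need it mapped to an integer id first. The following is a sketch under stated assumptions, not part of the dataset's tooling: encode_row and the id ordering are hypothetical, and 'entertainment' is an assumed label spelling (this card only shows 'sports' and 'tech').

```python
# 'sports' and 'tech' appear in this card; 'entertainment' is an assumed spelling.
LABELS = ("entertainment", "sports", "tech")
LABEL_TO_ID = {name: index for index, name in enumerate(LABELS)}


def encode_row(row):
    """Validate one {'headline', 'label'} row and return (headline, label_id)."""
    headline, label = row["headline"], row["label"]
    if not isinstance(headline, str) or not headline.strip():
        raise ValueError("headline must be a non-empty string")
    if label not in LABEL_TO_ID:
        raise ValueError(f"unknown label: {label!r}")
    return headline, LABEL_TO_ID[label]


print(encode_row({"headline": "ಫಿಫಾ ವಿಶ್ವಕಪ್ ಫೈನಲ್", "label": "sports"}))
```

Raising on unknown labels, rather than silently skipping them, makes it obvious when the real label strings differ from the assumed ones.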

Data Splits

The dataset is divided into two splits. All the headlines are scraped from news websites on the internet.

                 train   validation
Input Sentences  5167    1293
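The split sizes above work out to roughly an 80/20 train/validation split. A short sketch of the arithmetic, using only the two counts from the table (the variable names are illustrative):

```python
# Split sizes as reported in the table above.
SPLIT_SIZES = {"train": 5167, "validation": 1293}

total = sum(SPLIT_SIZES.values())
fractions = {name: size / total for name, size in SPLIT_SIZES.items()}

print(total)
for name, share in fractions.items():
    print(f"{name}: {share:.1%}")
```

The two splits sum to 6460 headlines, somewhat more than the "around 6300" quoted in the summary; the card does not explain the difference.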

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

There is strikingly little data for South Indian languages, especially Kannada, available in a digital format that can be used for NLP. Despite having roughly 38 million native speakers, Kannada is an under-represented language and would benefit from active contribution from the community.

This dataset can at least help expose more people to Kannada and encourage further active participation, enabling continued progress and development.

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

Gaurav Arora (https://github.com/goru001/nlp-for-kannada). The repository also provides some starter models and embeddings to help you get started.

Licensing Information

cc-by-sa-4.0

Citation Information

https://www.kaggle.com/disisbig/kannada-news-dataset

Contributions

Thanks to @vrindaprabhu for adding this dataset.

Author:

Anonymous

Dataset size:

11.26 KB