Dataset:

id_clickbait

Sub-tasks:

fact-checking

Languages:

id

Multilinguality:

monolingual

Size:

10K<n<100K

Language creators:

expert-generated

Annotation creators:

expert-generated

Source datasets:

original

License:

cc-by-4.0

Dataset Card for Indonesian Clickbait Headlines

Dataset Summary

The CLICK-ID dataset is a collection of Indonesian news headlines gathered from 12 local online news publishers: detikNews, Fimela, Kapanlagi, Kompas, Liputan6, Okezone, Posmetro-Medan, Republika, Sindonews, Tempo, Tribunnews, and Wowkeren. The dataset consists of two main parts: (i) 46,119 raw articles, and (ii) 15,000 sample headlines annotated for clickbait. Each headline was examined by three annotators, who based their judgments on the headline alone. The majority label is taken as the ground truth. The annotated sample contains 6,290 clickbait and 8,710 non-clickbait headlines.
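The majority-vote rule described above can be illustrated with a minimal sketch. The function name `majority_label` and the string labels are hypothetical and chosen only for illustration; this is not the authors' annotation tooling. It assumes three votes over two labels, so a strict majority always exists.

```python
from collections import Counter


def majority_label(votes):
    """Return the label given by most annotators.

    Assumes an odd number of votes over two labels, so ties cannot occur.
    """
    label, _count = Counter(votes).most_common(1)[0]
    return label


# Hypothetical votes from three annotators for one headline.
votes = ["clickbait", "non-clickbait", "clickbait"]
print(majority_label(votes))
```

With an even number of annotators or more than two labels, a tie-breaking rule would be needed; the sketch does not cover that case.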

Supported Tasks and Leaderboards

[More Information Needed]

Languages

Indonesian

Dataset Structure

Data Instances

An example of an annotated headline:

{
  'id': '100',
  'label': 1,
  'title': "SAH! Ini Daftar Nama Menteri Kabinet Jokowi - Ma'ruf Amin"
}

Data Fields

Annotated
  • id : id of the sample
  • title : the title of the news article
  • label : the label of the article, either non-clickbait or clickbait
Raw
  • id : id of the sample
  • title : the title of the news article
  • source : the name of the publisher/newspaper
  • date : the date of the article
  • category : the category of the article
  • sub-category : the subcategory of the article
  • content : the content of the article
  • url : the url of the article
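The annotated fields listed above can be sketched as a typed record. The names `AnnotatedHeadline`, `LABEL_NAMES`, and `label_name` are hypothetical, and the mapping of integer labels to names (0 = non-clickbait, 1 = clickbait) is an assumption that this card does not state explicitly; verify it against the loaded data before relying on it.

```python
from typing import TypedDict


class AnnotatedHeadline(TypedDict):
    id: str     # id of the sample (a string in the example instance)
    title: str  # the title of the news article
    label: int  # integer class label


# Assumed index order; not confirmed by the dataset card.
LABEL_NAMES = ["non-clickbait", "clickbait"]


def label_name(record: AnnotatedHeadline) -> str:
    """Map the integer label of a record to its (assumed) class name."""
    return LABEL_NAMES[record["label"]]


example: AnnotatedHeadline = {
    "id": "100",
    "label": 1,
    "title": "SAH! Ini Daftar Nama Menteri Kabinet Jokowi - Ma'ruf Amin",
}
print(label_name(example))
```

A record for the raw part would carry the additional fields (source, date, category, sub-category, content, url) in the same way.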

Data Splits

The dataset contains only a train split.
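Because no validation or test split is provided, users may want to hold out part of the data themselves. The following is a minimal sketch of a label-stratified hold-out using only the standard library. The function name `stratified_split`, the 20% test fraction, and the seed are illustrative assumptions, not a split recommended by the dataset authors.

```python
import random


def stratified_split(rows, test_fraction=0.2, seed=0):
    """Split rows into (train, test), keeping label proportions roughly equal."""
    rng = random.Random(seed)

    # Group rows by their label.
    by_label = {}
    for row in rows:
        by_label.setdefault(row["label"], []).append(row)

    train, test = [], []
    for label in sorted(by_label):
        group = list(by_label[label])
        rng.shuffle(group)
        cut = int(round(len(group) * test_fraction))
        test.extend(group[:cut])
        train.extend(group[cut:])
    return train, test


# Toy rows standing in for annotated records.
rows = [{"id": str(i), "label": i % 2, "title": f"headline {i}"} for i in range(20)]
train, test = stratified_split(rows)
print(len(train), len(test))
```

Fixing the seed keeps the split reproducible across runs, which matters when results are compared between experiments.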

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

Creative Commons Attribution 4.0 International license

Citation Information

@article{WILLIAM2020106231,
title = "CLICK-ID: A novel dataset for Indonesian clickbait headlines",
journal = "Data in Brief",
volume = "32",
pages = "106231",
year = "2020",
issn = "2352-3409",
doi = "https://doi.org/10.1016/j.dib.2020.106231",
url = "http://www.sciencedirect.com/science/article/pii/S2352340920311252",
author = "Andika William and Yunita Sari",
keywords = "Indonesian, Natural Language Processing, News articles, Clickbait, Text-classification",
abstract = "News analysis is a popular task in Natural Language Processing (NLP). In particular, the problem of clickbait in news analysis has gained attention in recent years [1, 2]. However, the majority of the tasks has been focused on English news, in which there is already a rich representative resource. For other languages, such as Indonesian, there is still a lack of resource for clickbait tasks. Therefore, we introduce the CLICK-ID dataset of Indonesian news headlines extracted from 12 Indonesian online news publishers. It is comprised of 15,000 annotated headlines with clickbait and non-clickbait labels. Using the CLICK-ID dataset, we then developed an Indonesian clickbait classification model achieving favourable performance. We believe that this corpus will be useful for replicable experiments in clickbait detection or other experiments in NLP areas."
}

Contributions

Thanks to @cahya-wirawan for adding this dataset.