Dataset:

fake_news_filipino

Sub-tasks:

fact-checking

Languages:

tl

Multilinguality:

monolingual

Size:

1K<n<10K

Language Creators:

crowdsourced

Annotations Creators:

expert-generated

Source Datasets:

original

Dataset Card for Fake News Filipino

Dataset Summary

A low-resource fake news detection corpus in Filipino, the first of its kind. It contains 3,206 expertly labeled news samples, half of which are real and half of which are fake.
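The even split stated above implies 1,603 samples per class. A minimal sketch of how one might verify class balance on a list of labels follows; the `class_balance` helper and the synthetic label list are illustrative assumptions, not part of the dataset's tooling, and the integer class ids are placeholders only.

```python
from collections import Counter


def class_balance(labels):
    """Return the fraction of samples belonging to each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Synthetic labels that mirror the stated size: 3,206 samples, split evenly.
# The class ids 0 and 1 are placeholders; the card does not define them here.
labels = [0] * 1603 + [1] * 1603

balance = class_balance(labels)
print(len(labels), balance)
```

Running the same check on the real labels would be a quick way to confirm that a download or a custom split has not skewed the classes.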

Supported Tasks and Leaderboards

[More Information Needed]

Languages

The dataset is primarily in Filipino, with the addition of some English words commonly used in Filipino vernacular.

Dataset Structure

Data Instances

Sample data:

{
  "label": "0",
  "article": "Sa 8-pahinang desisyon, pinaboran ng Sandiganbayan First Division ang petition for Writ of Preliminary Attachment/Garnishment na inihain ng prosekusyon laban sa mambabatas."
}
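As a rough sketch, a record in the shape shown above can be read with Python's standard `json` module. The `parse_record` helper is hypothetical and assumes only the two fields visible in the sample, with `label` stored as a string holding an integer class id. This card does not state which label value denotes fake news, so the code leaves the label uninterpreted.

```python
import json

# The sample record from above, stored as a JSON string.
SAMPLE = """
{
  "label": "0",
  "article": "Sa 8-pahinang desisyon, pinaboran ng Sandiganbayan First Division ang petition for Writ of Preliminary Attachment/Garnishment na inihain ng prosekusyon laban sa mambabatas."
}
"""


def parse_record(text):
    """Parse one JSON record and return (label, article)."""
    record = json.loads(text)
    # Assumption: the label is a string containing an integer class id.
    label = int(record["label"])
    article = record["article"]
    return label, article


label, article = parse_record(SAMPLE)
print(label, article[:40])
```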

Data Fields

[More Information Needed]

Data Splits

[More Information Needed]

Dataset Creation

Fake news articles were sourced from online sites that were tagged as fake news sites by the non-profit independent media fact-checking organization Verafiles and the National Union of Journalists of the Philippines (NUJP). Real news articles were sourced from mainstream news websites in the Philippines, including Pilipino Star Ngayon, Abante, and Bandera.

Curation Rationale

We remedy the lack of a proper, curated benchmark dataset for fake news detection in Filipino by constructing what we call “Fake News Filipino.”

Source Data

Initial Data Collection and Normalization

We construct the dataset by scraping our source websites and encoding all characters into UTF-8. Preprocessing is light to keep information intact: we retain capitalization and punctuation, and do not correct any misspelled words.
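The light preprocessing described above might look something like the following sketch. Both function names are hypothetical, the `latin-1` source encoding is only an example, and whitespace collapsing is an assumption of this sketch rather than a step the authors document. Capitalization, punctuation, and spelling are deliberately left untouched.

```python
def to_utf8(raw, source_encoding="latin-1"):
    """Decode scraped bytes from an assumed source encoding and re-encode as UTF-8."""
    text = raw.decode(source_encoding, errors="replace")
    return text.encode("utf-8")


def light_clean(text):
    """Collapse runs of whitespace; keep case, punctuation, and spelling as-is."""
    # Assumption: whitespace normalization is the only change applied here.
    return " ".join(text.split())


scraped = "  Sa  8-pahinang\n desisyon, PINABORAN! ".encode("latin-1")
cleaned = light_clean(to_utf8(scraped).decode("utf-8"))
print(cleaned)
```

Keeping the cleaning this minimal preserves surface cues such as unusual capitalization or misspellings, which may themselves be informative for a fake news classifier.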

Who are the source language producers?

Jan Christian Blaise Cruz, Julianne Agatha Tan, and Charibeth Cheng

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

Jan Christian Blaise Cruz, Julianne Agatha Tan, and Charibeth Cheng

Licensing Information

[More Information Needed]

Citation Information

@inproceedings{cruz2020localization,
  title={Localization of Fake News Detection via Multitask Transfer Learning},
  author={Cruz, Jan Christian Blaise and Tan, Julianne Agatha and Cheng, Charibeth},
  booktitle={Proceedings of The 12th Language Resources and Evaluation Conference},
  pages={2596--2604},
  year={2020}
}

Contributions

Thanks to @anaerobeth for adding this dataset.