Dataset:

hyperpartisan_news_detection

Tasks:

text classification

Languages:

Multilinguality:

monolingual

Size:

1M<n<10M

Language creators:

found

Annotation creators:

crowdsourced, expert-generated

Source datasets:

original

Other:

bias-classification

License:

cc-by-4.0

Dataset Card for "hyperpartisan_news_detection"

Dataset Summary

Hyperpartisan News Detection is a dataset created for PAN @ SemEval 2019 Task 4. The task: given the text of a news article, decide whether it follows a hyperpartisan argumentation, i.e., whether it exhibits blind, prejudiced, or unreasoning allegiance to one party, faction, cause, or person.

The dataset has two parts:

byarticle: Labeled through crowdsourcing on an article basis. The data contains only articles for which a consensus among the crowdsourcing workers existed.
bypublisher: Labeled by the overall bias of the publisher as provided by BuzzFeed journalists or MediaBiasFactCheck.com.
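
The two parts would typically be loaded as separate configurations. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and that the dataset is still published on the Hub under the identifier `hyperpartisan_news_detection`. The helper `load_part` and the `PARTS` table are hypothetical names introduced here for illustration; they are not part of the dataset.

```python
# Splits available in each part, as listed in this card.
PARTS = {
    "byarticle": ("train",),
    "bypublisher": ("train", "validation"),
}


def load_part(part, split="train"):
    """Load one split of one part (hypothetical helper)."""
    if part not in PARTS:
        raise ValueError(f"unknown part {part!r}; expected one of {sorted(PARTS)}")
    if split not in PARTS[part]:
        raise ValueError(f"part {part!r} has no split {split!r}")
    # Imported lazily so the argument checks work without the library installed.
    from datasets import load_dataset
    # Assumption: the Hub identifier matches the name of this card.
    return load_dataset("hyperpartisan_news_detection", part, split=split)
```

Under these assumptions, `load_part("byarticle")` would fetch the small crowdsourced part, and `load_part("bypublisher", "validation")` would fetch the validation split of the publisher-labeled part. Recent versions of the library may need extra options for script-based datasets, so treat the call as a starting point.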

Supported Tasks and Leaderboards

More Information Needed

Languages

More Information Needed

Dataset Structure

Data Instances

byarticle

Size of downloaded dataset files: 1.00 MB
Size of the generated dataset: 2.80 MB
Total amount of disk used: 3.81 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "hyperpartisan": true,
    "published_at": "2020-01-01",
    "text": "\"<p>This is a sample article which will contain lots of text</p>\\n    \\n<p>Lorem ipsum dolor sit amet, consectetur adipiscing el...",
    "title": "Example article 1",
    "url": "http://www.example.com/example1"
}
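
The `text` field in the example carries HTML markup such as `<p>` tags. A minimal sketch of reducing it to plain text with the Python standard library follows. The record is a shortened, made-up instance in the shape of the example above, and `plain_text` is a hypothetical helper, not something the dataset provides.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the character data of an HTML fragment, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def plain_text(html_fragment):
    """Return the fragment's text with tags removed and whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html_fragment)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.chunks).split())


# Shortened, made-up record in the shape of the example above.
record = {
    "hyperpartisan": True,
    "published_at": "2020-01-01",
    "text": "<p>This is a sample article</p>\n    \n<p>Lorem ipsum dolor sit amet</p>",
    "title": "Example article 1",
    "url": "http://www.example.com/example1",
}

print(plain_text(record["text"]))
```

Collapsing whitespace also discards paragraph boundaries; a pipeline that needs them could join the chunks per `<p>` element instead.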

bypublisher

Size of downloaded dataset files: 1.00 GB
Size of the generated dataset: 5.61 GB
Total amount of disk used: 6.61 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "bias": 3,
    "hyperpartisan": false,
    "published_at": "2020-01-01",
    "text": "\"<p>This is a sample article which will contain lots of text</p>\\n    \\n<p>Phasellus bibendum porta nunc, id venenatis tortor fi...",
    "title": "Example article 4",
    "url": "https://example.com/example4"
}

Data Fields

The data fields are the same across all splits.

byarticle

text: a string feature.
title: a string feature.
hyperpartisan: a bool feature.
url: a string feature.
published_at: a string feature.

bypublisher

text: a string feature.
title: a string feature.
hyperpartisan: a bool feature.
url: a string feature.
published_at: a string feature.
bias: a classification label, with possible values including right (0), right-center (1), least (2), left-center (3), left (4).
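
Because `bias` is stored as an integer, records usually need decoding before they are readable. The sketch below hard-codes the index order listed for the field; `BIAS_NAMES`, `bias_name`, and `bias_id` are hypothetical names introduced for illustration. If the data is loaded through a library that exposes class-label metadata, that metadata is the more reliable source.

```python
# Label names in index order, as listed for the `bias` field in this card.
BIAS_NAMES = ("right", "right-center", "least", "left-center", "left")


def bias_name(label):
    """Map an integer class label to its name."""
    if not 0 <= label < len(BIAS_NAMES):
        raise ValueError(f"bias label out of range: {label}")
    return BIAS_NAMES[label]


def bias_id(name):
    """Map a label name back to its integer class label."""
    return BIAS_NAMES.index(name)


print(bias_name(3))  # left-center
```

For the bypublisher example record, which has `"bias": 3`, this decodes to `left-center`.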

Data Splits

byarticle

name	train
byarticle	645

bypublisher

name	train	validation
bypublisher	600000	150000
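
As a quick check on the proportions, the split sizes above can be totaled per part. This is a small sketch using only the counts listed in this card; `SPLIT_SIZES` is a hypothetical name.

```python
# Article counts per part and split, copied from the tables above.
SPLIT_SIZES = {
    "byarticle": {"train": 645},
    "bypublisher": {"train": 600_000, "validation": 150_000},
}

for part, splits in SPLIT_SIZES.items():
    total = sum(splits.values())
    for split, count in splits.items():
        share = count / total
        print(f"{part}/{split}: {count} articles ({share:.0%} of {total})")
```

The bypublisher part therefore holds 750,000 listed articles in an 80/20 train/validation ratio, and is roughly three orders of magnitude larger than byarticle.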

Dataset Creation

Curation Rationale

More Information Needed

Source Data

Initial Data Collection and Normalization

More Information Needed

Who are the source language producers?

More Information Needed

Annotations

Annotation process

More Information Needed

Who are the annotators?

More Information Needed

Personal and Sensitive Information

More Information Needed

Considerations for Using the Data

Additional Information

Dataset Curators

More Information Needed

Licensing Information

The collection (including labels) is licensed under a Creative Commons Attribution 4.0 International License.

Citation Information

@inproceedings{kiesel-etal-2019-semeval,
    title = "{S}em{E}val-2019 Task 4: Hyperpartisan News Detection",
    author = "Kiesel, Johannes  and
      Mestre, Maria  and
      Shukla, Rishabh  and
      Vincent, Emmanuel  and
      Adineh, Payam  and
      Corney, David  and
      Stein, Benno  and
      Potthast, Martin",
    booktitle = "Proceedings of the 13th International Workshop on Semantic Evaluation",
    month = jun,
    year = "2019",
    address = "Minneapolis, Minnesota, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/S19-2145",
    doi = "10.18653/v1/S19-2145",
    pages = "829--839",
    abstract = "Hyperpartisan news is news that takes an extreme left-wing or right-wing standpoint. If one is able to reliably compute this meta information, news articles may be automatically tagged, this way encouraging or discouraging readers to consume the text. It is an open question how successfully hyperpartisan news detection can be automated, and the goal of this SemEval task was to shed light on the state of the art. We developed new resources for this purpose, including a manually labeled dataset with 1,273 articles, and a second dataset with 754,000 articles, labeled via distant supervision. The interest of the research community in our task exceeded all our expectations: The datasets were downloaded about 1,000 times, 322 teams registered, of which 184 configured a virtual machine on our shared task cloud service TIRA, of which in turn 42 teams submitted a valid run. The best team achieved an accuracy of 0.822 on a balanced sample (yes : no hyperpartisan) drawn from the manually tagged corpus; an ensemble of the submitted systems increased the accuracy by 0.048.",
}

Contributions

Thanks to @thomwolf and @ghomasHudson for adding this dataset.

Author:

Anonymous

Dataset size:

1.25 GB
