数据集:

hyperpartisan_news_detection

语言:

en

计算机处理:

monolingual

大小:

1M<n<10M

语言创建人:

found

源数据集:

original

许可:

cc-by-4.0
中文

Dataset Card for "hyperpartisan_news_detection"

Dataset Summary

Hyperpartisan News Detection was a dataset created for PAN @ SemEval 2019 Task 4. Given a news article text, decide whether it follows a hyperpartisan argumentation, i.e., whether it exhibits blind, prejudiced, or unreasoning allegiance to one party, faction, cause, or person.

There are 2 parts:

  • byarticle: Labeled through crowdsourcing on an article basis. The data contains only articles for which a consensus among the crowdsourcing workers existed.
  • bypublisher: Labeled by the overall bias of the publisher as provided by BuzzFeed journalists or MediaBiasFactCheck.com.

Supported Tasks and Leaderboards

More Information Needed

Languages

More Information Needed

Dataset Structure

Data Instances

byarticle
  • Size of downloaded dataset files: 1.00 MB
  • Size of the generated dataset: 2.80 MB
  • Total amount of disk used: 3.81 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "hyperpartisan": true,
    "published_at": "2020-01-01",
    "text": "\"<p>This is a sample article which will contain lots of text</p>\\n    \\n<p>Lorem ipsum dolor sit amet, consectetur adipiscing el...",
    "title": "Example article 1",
    "url": "http://www.example.com/example1"
}
bypublisher
  • Size of downloaded dataset files: 1.00 GB
  • Size of the generated dataset: 5.61 GB
  • Total amount of disk used: 6.61 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "bias": 3,
    "hyperpartisan": false,
    "published_at": "2020-01-01",
    "text": "\"<p>This is a sample article which will contain lots of text</p>\\n    \\n<p>Phasellus bibendum porta nunc, id venenatis tortor fi...",
    "title": "Example article 4",
    "url": "https://example.com/example4"
}

Data Fields

The data fields are the same among all splits.

byarticle
  • text : a string feature.
  • title : a string feature.
  • hyperpartisan : a bool feature.
  • url : a string feature.
  • published_at : a string feature.
bypublisher
  • text : a string feature.
  • title : a string feature.
  • hyperpartisan : a bool feature.
  • url : a string feature.
  • published_at : a string feature.
  • bias : a classification label, with possible values including right (0), right-center (1), least (2), left-center (3), left (4).

Data Splits

byarticle
train
byarticle 645
bypublisher
train validation
bypublisher 600000 150000

Dataset Creation

Curation Rationale

More Information Needed

Source Data

Initial Data Collection and Normalization

More Information Needed

Who are the source language producers?

More Information Needed

Annotations

Annotation process

More Information Needed

Who are the annotators?

More Information Needed

Personal and Sensitive Information

More Information Needed

Considerations for Using the Data

Social Impact of Dataset

More Information Needed

Discussion of Biases

More Information Needed

Other Known Limitations

More Information Needed

Additional Information

Dataset Curators

More Information Needed

Licensing Information

The collection (including labels) are licensed under a Creative Commons Attribution 4.0 International License .

Citation Information

@inproceedings{kiesel-etal-2019-semeval,
    title = "{S}em{E}val-2019 Task 4: Hyperpartisan News Detection",
    author = "Kiesel, Johannes  and
      Mestre, Maria  and
      Shukla, Rishabh  and
      Vincent, Emmanuel  and
      Adineh, Payam  and
      Corney, David  and
      Stein, Benno  and
      Potthast, Martin",
    booktitle = "Proceedings of the 13th International Workshop on Semantic Evaluation",
    month = jun,
    year = "2019",
    address = "Minneapolis, Minnesota, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/S19-2145",
    doi = "10.18653/v1/S19-2145",
    pages = "829--839",
    abstract = "Hyperpartisan news is news that takes an extreme left-wing or right-wing standpoint. If one is able to reliably compute this meta information, news articles may be automatically tagged, this way encouraging or discouraging readers to consume the text. It is an open question how successfully hyperpartisan news detection can be automated, and the goal of this SemEval task was to shed light on the state of the art. We developed new resources for this purpose, including a manually labeled dataset with 1,273 articles, and a second dataset with 754,000 articles, labeled via distant supervision. The interest of the research community in our task exceeded all our expectations: The datasets were downloaded about 1,000 times, 322 teams registered, of which 184 configured a virtual machine on our shared task cloud service TIRA, of which in turn 42 teams submitted a valid run. The best team achieved an accuracy of 0.822 on a balanced sample (yes : no hyperpartisan) drawn from the manually tagged corpus; an ensemble of the submitted systems increased the accuracy by 0.048.",
}

Contributions

Thanks to @thomwolf , @ghomasHudson for adding this dataset.