Hyperpartisan News Detection is a dataset created for PAN @ SemEval 2019 Task 4. Given the text of a news article, the task is to decide whether it follows hyperpartisan argumentation, i.e., whether it exhibits blind, prejudiced, or unreasoning allegiance to one party, faction, cause, or person.
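There are 2 parts: byarticle, labeled manually per article, and bypublisher, labeled per publisher via distant supervision.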
byarticle

An example of 'train' looks as follows.

This example was too long and was cropped:
{
  "hyperpartisan": true,
  "published_at": "2020-01-01",
  "text": "\"<p>This is a sample article which will contain lots of text</p>\\n \\n<p>Lorem ipsum dolor sit amet, consectetur adipiscing el...",
  "title": "Example article 1",
  "url": "http://www.example.com/example1"
}

bypublisher

An example of 'train' looks as follows.

This example was too long and was cropped:
{
  "bias": 3,
  "hyperpartisan": false,
  "published_at": "2020-01-01",
  "text": "\"<p>This is a sample article which will contain lots of text</p>\\n \\n<p>Phasellus bibendum porta nunc, id venenatis tortor fi...",
  "title": "Example article 4",
  "url": "https://example.com/example4"
}
The data fields are the same among all splits.
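Because the `text` field in the examples above carries HTML markup such as `<p>` tags, a consumer will often want plain text before further processing. The following is a minimal sketch using only the Python standard library. The `Article` type and the `strip_html` helper are hypothetical names introduced for illustration; they are not part of the dataset or of any loader, and the sample record is invented rather than taken from the data.

```python
from html.parser import HTMLParser
from typing import List, TypedDict


class Article(TypedDict, total=False):
    """Hypothetical record type mirroring the fields in the cropped examples."""

    text: str
    title: str
    hyperpartisan: bool
    url: str
    published_at: str
    bias: int  # shown only in the bypublisher example above


class _TextExtractor(HTMLParser):
    """Collects the character data found between HTML tags."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def strip_html(markup: str) -> str:
    """Return the visible text of `markup` with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


# Invented sample record, shaped like the examples above.
sample: Article = {
    "text": "<p>This is a sample article</p>\n \n<p>with two paragraphs</p>",
    "title": "Example article",
    "hyperpartisan": True,
    "url": "http://www.example.com/example",
    "published_at": "2020-01-01",
}

plain_text = strip_html(sample["text"])
print(plain_text)
```

Joining paragraphs with a single space loses paragraph boundaries; a pipeline that needs them could join the chunks with newlines instead.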
| | train |
|---|---|
| byarticle | 645 |

| | train | validation |
|---|---|---|
| bypublisher | 600000 | 150000 |
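As a quick sanity check on the split sizes listed above, the share of each split within a part can be computed directly. This is only an illustrative sketch: the `SPLIT_SIZES` dictionary and the `split_shares` helper are hypothetical names, and the counts are copied from the tables rather than read from the data.

```python
from typing import Dict

# Split sizes copied from the tables above.
SPLIT_SIZES: Dict[str, Dict[str, int]] = {
    "byarticle": {"train": 645},
    "bypublisher": {"train": 600_000, "validation": 150_000},
}


def split_shares(sizes: Dict[str, int]) -> Dict[str, float]:
    """Return each split's fraction of the total number of examples."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


for part, sizes in SPLIT_SIZES.items():
    print(part, sum(sizes.values()), split_shares(sizes))
```

For bypublisher this gives 750000 examples in total, with 80% in train and 20% in validation.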
The collection (including labels) is licensed under a Creative Commons Attribution 4.0 International License.
@inproceedings{kiesel-etal-2019-semeval,
    title = "{S}em{E}val-2019 Task 4: Hyperpartisan News Detection",
    author = "Kiesel, Johannes and Mestre, Maria and Shukla, Rishabh and Vincent, Emmanuel and Adineh, Payam and Corney, David and Stein, Benno and Potthast, Martin",
    booktitle = "Proceedings of the 13th International Workshop on Semantic Evaluation",
    month = jun,
    year = "2019",
    address = "Minneapolis, Minnesota, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/S19-2145",
    doi = "10.18653/v1/S19-2145",
    pages = "829--839",
    abstract = "Hyperpartisan news is news that takes an extreme left-wing or right-wing standpoint. If one is able to reliably compute this meta information, news articles may be automatically tagged, this way encouraging or discouraging readers to consume the text. It is an open question how successfully hyperpartisan news detection can be automated, and the goal of this SemEval task was to shed light on the state of the art. We developed new resources for this purpose, including a manually labeled dataset with 1,273 articles, and a second dataset with 754,000 articles, labeled via distant supervision. The interest of the research community in our task exceeded all our expectations: The datasets were downloaded about 1,000 times, 322 teams registered, of which 184 configured a virtual machine on our shared task cloud service TIRA, of which in turn 42 teams submitted a valid run. The best team achieved an accuracy of 0.822 on a balanced sample (yes : no hyperpartisan) drawn from the manually tagged corpus; an ensemble of the submitted systems increased the accuracy by 0.048.",
}
Thanks to @thomwolf and @ghomasHudson for adding this dataset.