RoBERTa-large-fake-news-detection-spanish

This model is a fine-tuned version of PlanTL-GOB-ES/roberta-large-bne on a Spanish Fake News Dataset.

It achieves the following results on the evaluation set:

  • Loss: 1.7474
  • F1: 0.7717
  • Accuracy: 0.7797

Based on the leaderboard, our model outperforms the best model listed there (F1 = 0.7666).

Model description

RoBERTa-large-bne is a transformer-based masked language model for the Spanish language. It is based on the RoBERTa large model and has been pre-trained using the largest Spanish corpus known to date, with a total of 570GB of clean and deduplicated text processed for this work, compiled from the web crawls performed by the National Library of Spain (Biblioteca Nacional de España) from 2009 to 2019.

Intended uses & limitations

The objective of this task is to decide if a news item is fake or real by analyzing its textual representation.

Training and evaluation data

FakeDeS: Fake News Detection in Spanish Shared Task

Fake news provides information that aims to manipulate people for different purposes: terrorism, political elections, advertisement, satire, among others. In social networks, misinformation spreads in seconds among thousands of people, so it is necessary to develop tools that help control the amount of false information on the web. Similar tasks are the detection of popularity in social networks and the detection of subjectivity of messages in these media. A fake news detection system aims to help users detect and filter out potentially deceptive news. The prediction of intentionally misleading news is based on the analysis of previously reviewed truthful and fraudulent news, i.e., annotated corpora.

The Spanish Fake News Corpus is a collection of news compiled from several web sources: established newspaper websites, media company websites, special websites dedicated to validating fake news, and websites designated by different journalists as sites that regularly publish fake news. The news items were collected from January to July 2018, and all of them were written in Mexican Spanish.

The corpus has 971 news items collected from January to July 2018 from the following sources:

  • Established newspaper websites,
  • Media company websites,
  • Special websites dedicated to validating fake news,
  • Websites designated by different journalists as sites that regularly publish fake news.

The corpus was tagged considering only two classes (true or fake), following a manual labeling process:

  • A news item is true if there is evidence that it has been published on reliable sites.
  • A news item is fake if news from reliable sites, or from websites specialized in detecting deceptive content, contradicts it, or if no other evidence about it was found besides the source.

We collected the true-fake news pair of an event, so there is a correlation of news in the corpus. In order to avoid topic bias, the corpus covers news from 9 different topics: Science, Sport, Economy, Education, Entertainment, Politics, Health, Security, and Society. As can be seen in the table below, the number of fake and true news items is quite balanced. Approximately 70% is used as the training corpus (676 news items) and 30% as the testing corpus (295 news items).
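The split sizes quoted above can be checked with a few lines of arithmetic. This is only a sanity check on the numbers stated in this card; the variable names are illustrative and not part of any released code.

```python
# Corpus sizes as stated in this card.
total_news = 971
train_news = 676
test_news = 295

# The two splits should account for the whole corpus.
assert train_news + test_news == total_news

train_fraction = train_news / total_news
test_fraction = test_news / total_news

# Prints the share of each split as a percentage with one decimal.
print(f"train: {train_fraction:.1%}, test: {test_fraction:.1%}")
```

The fractions come out at roughly 69.6% and 30.4%, which matches the "approximately 70% / 30%" description.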

The training corpus contains the following information:

  • Category: Fake/ True

  • Topic: Science/ Sport/ Economy/ Education/ Entertainment/ Politics/ Health/ Security/ Society

  • Headline: The title of the news.

  • Text: The complete text of the news.

  • Link: The URL where the news was published.
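The fields listed above could be represented as a small record type. The sketch below is hypothetical: the names `NewsItem` and `build_model_input` are not part of the corpus or of any library, and the example row is invented. Joining the headline and text with " [SEP] " mirrors the usage snippet in this card; whether the training inputs were built the same way is an assumption.

```python
from dataclasses import dataclass


@dataclass
class NewsItem:
    """One row of the training corpus, mirroring the fields listed above."""

    category: str  # "Fake" or "True"
    topic: str     # one of the nine topics, e.g. "Science"
    headline: str  # title of the news item
    text: str      # complete text of the news item
    link: str      # URL where the news item was published


def build_model_input(item: NewsItem, separator: str = " [SEP] ") -> str:
    """Join headline and text the way the usage snippet in this card does."""
    return item.headline + separator + item.text


# Hypothetical example row, not taken from the corpus.
example = NewsItem(
    category="True",
    topic="Science",
    headline="Titular de ejemplo",
    text="Texto completo de la noticia...",
    link="https://example.com/noticia",
)
model_input = build_model_input(example)
print(model_input)
```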

Training procedure

TBA

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 2e-05
  • train_batch_size: 4
  • eval_batch_size: 4
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 10
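To make the "linear" scheduler setting concrete, here is a minimal pure-Python sketch of a linear learning-rate decay. It is an approximation for illustration, not the Transformers implementation. The dictionary keys are illustrative names, the total of 2430 steps is taken from the last row of the training results table, and a warmup of zero steps is an assumption, since the card does not list a warmup value.

```python
# Hyperparameters as listed in this card (key names are illustrative).
hyperparameters = {
    "learning_rate": 2e-05,
    "train_batch_size": 4,
    "eval_batch_size": 4,
    "seed": 42,
    "adam_betas": (0.9, 0.999),
    "adam_epsilon": 1e-08,
    "lr_scheduler_type": "linear",
    "num_epochs": 10,
}

# Total optimizer steps, from the last row of the training results table.
TOTAL_STEPS = 2430


def linear_lr(step: int, base_lr: float, total_steps: int, warmup_steps: int = 0) -> float:
    """Linear warmup followed by linear decay to zero.

    warmup_steps=0 is an assumption; the card does not state a warmup value.
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Learning rate at the start, the midpoint, and the end of training.
for step in (0, 1215, 2430):
    print(step, linear_lr(step, hyperparameters["learning_rate"], TOTAL_STEPS))
```

Under these assumptions the rate starts at about 2e-05, is halved by step 1215 (end of epoch 5), and reaches zero at step 2430.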

Training results

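| Training Loss | Epoch | Step | Validation Loss | F1     | Accuracy |
|---------------|-------|------|-----------------|--------|----------|
| No log        | 1.0   | 243  | 0.6282          | 0.7513 | 0.75     |
| No log        | 2.0   | 486  | 0.9600          | 0.7346 | 0.7587   |
| 0.5099        | 3.0   | 729  | 1.2128          | 0.7656 | 0.7570   |
| 0.5099        | 4.0   | 972  | 1.4001          | 0.7606 | 0.7622   |
| 0.1949        | 5.0   | 1215 | 1.9748          | 0.6475 | 0.7220   |
| 0.1949        | 6.0   | 1458 | 1.7386          | 0.7706 | 0.7710   |
| 0.0263        | 7.0   | 1701 | 1.7474          | 0.7717 | 0.7797   |
| 0.0263        | 8.0   | 1944 | 1.8114          | 0.7695 | 0.7780   |
| 0.0046        | 9.0   | 2187 | 1.8444          | 0.7709 | 0.7797   |
| 0.0046        | 10.0  | 2430 | 1.8552          | 0.7709 | 0.7797   |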
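The per-epoch numbers can be compared programmatically. The sketch below copies the rows from the results above and picks the best epoch by F1 and by validation loss. The card does not say how the final checkpoint was chosen, so this is only an illustration; the epoch with the highest F1 happens to match the evaluation results reported at the top of this card.

```python
# (epoch, step, validation_loss, f1, accuracy), copied from the training results.
results = [
    (1, 243, 0.6282, 0.7513, 0.75),
    (2, 486, 0.9600, 0.7346, 0.7587),
    (3, 729, 1.2128, 0.7656, 0.7570),
    (4, 972, 1.4001, 0.7606, 0.7622),
    (5, 1215, 1.9748, 0.6475, 0.7220),
    (6, 1458, 1.7386, 0.7706, 0.7710),
    (7, 1701, 1.7474, 0.7717, 0.7797),
    (8, 1944, 1.8114, 0.7695, 0.7780),
    (9, 2187, 1.8444, 0.7709, 0.7797),
    (10, 2430, 1.8552, 0.7709, 0.7797),
]

# Highest F1 and lowest validation loss can point to different epochs.
best_by_f1 = max(results, key=lambda row: row[3])
best_by_loss = min(results, key=lambda row: row[2])

print("best F1:", best_by_f1)
print("lowest validation loss:", best_by_loss)
```

Here the highest F1 falls on epoch 7, while the lowest validation loss falls on epoch 1, so the two criteria would select different checkpoints.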

Fast usage with HF pipelines

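from transformers import pipeline

# Model checkpoint on the Hugging Face Hub
ckpt = "Narrativaai/fake-news-detection-spanish"

# Load a text-classification pipeline backed by the fine-tuned model
classifier = pipeline("text-classification", model=ckpt)

headline = "Your headline"
text = "Your article text here..."

# The headline and the article text are joined with a literal " [SEP] " string
classifier(headline + " [SEP] " + text)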

Framework versions

  • Transformers 4.11.3
  • Pytorch 1.9.0+cu111
  • Datasets 1.14.0
  • Tokenizers 0.10.3

Created by: Narrativa

About Narrativa: Natural Language Generation (NLG) | Gabriele, our machine learning-based platform, builds and deploys natural language solutions. #NLG #AI