Model:

daveni/twitter-xlm-roberta-emotion-es

Note: this model and model card are based on finetuned XLM-T for Sentiment Analysis.

Twitter-XLM-roBERTa-base for Emotion Analysis

This is an XLM-roBERTa-base model trained on ~198M tweets and finetuned for emotion analysis in Spanish. The model was presented at the EmoEvalEs competition, part of the IberLEF 2021 Conference, where the proposed task was the classification of Spanish tweets into seven different classes: anger, disgust, fear, joy, sadness, surprise, and others. We achieved first place in the competition with a macro-averaged F1 score of 71.70%.

Example pipeline with a Tweet from @JaSantaolalla:

from transformers import pipeline
model_path = "daveni/twitter-xlm-roberta-emotion-es"
emotion_analysis = pipeline("text-classification", framework="pt", model=model_path, tokenizer=model_path)
emotion_analysis("Einstein dijo: Solo hay dos cosas infinitas, el universo y los pinches anuncios de bitcoin en Twitter. Paren ya carajo aaaaaaghhgggghhh me quiero murir")
[{'label': 'anger', 'score': 0.48307016491889954}]

Full classification example

from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer, AutoConfig
import numpy as np
from scipy.special import softmax
# Preprocess text (username and link placeholders)
def preprocess(text):
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)
model_path = "daveni/twitter-xlm-roberta-emotion-es"
tokenizer = AutoTokenizer.from_pretrained(model_path)
config = AutoConfig.from_pretrained(model_path)
# PT
model = AutoModelForSequenceClassification.from_pretrained(model_path)
text = "Se ha quedao bonito día para publicar vídeo, ¿no? Hoy del tema más diferente que hemos tocado en el canal."
text = preprocess(text)
print(text)
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
scores = output[0][0].detach().numpy()
scores = softmax(scores)
# Print labels and scores
ranking = np.argsort(scores)
ranking = ranking[::-1]
for i in range(scores.shape[0]):
    l = config.id2label[ranking[i]]
    s = scores[ranking[i]]
    print(f"{i+1}) {l} {np.round(float(s), 4)}")
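
The placeholder preprocessing and the softmax ranking used above can be sanity-checked without downloading the model; the logits below are hypothetical values, not real model output:

```python
import numpy as np
from scipy.special import softmax

def preprocess(text):
    # Replace user mentions and links with placeholders, as in the example above
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

assert preprocess("hola @JaSantaolalla mira https://example.com") == "hola @user mira http"

# Hypothetical raw logits for the seven emotion classes
scores = softmax(np.array([2.0, 0.5, -1.0, 0.1, -0.5, 1.0, -2.0]))
ranking = np.argsort(scores)[::-1]  # indices from highest to lowest score
assert np.isclose(scores.sum(), 1.0)
assert ranking[0] == 0  # the largest logit ranks first
```

This mirrors the logic of the full example: logits are turned into probabilities with softmax, then sorted in descending order before being mapped to labels via config.id2label.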

Output:

Se ha quedao bonito día para publicar vídeo, ¿no? Hoy del tema más diferente que hemos tocado en el canal.
1) joy 0.7887
2) others 0.1679
3) surprise 0.0152
4) sadness 0.0145
5) anger 0.0077
6) disgust 0.0033
7) fear 0.0027
Limitations and bias
  • The dataset we used for finetuning was unbalanced: almost half of the records belonged to the others class, so there might be a bias towards this class.

Training data

Pretrained weights were left identical to the original model released by cardiffnlp. We used the EmoEvalEs Dataset for finetuning.

BibTeX entry and citation info

@inproceedings{vera2021gsi,
  title={GSI-UPM at IberLEF2021: Emotion Analysis of Spanish Tweets by Fine-tuning the XLM-RoBERTa Language Model},
  author={Vera, D and Araque, O and Iglesias, CA},
  booktitle={Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2021). CEUR Workshop Proceedings, CEUR-WS, M{\'a}laga, Spain},
  year={2021}
}