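Model:
daveni/twitter-xlm-roberta-emotion-es
Note: this model and model card are based on the finetuned XLM-T for Sentiment Analysis.
This is an XLM-roBERTa-base model trained on approximately 198M tweets and fine-tuned for emotion analysis in Spanish. The model was presented at the EmoEvalEs competition of the IberLEF 2021 Conference, where the task was to classify Spanish tweets into seven emotion classes: anger, disgust, fear, joy, sadness, surprise, and others. We took first place in the competition with a macro-averaged F1 score of 71.70%.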
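The macro-averaged F1 score mentioned above is the unweighted mean of the per-class F1 scores, so rare classes count as much as frequent ones. The following is a minimal sketch of that computation, not the competition's official scorer: the function name `macro_f1`, the toy labels, and the choice to average only over classes that appear in the gold or predicted labels are all assumptions made for illustration.

```python
from collections import Counter


def macro_f1(y_true, y_pred, labels=None):
    """Unweighted mean of per-class F1 scores (illustrative helper)."""
    if labels is None:
        # Assumption: average only over classes seen in gold or predicted labels.
        labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # predicted class gains a false positive
            fn[gold] += 1  # gold class gains a false negative
    f1_scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        f1_scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical toy data, not taken from the EmoEvalEs dataset.
y_true = ["joy", "joy", "anger", "others", "others", "sadness"]
y_pred = ["joy", "others", "anger", "others", "others", "joy"]
score = macro_f1(y_true, y_pred)
print(round(score, 4))
```

In this toy case the missed `sadness` example pulls the average down sharply even though it is a single tweet, which is why macro-averaging is a demanding metric on unbalanced data.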
from transformers import pipeline

model_path = "daveni/twitter-xlm-roberta-emotion-es"
emotion_analysis = pipeline("text-classification", framework="pt", model=model_path, tokenizer=model_path)
emotion_analysis("Einstein dijo: Solo hay dos cosas infinitas, el universo y los pinches anuncios de bitcoin en Twitter. Paren ya carajo aaaaaaghhgggghhh me quiero murir")
[{'label': 'anger', 'score': 0.48307016491889954}]
from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer, AutoConfig
import numpy as np
from scipy.special import softmax

# Preprocess text (username and link placeholders)
def preprocess(text):
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

model_path = "daveni/twitter-xlm-roberta-emotion-es"
tokenizer = AutoTokenizer.from_pretrained(model_path)
config = AutoConfig.from_pretrained(model_path)

# PT
model = AutoModelForSequenceClassification.from_pretrained(model_path)

text = "Se ha quedao bonito día para publicar vídeo, ¿no? Hoy del tema más diferente que hemos tocado en el canal."
text = preprocess(text)
print(text)
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
scores = output[0][0].detach().numpy()
scores = softmax(scores)

# Print labels and scores
ranking = np.argsort(scores)
ranking = ranking[::-1]
for i in range(scores.shape[0]):
    l = config.id2label[ranking[i]]
    s = scores[ranking[i]]
    print(f"{i+1}) {l} {np.round(float(s), 4)}")
Output:
Se ha quedao bonito día para publicar vídeo, ¿no? Hoy del tema más diferente que hemos tocado en el canal.
1) joy 0.7887
2) others 0.1679
3) surprise 0.0152
4) sadness 0.0145
5) anger 0.0077
6) disgust 0.0033
7) fear 0.0027
Limitations and bias
Pretrained weights were left identical to the original model released by cardiffnlp. We used the EmoEvalEs Dataset for fine-tuning.
@inproceedings{vera2021gsi,
  title={GSI-UPM at IberLEF2021: Emotion Analysis of Spanish Tweets by Fine-tuning the XLM-RoBERTa Language Model},
  author={Vera, D and Araque, O and Iglesias, CA},
  booktitle={Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2021). CEUR Workshop Proceedings, CEUR-WS, M{\'a}laga, Spain},
  year={2021}
}