Model: s-nlp/russian_toxicity_classifier
A BERT-based classifier (fine-tuned from Conversational RuBERT) trained on a merged dataset of toxic Russian-language comments collected from 2ch.hk and toxic Russian comments collected from ok.ru.
The datasets were merged, shuffled, and split into train, validation, and test sets in an 80-10-10 ratio. The metrics obtained on the test set are as follows:
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.98 | 0.99 | 0.98 | 21384 |
| 1 | 0.94 | 0.92 | 0.93 | 4886 |
| accuracy | | | 0.97 | 26270 |
| macro avg | 0.96 | 0.96 | 0.96 | 26270 |
| weighted avg | 0.97 | 0.97 | 0.97 | 26270 |
```python
from transformers import BertTokenizer, BertForSequenceClassification

# load tokenizer and model weights
tokenizer = BertTokenizer.from_pretrained('SkolkovoInstitute/russian_toxicity_classifier')
model = BertForSequenceClassification.from_pretrained('SkolkovoInstitute/russian_toxicity_classifier')

# prepare the input
batch = tokenizer.encode('ты супер', return_tensors='pt')

# inference
model(batch)
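The call `model(batch)` returns raw logits for the two classes (0 = non-toxic, 1 = toxic); to interpret them you would typically apply a softmax and take the argmax. As a minimal sketch of that post-processing step, the snippet below implements softmax in plain Python on hypothetical logit values (illustrative only, not actual model output), so it runs without downloading the model:

```python
import math

def softmax(logits):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits such as model(batch).logits[0] might yield for a
# clearly non-toxic input like 'ты супер' (values are made up for illustration).
logits = [3.2, -2.9]
probs = softmax(logits)
label = probs.index(max(probs))  # 0 = non-toxic, 1 = toxic
print(probs, label)
```

With real model output, the equivalent would be `torch.softmax(model(batch).logits, dim=-1)` followed by `argmax`.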
Licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.