Model:
s-nlp/russian_toxicity_classifier
BERT-based classifier (fine-tuned from Conversational RuBERT) trained on a merge of the Russian Language Toxic Comments dataset, collected from 2ch.hk, and the Toxic Russian Comments dataset, collected from ok.ru.
The datasets were merged, shuffled, and split into train, dev, and test sets in an 80-10-10 proportion. The metrics obtained on the test set are as follows:
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.98 | 0.99 | 0.98 | 21384 |
| 1 | 0.94 | 0.92 | 0.93 | 4886 |
| accuracy | | | 0.97 | 26270 |
| macro avg | 0.96 | 0.96 | 0.96 | 26270 |
| weighted avg | 0.97 | 0.97 | 0.97 | 26270 |
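
As a sanity check on the table, the macro and weighted averages can be recomputed from the two per-class rows. The sketch below uses only the rounded figures shown above, so it can match the reported averages only to within rounding. The variable and function names are illustrative and are not part of the model's code.

```python
# Per-class rows copied from the table above: (precision, recall, f1, support).
per_class = {
    0: (0.98, 0.99, 0.98, 21384),
    1: (0.94, 0.92, 0.93, 4886),
}

def macro_average(rows, column):
    """Unweighted mean of one metric column across classes."""
    values = [row[column] for row in rows.values()]
    return sum(values) / len(values)

def weighted_average(rows, column):
    """Mean of one metric column, weighted by each class's support."""
    total = sum(row[3] for row in rows.values())
    return sum(row[column] * row[3] for row in rows.values()) / total

for name, column in (("precision", 0), ("recall", 1), ("f1-score", 2)):
    print(f"{name}: macro={macro_average(per_class, column):.3f} "
          f"weighted={weighted_average(per_class, column):.3f}")
```

Because class 0 accounts for roughly 81% of the test examples (21384 of 26270), the weighted averages sit closer to the class 0 scores than the macro averages do.
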

    from transformers import BertTokenizer, BertForSequenceClassification

    # load tokenizer and model weights
    tokenizer = BertTokenizer.from_pretrained('SkolkovoInstitute/russian_toxicity_classifier')
    model = BertForSequenceClassification.from_pretrained('SkolkovoInstitute/russian_toxicity_classifier')

    # prepare the input
    batch = tokenizer.encode('ты супер', return_tensors='pt')

    # inference
    model(batch)

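
The call `model(batch)` returns raw logits rather than a label. A minimal sketch of turning them into probabilities follows. It assumes that index 1 is the toxic class (consistent with the smaller support of class 1 in the table, but not stated in this card) and that the logits have been extracted as a plain list, for example with `model(batch).logits[0].tolist()` in recent versions of `transformers`. The helper names and the 0.5 threshold are hypothetical.

```python
import math

def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    # Subtract the maximum for numerical stability before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toxic_probability(logits, toxic_index=1):
    """Probability of the class assumed to be 'toxic' (index 1 by assumption)."""
    return softmax(logits)[toxic_index]

def is_toxic(logits, threshold=0.5):
    """Apply a hypothetical decision threshold to the toxic probability."""
    return toxic_probability(logits) >= threshold

# Made-up logits for illustration only, not real model output.
example_logits = [2.0, -1.0]
print(toxic_probability(example_logits), is_toxic(example_logits))
```

The threshold is a design choice: lowering it trades precision for recall on the toxic class, so it is worth tuning on held-out data rather than fixing at 0.5.
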
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.