TurkuNLP/bert-large-finnish-cased-toxicity

Model:

TurkuNLP/bert-large-finnish-cased-toxicity

Task:

Text classification

Libraries:

PyTorch, Transformers

Datasets:

TurkuNLP/jigsaw_toxicity_pred_fi

Language:

Finnish

Other:

bert


bert-large-finnish-cased-v1 for toxicity detection

This is the bert-large-finnish-cased-v1 model, fine-tuned on the Finnish jigsaw_toxicity_pred_fi dataset. The model is trained to predict a probability for each of the six toxicity labels described in the dataset card.
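Because the task is multi-label, each label gets its own independent probability instead of all labels sharing one probability mass. The following is a minimal sketch of that difference, assuming a sigmoid is applied to each logit (as in the usage snippet on this page). The logit values are made up for illustration and are not real model output.

```python
import math

def sigmoid(x: float) -> float:
    """Map a single logit to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def softmax(xs):
    """Map logits to probabilities that compete and sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for six labels; illustrative values only.
logits = [2.0, -1.0, 1.5, -3.0, 0.5, -2.0]

multi_label_probs = [sigmoid(x) for x in logits]   # one score per label
single_label_probs = softmax(logits)               # scores share one budget

print(sum(multi_label_probs))   # not constrained to 1
print(sum(single_label_probs))  # 1, up to rounding
```

With sigmoid scoring, several labels can be likely at the same time, which fits comments that are, for example, both insulting and obscene.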

Overview

Language model: bert-large-finnish-cased-v1

Language: Finnish

Downstream-task: Multi-label toxicity detection (multi-label text classification)

Training data: jigsaw_toxicity_pred_fi

Eval data: jigsaw_toxicity_pred_fi

Citing

If you use this model, please cite us using the following BibTeX entry.

@inproceedings{eskelinen-etal-2023-toxicity,
    title = "Toxicity Detection in {F}innish Using Machine Translation",
    author = "Eskelinen, Anni  and
      Silvala, Laura  and
      Ginter, Filip  and
      Pyysalo, Sampo  and
      Laippala, Veronika",
    booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)",
    month = may,
    year = "2023",
    address = "T{\'o}rshavn, Faroe Islands",
    publisher = "University of Tartu Library",
    url = "https://aclanthology.org/2023.nodalida-1.68",
    pages = "685--697",
    abstract = "Due to the popularity of social media platforms and the sheer amount of user-generated content online, the automatic detection of toxic language has become crucial in the creation of a friendly and safe digital space. Previous work has been mostly focusing on English leaving many lower-resource languages behind. In this paper, we present novel resources for toxicity detection in Finnish by introducing two new datasets, a machine translated toxicity dataset for Finnish based on the widely used English Jigsaw dataset and a smaller test set of Suomi24 discussion forum comments originally written in Finnish and manually annotated following the definitions of the labels that were used to annotate the Jigsaw dataset. We show that machine translating the training data to Finnish provides better toxicity detection results than using the original English training data and zero-shot cross-lingual transfer with XLM-R, even with our newly annotated dataset from Suomi24.",
}

Usage

The model can be used through a Hugging Face pipeline:

import transformers

model = transformers.AutoModelForSequenceClassification.from_pretrained("TurkuNLP/bert-large-finnish-cased-toxicity")
tokenizer = transformers.AutoTokenizer.from_pretrained("TurkuNLP/bert-large-finnish-cased-v1")
# Sigmoid scores each label independently; top_k=None returns scores for all labels.
pipe = transformers.pipeline(task="text-classification", model=model, tokenizer=tokenizer, function_to_apply="sigmoid", top_k=None)
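With top_k=None the pipeline returns a score for every label, so a decision threshold still has to be applied. The sketch below shows one way to do that. It assumes the scores for one text arrive as a list of {"label": ..., "score": ...} dictionaries. The helper name, the 0.5 threshold, and the label names in the example are hypothetical and do not come from the model card.

```python
from typing import Any, Dict, List

def labels_above_threshold(scores: List[Dict[str, Any]], threshold: float = 0.5) -> List[str]:
    """Return the labels whose score meets the threshold, highest score first."""
    kept = [s for s in scores if s["score"] >= threshold]
    kept.sort(key=lambda s: s["score"], reverse=True)
    return [s["label"] for s in kept]

# Hypothetical scores for one text; label names and values are made up.
example_scores = [
    {"label": "label_a", "score": 0.91},
    {"label": "label_b", "score": 0.07},
    {"label": "label_c", "score": 0.64},
]

print(labels_above_threshold(example_scores))  # ['label_a', 'label_c']
```

With a real pipeline, the per-text score list it returns would take the place of example_scores. The threshold is a tunable choice: lowering it trades precision for recall.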

Hyperparameters

batch_size = 12
epochs = 10 (trained for 4)
base_LM_model = "bert-large-finnish-cased-v1"
max_seq_len = 512
learning_rate = 2e-5

Performance

F1-micro = 0.66
F1-macro = 0.57
Precision (micro) = 0.58
Recall (micro) = 0.76
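As a consistency check, the micro-averaged F1 can be recomputed from the reported micro precision and recall, since F1 is their harmonic mean. A minimal sketch; the function name is an illustrative choice:

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Micro-averaged values reported above.
f1_micro = f1_from_precision_recall(0.58, 0.76)
print(round(f1_micro, 2))  # 0.66
```

The macro F1 cannot be derived this way, because it averages the per-label F1 scores, which are not listed here.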

Author:

TurkuNLP Research Group

Dataset size:

1.32 GB