Model:

s-nlp/rubert-base-corruption-detector


This is a model for evaluating the naturalness of short Russian texts. It was trained to distinguish human-written texts from their corrupted versions.
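A minimal usage sketch with the transformers library. The label order (index 0 = natural, index 1 = corrupted) is an assumption here and should be verified against `model.config.id2label`:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "s-nlp/rubert-base-corruption-detector"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

texts = [
    "Рад познакомиться с вами.",     # natural sentence
    "Рад с познакомиться вами ва.",  # shuffled / corrupted version
]
with torch.no_grad():
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    probs = torch.softmax(model(**inputs).logits, dim=-1)

# Assumption: column 0 is the "natural" class; check model.config.id2label.
for text, p in zip(texts, probs):
    print(f"p(natural) = {p[0].item():.3f} | {text}")
```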

Corruption sources: random replacement, deletion, addition, shuffling, and re-inflection of words and characters; random changes of capitalization; round-trip translation; and filling of random gaps with T5 and RoBERTa models. For each original text we sampled three corrupted versions, so the model is uniformly biased towards the "unnatural" label.
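The exact corruption pipeline is not published here; the following is a hypothetical sketch of just two of the listed corruptions (word shuffling and random character deletion) to illustrate how such negative examples can be generated:

```python
import random

def shuffle_words(text: str, rng: random.Random) -> str:
    """Corrupt a text by randomly permuting its words."""
    words = text.split()
    rng.shuffle(words)
    return " ".join(words)

def delete_chars(text: str, rng: random.Random, p: float = 0.05) -> str:
    """Corrupt a text by independently dropping each character with probability p."""
    return "".join(ch for ch in text if rng.random() >= p)

rng = random.Random(0)
original = "Это пример естественного русского предложения."
print(shuffle_words(original, rng))
print(delete_chars(original, rng))
```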

Data sources: web corpora from the Leipzig collection (rus_news_2020_100K, rus_newscrawl-public_2018_100K, rus-ru_web-public_2019_100K, rus_wikipedia_2021_100K) and comments from OK and Pikabu.

On our private test dataset, the model achieved a 40% rank correlation with human judgements of naturalness, which is higher than the correlation achieved by GPT perplexity, another popular fluency metric.
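The card does not specify which rank correlation was used; assuming Spearman's rho, it can be computed between model scores and human ratings as below. The scores here are made up for illustration, since the test set is private:

```python
from scipy.stats import spearmanr

# Hypothetical model naturalness scores and human ratings for five texts.
model_scores = [0.91, 0.34, 0.78, 0.12, 0.66]
human_ratings = [5, 2, 4, 1, 3]

rho, p_value = spearmanr(model_scores, human_ratings)
print(f"Spearman rank correlation: {rho:.2f} (p = {p_value:.3f})")
```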