Small-E-Czech is an ELECTRA-small model pretrained on a Czech web corpus created at Seznam.cz and introduced in an IAAI 2022 paper. Like other pretrained models, it should be finetuned on a downstream task of interest before use. At Seznam.cz, it has helped improve web search ranking, query typo correction, and clickbait title detection. We release it under the CC BY 4.0 license (i.e. allowing commercial use). To raise an issue, please visit our GitHub.
```python
import torch
from transformers import ElectraForPreTraining, ElectraTokenizerFast

discriminator = ElectraForPreTraining.from_pretrained("Seznam/small-e-czech")
tokenizer = ElectraTokenizerFast.from_pretrained("Seznam/small-e-czech")

sentence = "Za hory, za doly, mé zlaté parohy"
fake_sentence = "Za hory, za doly, kočka zlaté parohy"

# Tokenize the corrupted sentence and let the discriminator score each token
fake_sentence_tokens = ["[CLS]"] + tokenizer.tokenize(fake_sentence) + ["[SEP]"]
fake_inputs = tokenizer.encode(fake_sentence, return_tensors="pt")
outputs = discriminator(fake_inputs)

# Per-token logits -> probabilities of each token having been replaced
predictions = torch.nn.Sigmoid()(outputs[0]).cpu().detach().numpy()

# Print the tokens and their probabilities in two aligned rows
for token in fake_sentence_tokens:
    print("{:>7s}".format(token), end="")
print()

for prediction in predictions.squeeze():
    print("{:7.1f}".format(prediction), end="")
print()
```
The output shows, for each token, the discriminator's estimated probability that the token does not belong in the sentence (i.e. was faked by the generator):
```
  [CLS]     za   hory      ,     za    dol    ##y      ,  kočka  zlaté   paro   ##hy  [SEP]
    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.8    0.3    0.2    0.1    0.0
```
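If you want the suspected replacements as a list rather than a printed table, you can threshold these probabilities. The sketch below continues from the snippet above (it reuses `fake_sentence_tokens` and `predictions`); the 0.5 cutoff is an arbitrary illustration, not a tuned value:

```python
# Continuing from the snippet above.
# Flag tokens the discriminator considers likely replacements;
# the 0.5 threshold is an arbitrary choice for illustration.
threshold = 0.5
suspected = [
    token
    for token, prob in zip(fake_sentence_tokens, predictions.squeeze())
    if prob > threshold
]
print(suspected)  # with the example above: ['kočka']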
For instructions on how to finetune the model on a new task, see the official Hugging Face Transformers tutorial.
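As a rough idea of what finetuning looks like, here is a minimal sketch for a binary classification task such as clickbait detection. The texts, labels, and training loop are hypothetical placeholders, not our training setup; in practice you would iterate over batches of a real labeled dataset (e.g. via the tutorial above):

```python
import torch
from transformers import ElectraForSequenceClassification, ElectraTokenizerFast

# Hypothetical toy data: 1 = clickbait, 0 = not clickbait
texts = ["Neuvěříte, co se stalo pak!", "Vláda schválila rozpočet"]
labels = torch.tensor([1, 0])

tokenizer = ElectraTokenizerFast.from_pretrained("Seznam/small-e-czech")
model = ElectraForSequenceClassification.from_pretrained(
    "Seznam/small-e-czech", num_labels=2
)

inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):  # toy loop; real training batches over a full dataset
    optimizer.zero_grad()
    outputs = model(**inputs, labels=labels)  # loss computed from labels
    outputs.loss.backward()
    optimizer.step()
```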