模型:
gerulata/slovakbert
SlovakBERT pretrained model on Slovak language using a masked language modeling (MLM) objective. This model is case-sensitive: it makes a difference between slovensko and Slovensko.
You can use the raw model for masked language modeling, but it's mostly intended to be fine-tuned on a downstream task. IMPORTANT : The model was not trained on the “ and ” (direct quote) character -> so before tokenizing the text, it is advised to replace all “ and ” (direct quote marks) with a single "(double quote marks).
You can use this model directly with a pipeline for masked language modeling:
from transformers import pipeline unmasker = pipeline('fill-mask', model='gerulata/slovakbert') unmasker("Deti sa <mask> na ihrisku.") [{'sequence': 'Deti sa hrali na ihrisku.', 'score': 0.6355380415916443, 'token': 5949, 'token_str': ' hrali'}, {'sequence': 'Deti sa hrajú na ihrisku.', 'score': 0.14731724560260773, 'token': 9081, 'token_str': ' hrajú'}, {'sequence': 'Deti sa zahrali na ihrisku.', 'score': 0.05016357824206352, 'token': 32553, 'token_str': ' zahrali'}, {'sequence': 'Deti sa stretli na ihrisku.', 'score': 0.041727423667907715, 'token': 5964, 'token_str': ' stretli'}, {'sequence': 'Deti sa učia na ihrisku.', 'score': 0.01886524073779583, 'token': 18099, 'token_str': ' učia'}]
Here is how to use this model to get the features of a given text in PyTorch:
from transformers import RobertaTokenizer, RobertaModel tokenizer = RobertaTokenizer.from_pretrained('gerulata/slovakbert') model = RobertaModel.from_pretrained('gerulata/slovakbert') text = "Text ktorý sa má embedovať." encoded_input = tokenizer(text, return_tensors='pt') output = model(**encoded_input)
and in TensorFlow:
from transformers import RobertaTokenizer, TFRobertaModel tokenizer = RobertaTokenizer.from_pretrained('gerulata/slovakbert') model = TFRobertaModel.from_pretrained('gerulata/slovakbert') text = "Text ktorý sa má embedovať." encoded_input = tokenizer(text, return_tensors='tf') output = model(encoded_input)
Or extract information from the model like this:
from transformers import pipeline unmasker = pipeline('fill-mask', model='gerulata/slovakbert') unmasker("Slovenské národne povstanie sa uskutočnilo v roku <mask>.") [{'sequence': 'Slovenske narodne povstanie sa uskutočnilo v roku 1944.', 'score': 0.7383289933204651, 'token': 16621, 'token_str': ' 1944'},...]
The SlovakBERT model was pretrained on these datasets:
The text was then processed with the following steps:
We segmented the resulting corpus into sentences and removed duplicates to get 181.6M unique sentences. In total, the final corpus has 19.35GB of text.
The model was trained in fairseq on 4 x Nvidia A100 GPUs for 300K steps with a batch size of 512 and a sequence length of 512. The optimizer used is Adam with a learning rate of 5e-4, β 1 = 0.9 \beta_{1} = 0.9 β 1 = 0 . 9 , β 2 = 0.98 \beta_{2} = 0.98 β 2 = 0 . 9 8 and ϵ = 1 e − 6 \epsilon = 1e-6 ϵ = 1 e − 6 , a weight decay of 0.01, dropout rate 0.1, learning rate warmup for 10k steps and linear decay of the learning rate after. We used 16-bit float precision.
Gerulata Technologies is a tech company on a mission to provide tools for fighting disinformation and hostile propaganda.
At Gerulata, we focus on providing state-of-the-art AI-powered tools that empower human analysts and provide them with the ability to make informed decisions.
Our tools allow for the monitoring and analysis of online activity, as well as the detection and tracking of disinformation and hostile propaganda campaigns. With our products, our clients are better equipped to identify and respond to threats in real-time.
If you find our resource or paper is useful, please consider including the following citation in your paper.
@misc{pikuliak2021slovakbert, title={SlovakBERT: Slovak Masked Language Model}, author={Matúš Pikuliak and Štefan Grivalský and Martin Konôpka and Miroslav Blšták and Martin Tamajka and Viktor Bachratý and Marián Šimko and Pavol Balážik and Michal Trnka and Filip Uhlárik}, year={2021}, eprint={2109.15254}, archivePrefix={arXiv}, primaryClass={cs.CL} }