A BERT base, cased model pretrained on all Romanian tweets posted between 2008 and 2022.
Example usage:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("Iulian277/ro-bert-tweet")
model = AutoModel.from_pretrained("Iulian277/ro-bert-tweet")

# Sanitize the input with the `normalize` function from the repository's
# `normalize.py` script (requires the `emoji` package: pip install emoji)
from normalize import normalize

normalized_text = normalize("Salut, ce faci?")

# Tokenize the sentence and run it through the model
input_ids = torch.tensor(tokenizer.encode(normalized_text, add_special_tokens=True)).unsqueeze(0)  # Batch size 1
outputs = model(input_ids)

# The last hidden state is the first element of the output tuple
last_hidden_states = outputs[0]
```
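If you need a single fixed-size vector per tweet, one common convention (not an official recommendation of this model card) is to mean-pool the last hidden states over non-padding tokens. A minimal sketch, continuing from the snippet above and reusing its `tokenizer`, `model`, and `normalized_text`:

```python
import torch

# Illustrative mean pooling: average the token embeddings, masking out padding.
encoded = tokenizer(normalized_text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**encoded)

mask = encoded["attention_mask"].unsqueeze(-1).float()  # (1, seq_len, 1)
summed = (outputs.last_hidden_state * mask).sum(dim=1)  # sum over tokens
embedding = summed / mask.sum(dim=1)                    # (1, hidden_size)
print(embedding.shape)  # torch.Size([1, 768]) for a BERT-base model
```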
Always use the `normalize.py` script included in the repository to sanitize your input text before feeding it to the tokenizer. Otherwise, performance will degrade because out-of-vocabulary characters are mapped to `[UNK]` tokens.
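As a quick sanity check, you can compare the tokenizer output with and without normalization. A minimal sketch, assuming `tokenizer` and `normalize` from the snippet above are in scope; the example tweet is made up, and whether a given character maps to `[UNK]` depends on the vocabulary:

```python
# Count [UNK] tokens produced with and without normalization (illustrative).
raw_text = "Salut 😀 ce faci?"
for text in (raw_text, normalize(raw_text)):
    tokens = tokenizer.tokenize(text)
    n_unk = tokens.count(tokenizer.unk_token)
    print(f"{n_unk} [UNK] token(s) in: {tokens}")
```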
We would like to thank the TPU Research Cloud program for providing the TPU compute needed to pretrain the RoBERTweet models.