Model:
Musixmatch/umberto-wikipedia-uncased-v1
UmBERTo is a RoBERTa-based language model trained on large Italian corpora. It uses two innovative approaches: SentencePiece and Whole Word Masking. It is now available at github.com/huggingface/transformers.
Marco Lodola, Monument to Umberto Eco, Alessandria 2019
UmBERTo-Wikipedia-Uncased is trained on a relatively small corpus (~7GB) extracted from Wikipedia-ITA.
Model | WWM | Cased | Tokenizer | Vocab Size | Train Steps | Download |
---|---|---|---|---|---|---|
umberto-wikipedia-uncased-v1 | YES | NO | SPM | 32K | 100k | Link |
This model was trained with SentencePiece and Whole Word Masking.
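As a rough illustration of how Whole Word Masking interacts with SentencePiece, the sketch below groups subword pieces into words using the `▁` marker that SentencePiece places at the start of each word, then masks every piece of a selected word together. The segmentation of the example sentence is hypothetical (not actual UmBERTo tokenizer output), the 15% masking rate is an assumed value that the source does not state, and the helper names `group_into_words` and `whole_word_mask` are invented for this example. It also simplifies real pre-training, which typically does not replace every selected token with the mask symbol.

```python
import random

SPM_WORD_START = "\u2581"  # "▁": SentencePiece prefix on the first piece of a word
MASK_TOKEN = "<mask>"


def group_into_words(pieces):
    """Return lists of piece indices, one list per whole word."""
    words = []
    for i, piece in enumerate(pieces):
        if piece.startswith(SPM_WORD_START) or not words:
            words.append([i])      # a new word starts here
        else:
            words[-1].append(i)    # continuation piece of the current word
    return words


def whole_word_mask(pieces, mask_prob=0.15, seed=0):
    """Mask all pieces of randomly chosen words (mask_prob is an assumption)."""
    rng = random.Random(seed)
    words = group_into_words(pieces)
    n_to_mask = min(len(words), max(1, round(len(words) * mask_prob)))
    masked = list(pieces)
    for word in rng.sample(words, n_to_mask):
        for i in word:
            masked[i] = MASK_TOKEN
    return masked


# Hypothetical segmentation, for illustration only.
pieces = ["▁umberto", "▁eco", "▁è", "▁stato", "▁un", "▁grande", "▁scri", "ttore"]
print(whole_word_mask(pieces))
```

With token-level masking, `▁scri` could be masked while `ttore` stays visible, which makes the prediction much easier. Masking the whole word removes that shortcut.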
These results refer to the umberto-wikipedia-uncased model. Full details are on the UmBERTo official page.
Named Entity Recognition (NER)

Dataset | F1 | Precision | Recall | Accuracy |
---|---|---|---|---|
ICAB-EvalITA07 | 86.240 | 85.939 | 86.544 | 98.534 |
WikiNER-ITA | 90.483 | 90.328 | 90.638 | 98.661 |
Part of Speech (POS)

Dataset | F1 | Precision | Recall | Accuracy |
---|---|---|---|---|
UD_Italian-ISDT | 98.563 | 98.508 | 98.618 | 98.717 |
UD_Italian-ParTUT | 97.810 | 97.835 | 97.784 | 98.060 |
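The F1 column can be sanity-checked against the precision and recall columns. The sketch below assumes F1 is the standard harmonic mean of precision and recall (the source does not say how it was computed) and recomputes it from the figures in the tables above. The helper name `f1_score` is chosen for this example.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall, on the same scale as the inputs."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (dataset, reported F1, precision, recall), copied from the tables above.
reported = [
    ("ICAB-EvalITA07", 86.240, 85.939, 86.544),
    ("WikiNER-ITA", 90.483, 90.328, 90.638),
    ("UD_Italian-ISDT", 98.563, 98.508, 98.618),
    ("UD_Italian-ParTUT", 97.810, 97.835, 97.784),
]

for name, f1, precision, recall in reported:
    recomputed = f1_score(precision, recall)
    print(f"{name}: reported {f1:.3f}, recomputed {recomputed:.3f}")
```

Under that assumption, the recomputed values agree with the reported ones up to rounding of the printed precision and recall.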
Load the model with AutoModel and AutoTokenizer:

    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("Musixmatch/umberto-wikipedia-uncased-v1")
    umberto = AutoModel.from_pretrained("Musixmatch/umberto-wikipedia-uncased-v1")

    encoded_input = tokenizer.encode("Umberto Eco è stato un grande scrittore")
    input_ids = torch.tensor(encoded_input).unsqueeze(0)  # Batch size 1
    outputs = umberto(input_ids)
    last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output

Predict masked token:
    from transformers import pipeline

    fill_mask = pipeline(
        "fill-mask",
        model="Musixmatch/umberto-wikipedia-uncased-v1",
        tokenizer="Musixmatch/umberto-wikipedia-uncased-v1"
    )

    result = fill_mask("Umberto Eco è <mask> un grande scrittore")
    # {'sequence': '<s> umberto eco è stato un grande scrittore</s>', 'score': 0.5784581303596497, 'token': 361}
    # {'sequence': '<s> umberto eco è anche un grande scrittore</s>', 'score': 0.33813193440437317, 'token': 269}
    # {'sequence': '<s> umberto eco è considerato un grande scrittore</s>', 'score': 0.027196012437343597, 'token': 3236}
    # {'sequence': '<s> umberto eco è diventato un grande scrittore</s>', 'score': 0.013716378249228, 'token': 5742}
    # {'sequence': '<s> umberto eco è inoltre un grande scrittore</s>', 'score': 0.010662357322871685, 'token': 1030}
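The pipeline returns a list of dictionaries, one per candidate token, and each `score` is a probability over the vocabulary. As a small sketch of how that list might be consumed, the snippet below copies the five predictions shown above into a literal (so it runs without downloading the model), picks the highest-scoring one, and sums the probability mass covered by the top five. The variable names `best` and `covered` are chosen for this example.

```python
# Predictions copied from the fill-mask output shown above.
result = [
    {"sequence": "<s> umberto eco è stato un grande scrittore</s>",
     "score": 0.5784581303596497, "token": 361},
    {"sequence": "<s> umberto eco è anche un grande scrittore</s>",
     "score": 0.33813193440437317, "token": 269},
    {"sequence": "<s> umberto eco è considerato un grande scrittore</s>",
     "score": 0.027196012437343597, "token": 3236},
    {"sequence": "<s> umberto eco è diventato un grande scrittore</s>",
     "score": 0.013716378249228, "token": 5742},
    {"sequence": "<s> umberto eco è inoltre un grande scrittore</s>",
     "score": 0.010662357322871685, "token": 1030},
]

# Highest-probability completion.
best = max(result, key=lambda pred: pred["score"])
print(best["sequence"], best["token"])

# Share of the probability mass captured by the listed candidates.
covered = sum(pred["score"] for pred in result)
print(f"top-{len(result)} probability mass: {covered:.3f}")
```

Here the two leading candidates ("stato" and "anche") account for most of the probability, so the remaining entries matter little for this sentence.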
All of the original datasets are publicly available or were released with the owners' permission. All of them are released under a CC0 or CC BY license.
    @inproceedings{magnini2006annotazione,
      title = {Annotazione di contenuti concettuali in un corpus italiano: I-CAB},
      author = {Magnini, Bernardo and Cappelli, Amedeo and Pianta, Emanuele and Speranza, Manuela and Bartalesi Lenzi, V and Sprugnoli, Rachele and Romano, Lorenza and Girardi, Christian and Negri, Matteo},
      booktitle = {Proc. of SILFI 2006},
      year = {2006}
    }

    @inproceedings{magnini2006cab,
      title = {I-CAB: the Italian Content Annotation Bank.},
      author = {Magnini, Bernardo and Pianta, Emanuele and Girardi, Christian and Negri, Matteo and Romano, Lorenza and Speranza, Manuela and Lenzi, Valentina Bartalesi and Sprugnoli, Rachele},
      booktitle = {LREC},
      pages = {963--968},
      year = {2006},
      organization = {Citeseer}
    }
Loreto Parisi: loreto at musixmatch dot com, loretoparisi

Simone Francia: simone.francia at musixmatch dot com, simonefrancia

Paolo Magnani: paul.magnani95 at gmail dot com, paulthemagno
We do Machine Learning and Artificial Intelligence @ musixmatch. Follow us on Twitter and GitHub.