Model:
tartuNLP/EstBERT
EstBERT is a pretrained BERT Base model trained exclusively on a cased Estonian corpus, with sequence lengths of both 128 and 512.
You can use the model with the Transformers library in both its TensorFlow and PyTorch versions:
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("tartuNLP/EstBERT")
model = AutoModelForMaskedLM.from_pretrained("tartuNLP/EstBERT")
```
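As a quick sanity check of the masked-LM head, the model can also be queried through the fill-mask pipeline. The example sentence below and its expected completions are only illustrative:

```python
from transformers import pipeline

# Load EstBERT into the fill-mask pipeline; BERT-style models use the [MASK] token.
fill_mask = pipeline("fill-mask", model="tartuNLP/EstBERT", tokenizer="tartuNLP/EstBERT")

# Ask the model to complete an Estonian sentence ("Tallinn is the [MASK] of Estonia.").
for prediction in fill_mask("Tallinn on Eesti [MASK]."):
    print(prediction["token_str"], prediction["score"])
```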
You can also download the pretrained checkpoints directly: EstBERT_128 and EstBERT_512.
Dataset used to train the model

The EstBERT model is trained on data with sequence lengths of both 128 and 512. For training EstBERT we used the Estonian National Corpus 2017, which was the largest Estonian language corpus available at the time. It consists of four sub-corpora: Estonian Reference Corpus 1990-2008, Estonian Web Corpus 2013, Estonian Web Corpus 2017, and Estonian Wikipedia Corpus 2017.
Overall, EstBERT performs better than mBERT and XLM-RoBERTa on part-of-speech (POS) tagging, named entity recognition (NER), rubric classification, and sentiment classification tasks. The comparative results can be found below:
Model | UPOS 128 | XPOS 128 | Morph 128 | UPOS 512 | XPOS 512 | Morph 512 |
---|---|---|---|---|---|---|
EstBERT | 97.89 | 98.40 | 96.93 | 97.84 | 98.43 | 96.80 |
mBERT | 97.42 | 98.06 | 96.24 | 97.43 | 98.13 | 96.13 |
XLM-RoBERTa | 97.78 | 98.36 | 96.53 | 97.80 | 98.40 | 96.69 |
Model | Rubric 128 | Sentiment 128 | Rubric 512 | Sentiment 512 |
---|---|---|---|---|
EstBERT | 81.70 | 74.36 | 80.96 | 74.50 |
mBERT | 75.67 | 70.23 | 74.94 | 69.52 |
XLM-RoBERTa | 80.34 | 74.50 | 78.62 | 76.07 |
Named entity recognition results:

Model | Precision 128 | Recall 128 | F1-Score 128 | Precision 512 | Recall 512 | F1-Score 512 |
---|---|---|---|---|---|---|
EstBERT | 88.42 | 90.38 | 89.39 | 88.35 | 89.74 | 89.04 |
mBERT | 85.88 | 87.09 | 86.51 | 88.47 | 88.28 | 88.37 |
XLM-RoBERTa | 87.55 | 91.19 | 89.34 | 87.50 | 90.76 | 89.10 |
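For reference, a minimal entry point for fine-tuning EstBERT on a token-level task such as the POS or NER evaluations above might look like the sketch below. The label set and the example sentence are hypothetical placeholders, not the data or configuration used in these experiments:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical NER tag set, used only to size the classification head.
labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

tokenizer = AutoTokenizer.from_pretrained("tartuNLP/EstBERT")
model = AutoModelForTokenClassification.from_pretrained(
    "tartuNLP/EstBERT",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

# Forward pass on one Estonian sentence; the token-classification head is freshly
# initialised, so its outputs are meaningless until the model is fine-tuned on labelled data.
inputs = tokenizer("Tartu Ülikool asub Eestis.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)  # (batch_size, sequence_length, num_labels)
```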