Model:
indolem/indobert-base-uncased
IndoBERT is the Indonesian version of the BERT model. We trained the model on over 220M words, aggregated from three main sources.
We trained the model for 2.4M steps (180 epochs), reaching a final perplexity of 3.97 on the development set (similar to English BERT-base).
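To put these training numbers in perspective, the sketch below converts the reported perplexity into a loss value and derives the number of steps per epoch. It assumes the perplexity is the exponential of the mean masked-token cross-entropy in nats, which is the usual convention but is not stated here. The helper names are illustrative only.

```python
import math


def loss_from_perplexity(perplexity):
    """Mean cross-entropy (in nats) implied by a perplexity value.

    Assumes perplexity = exp(mean negative log-likelihood).
    """
    return math.log(perplexity)


def perplexity_from_loss(mean_nll):
    """Inverse of loss_from_perplexity."""
    return math.exp(mean_nll)


# Figures reported in the text above.
reported_perplexity = 3.97
total_steps = 2_400_000
epochs = 180

implied_loss = loss_from_perplexity(reported_perplexity)
steps_per_epoch = total_steps / epochs

print(f"implied dev loss: {implied_loss:.3f} nats")
print(f"steps per epoch:  {steps_per_epoch:.0f}")
```

Under that assumption, a perplexity of 3.97 corresponds to a development loss of roughly 1.38 nats per masked token, and each epoch covers roughly 13,333 steps.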
We used this IndoBERT to examine IndoLEM, an Indonesian benchmark that comprises seven tasks for the Indonesian language, spanning morpho-syntax, semantics, and discourse.
| Task | Metric | Bi-LSTM | mBERT | MalayBERT | IndoBERT |
|---|---|---|---|---|---|
| POS Tagging | Acc | 95.4 | 96.8 | 96.8 | 96.8 |
| NER UGM | F1 | 70.9 | 71.6 | 73.2 | 74.9 |
| NER UI | F1 | 82.2 | 82.2 | 87.4 | 90.1 |
| Dep. Parsing (UD-Indo-GSD) | UAS/LAS | 85.25/80.35 | 86.85/81.78 | 86.99/81.87 | 87.12/82.32 |
| Dep. Parsing (UD-Indo-PUD) | UAS/LAS | 84.04/79.01 | 90.58/85.44 | 88.91/83.56 | 89.23/83.95 |
| Sentiment Analysis | F1 | 71.62 | 76.58 | 82.02 | 84.13 |
| Summarization | R1/R2/RL | 67.96/61.65/67.24 | 68.40/61.66/67.67 | 68.44/61.38/67.71 | 69.93/62.86/69.21 |
| Next Tweet Prediction | Acc | 73.6 | 92.4 | 93.1 | 93.7 |
| Tweet Ordering | Spearman corr. | 0.45 | 0.53 | 0.51 | 0.59 |
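The Tweet Ordering row reports a Spearman correlation between a predicted ordering and the gold ordering. The following is a minimal sketch of that metric for rankings without ties. It is not the benchmark's own evaluation script, and the function name and five-tweet thread are hypothetical.

```python
def spearman_rho(predicted_ranks, gold_ranks):
    """Spearman rank correlation for two rankings without ties.

    Uses the closed form 1 - 6 * sum(d^2) / (n * (n^2 - 1)), which is
    valid when both inputs are permutations of the same positions.
    """
    n = len(gold_ranks)
    if n != len(predicted_ranks):
        raise ValueError("rankings must have the same length")
    if n < 2:
        raise ValueError("need at least two items to correlate")
    squared_diffs = sum((p - g) ** 2 for p, g in zip(predicted_ranks, gold_ranks))
    return 1 - 6 * squared_diffs / (n * (n * n - 1))


# Hypothetical thread of five tweets: the model swaps the 2nd and 3rd tweet.
gold = [0, 1, 2, 3, 4]
predicted = [0, 2, 1, 3, 4]
score = spearman_rho(predicted, gold)
print(score)
```

A perfect ordering scores 1 and a fully reversed ordering scores -1, so the reported values between 0.45 and 0.59 indicate a moderate positive agreement with the gold order.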
The paper was published at the 28th International Conference on Computational Linguistics (COLING 2020). Please refer to https://indolem.github.io for more details about the benchmarks.
```python
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and the pre-trained encoder from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("indolem/indobert-base-uncased")
model = AutoModel.from_pretrained("indolem/indobert-base-uncased")
```
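Once the tokenizer and model are loaded, one common way to use them is to turn sentences into fixed-size vectors. The sketch below mean-pools the last hidden states over non-padding tokens. The pooling strategy and the helper name `encode_sentences` are assumptions made for illustration, not something this model card prescribes.

```python
MODEL_NAME = "indolem/indobert-base-uncased"


def encode_sentences(sentences, model_name=MODEL_NAME):
    """Return one mean-pooled vector per sentence (illustrative helper)."""
    # Imported lazily so that defining the helper does not require the
    # heavy dependencies until it is actually called.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()  # disable dropout for deterministic inference

    # Pad to the longest sentence in the batch and return PyTorch tensors.
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**batch)

    hidden = outputs.last_hidden_state  # (batch, seq_len, hidden_size)
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)

    # Average only over real tokens, ignoring padding positions.
    summed = (hidden * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1)
    return summed / counts


# Example call (downloads the model weights on first use):
# vectors = encode_sentences(["Saya suka membaca buku."])
# print(vectors.shape)
```

Mean pooling is only one option. Taking the vector at the first (`[CLS]`) position is another common choice, and which works better depends on the downstream task.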
If you use our work, please cite:
```bibtex
@inproceedings{koto2020indolem,
  title={IndoLEM and IndoBERT: A Benchmark Dataset and Pre-trained Language Model for Indonesian NLP},
  author={Fajri Koto and Afshin Rahimi and Jey Han Lau and Timothy Baldwin},
  booktitle={Proceedings of the 28th COLING},
  year={2020}
}
```