Model:
indolem/indobertweet-base-uncased
Fajri Koto, Jey Han Lau, and Timothy Baldwin. IndoBERTweet: A Pretrained Language Model for Indonesian Twitter with Effective Domain-Specific Vocabulary Initialization. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP 2021), Dominican Republic (virtual).
IndoBERTweet is the first large-scale pretrained model for Indonesian Twitter that is trained by extending a monolingually trained Indonesian BERT model with additive domain-specific vocabulary.
In this paper, we show that initializing domain-specific vocabulary with average-pooling of BERT subword embeddings is more efficient than pretraining from scratch, and more effective than initializing based on word2vec projections.
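As a rough illustration of that initialization trick, the sketch below averages the original IndoBERT subword embeddings of each newly added word to initialize its new embedding row. This is a minimal sketch, not the authors' released pretraining code; the base checkpoint name `indolem/indobert-base-uncased` and the example words are assumptions for illustration.

```python
from transformers import AutoModel, AutoTokenizer

# Sketch of average-pooling vocabulary initialization (illustrative only).
tokenizer = AutoTokenizer.from_pretrained("indolem/indobert-base-uncased")
model = AutoModel.from_pretrained("indolem/indobert-base-uncased")

new_words = ["wkwk", "gaes"]  # hypothetical domain-specific additions

# Record how the *original* tokenizer splits each new word into subwords
subword_ids = {w: tokenizer(w, add_special_tokens=False)["input_ids"] for w in new_words}
old_embeddings = model.get_input_embeddings().weight.data.clone()

# Add the words to the vocabulary and grow the embedding matrix accordingly
tokenizer.add_tokens(new_words)
model.resize_token_embeddings(len(tokenizer))
embeddings = model.get_input_embeddings().weight.data

# Initialize each new row with the average of its original subword embeddings
for w in new_words:
    embeddings[tokenizer.convert_tokens_to_ids(w)] = old_embeddings[subword_ids[w]].mean(dim=0)
```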
We crawl Indonesian tweets over a 1-year period using the official Twitter API, from December 2019 to December 2020, with 60 keywords covering 4 main topics: economy, health, education, and government. We obtain a total of 409M word tokens, two times larger than the training data used to pretrain IndoBERT. Due to Twitter policy, this pretraining data will not be released to the public.
Load model and tokenizer (tested with transformers==3.5.1)
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("indolem/indobertweet-base-uncased")
model = AutoModel.from_pretrained("indolem/indobertweet-base-uncased")
```
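A quick sanity check after loading; the tweet text is a made-up example and assumes the input has already been preprocessed as described below:

```python
# Encode one (preprocessed) tweet and inspect the contextual embeddings
inputs = tokenizer("keren banget @USER HTTPURL", return_tensors="pt")
last_hidden_state = model(**inputs)[0]  # shape: (batch_size, sequence_length, hidden_size)
print(last_hidden_state.shape)
```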
Preprocessing Steps:
* lower-case all words
* convert user mentions and URLs into @USER and HTTPURL, respectively
* translate emoticons into text using the `emoji` package
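A minimal sketch of these steps, assuming simple regular expressions for mentions and URLs; the exact rules and the example tweet are illustrative, not the authors' released preprocessing script:

```python
import re

import emoji  # pip install emoji

def preprocess_tweet(text: str) -> str:
    # Lower-case all words (the model is uncased)
    text = text.lower()
    # Convert user mentions and URLs into @USER and HTTPURL, respectively
    text = re.sub(r"@\w+", "@USER", text)
    text = re.sub(r"https?://\S+|www\.\S+", "HTTPURL", text)
    # Translate emoticons/emoji into text using the emoji package
    return emoji.demojize(text)

print(preprocess_tweet("Keren banget @temanku 😀 https://example.com"))
# e.g. "keren banget @USER :grinning_face: HTTPURL"
```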
| Models | Sentiment (IndoLEM) | Sentiment (SmSA) | Emotion (EmoT) | Hate Speech (HS1) | Hate Speech (HS2) | NER (Formal) | NER (Informal) | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mBERT | 76.6 | 84.7 | 67.5 | 85.1 | 75.1 | 85.2 | 83.2 | 79.6 |
| malayBERT | 82.0 | 84.1 | 74.2 | 85.0 | 81.9 | 81.9 | 81.3 | 81.5 |
| IndoBERT (Wilie et al., 2020) | 84.1 | 88.7 | 73.3 | 86.8 | 80.4 | 86.3 | 84.3 | 83.4 |
| IndoBERT (Koto et al., 2020) | 84.1 | 87.9 | 71.0 | 86.4 | 79.3 | 88.0 | 86.9 | 83.4 |
| IndoBERTweet (1M steps from scratch) | 86.2 | 90.4 | 76.0 | 88.8 | 87.5 | 88.1 | 85.4 | 86.1 |
| IndoBERT + Voc adaptation + 200k steps | 86.6 | 92.7 | 79.0 | 88.4 | 84.0 | 87.7 | 86.9 | 86.5 |
If you use our work, please cite:
```
@inproceedings{koto2021indobertweet,
  title={IndoBERTweet: A Pretrained Language Model for Indonesian Twitter with Effective Domain-Specific Vocabulary Initialization},
  author={Fajri Koto and Jey Han Lau and Timothy Baldwin},
  booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP 2021)},
  year={2021}
}
```