It is BERT-base model pre-trained with indonesian Wikipedia and indonesian newspapers using a masked language modeling (MLM) objective. This model is uncased.
This is one of several other language models that have been pre-trained with indonesian datasets. More detail about its usage on downstream tasks (text classification, text generation, etc) is available at Transformer based Indonesian Language Models
You can use this model directly with a pipeline for masked language modeling:
>>> from transformers import pipeline >>> unmasker = pipeline('fill-mask', model='cahya/bert-base-indonesian-1.5G') >>> unmasker("Ibu ku sedang bekerja [MASK] supermarket") [{'sequence': '[CLS] ibu ku sedang bekerja di supermarket [SEP]', 'score': 0.7983310222625732, 'token': 1495}, {'sequence': '[CLS] ibu ku sedang bekerja. supermarket [SEP]', 'score': 0.090003103017807, 'token': 17}, {'sequence': '[CLS] ibu ku sedang bekerja sebagai supermarket [SEP]', 'score': 0.025469014421105385, 'token': 1600}, {'sequence': '[CLS] ibu ku sedang bekerja dengan supermarket [SEP]', 'score': 0.017966199666261673, 'token': 1555}, {'sequence': '[CLS] ibu ku sedang bekerja untuk supermarket [SEP]', 'score': 0.016971781849861145, 'token': 1572}]
Here is how to use this model to get the features of a given text in PyTorch:
from transformers import BertTokenizer, BertModel model_name='cahya/bert-base-indonesian-1.5G' tokenizer = BertTokenizer.from_pretrained(model_name) model = BertModel.from_pretrained(model_name) text = "Silakan diganti dengan text apa saja." encoded_input = tokenizer(text, return_tensors='pt') output = model(**encoded_input)
and in Tensorflow:
from transformers import BertTokenizer, TFBertModel model_name='cahya/bert-base-indonesian-1.5G' tokenizer = BertTokenizer.from_pretrained(model_name) model = TFBertModel.from_pretrained(model_name) text = "Silakan diganti dengan text apa saja." encoded_input = tokenizer(text, return_tensors='tf') output = model(encoded_input)
This model was pre-trained with 522MB of indonesian Wikipedia and 1GB of indonesian newspapers . The texts are lowercased and tokenized using WordPiece and a vocabulary size of 32,000. The inputs of the model are then of the form:
[CLS] Sentence A [SEP] Sentence B [SEP]