Model:
w11wo/malaysian-distilbert-small
Malaysian DistilBERT Small is a masked language model based on the DistilBERT model. It was trained on the OSCAR dataset, specifically the unshuffled_original_ms subset.
The model was initialized from HuggingFace's pretrained English DistilBERT model and was later fine-tuned on the Malaysian dataset. It achieved a perplexity of 10.33 on the validation dataset (20% of the dataset). Many of the techniques used are based on a Hugging Face tutorial notebook written by Sylvain Gugger, and a fine-tuning tutorial notebook written by Pierre Guillou.
Hugging Face's Transformers library was used to train the model -- utilizing the base DistilBERT model and their Trainer class. PyTorch was used as the backend framework during training, but the model remains compatible with TensorFlow nonetheless.
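For illustration, below is a minimal sketch of how such a masked-LM fine-tuning run could be set up with the `datasets` library and the `Trainer` class. The starting checkpoint name, sequence length, batch size, and other hyperparameters are assumptions chosen for the example, not the exact settings used to train this model.

```python
# Minimal masked-LM fine-tuning sketch; hyperparameters are illustrative only.
from datasets import load_dataset
from transformers import (
    DataCollatorForLanguageModeling,
    DistilBertForMaskedLM,
    DistilBertTokenizerFast,
    Trainer,
    TrainingArguments,
)

# Start from a pretrained English DistilBERT checkpoint (assumed name).
checkpoint = "distilbert-base-uncased"
tokenizer = DistilBertTokenizerFast.from_pretrained(checkpoint)
model = DistilBertForMaskedLM.from_pretrained(checkpoint)

# Load the Malay subset of OSCAR and hold out 20% for validation.
dataset = load_dataset("oscar", "unshuffled_original_ms", split="train")
dataset = dataset.train_test_split(test_size=0.2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["id", "text"])

# The collator randomly masks tokens so the model learns the fill-mask objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="malaysian-distilbert-small",
    num_train_epochs=1,                # the model card reports a single epoch
    per_device_train_batch_size=16,    # illustrative batch size
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=collator,
)

trainer.train()
```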
Model | #params | Arch. | Training/Validation data (text) |
---|---|---|---|
malaysian-distilbert-small | 66M | DistilBERT Small | OSCAR unshuffled_original_ms Dataset |
The model was trained for 1 epoch, and the final results at the end of training are shown below.
train loss | valid loss | perplexity | total time |
---|---|---|---|
2.476 | 2.336 | 10.33 | 0:40:05 |
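The reported perplexity is simply the exponential of the validation loss, which can be checked directly:

```python
import math

valid_loss = 2.336
perplexity = math.exp(valid_loss)
print(perplexity)  # ~10.34; matches the reported 10.33 up to rounding of the logged loss
```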
```python
from transformers import pipeline

pretrained_name = "w11wo/malaysian-distilbert-small"

# Fill-mask pipeline using the pretrained model and its tokenizer
fill_mask = pipeline(
    "fill-mask",
    model=pretrained_name,
    tokenizer=pretrained_name
)

fill_mask("Henry adalah seorang lelaki yang tinggal di [MASK].")
```
```python
from transformers import DistilBertModel, DistilBertTokenizerFast

pretrained_name = "w11wo/malaysian-distilbert-small"

# Load the base model and its fast tokenizer for feature extraction in PyTorch
model = DistilBertModel.from_pretrained(pretrained_name)
tokenizer = DistilBertTokenizerFast.from_pretrained(pretrained_name)

prompt = "Bolehkah anda [MASK] Bahasa Melayu?"
encoded_input = tokenizer(prompt, return_tensors='pt')
output = model(**encoded_input)
```
Do consider the biases that come from the OSCAR dataset, which may be carried over into the results of this model.
Malaysian DistilBERT Small was trained and evaluated by Wilson Wongso. All computation and development were done on Google Colaboratory using their free GPU access.