RoBERTa BASE model pretrained on assorted Thai texts (78.5 GB). The script and documentation can be found at this repository.
The architecture of the pretrained model is based on RoBERTa [Liu et al., 2019].
You can use the pretrained model for masked language modeling (i.e. predicting a masked token in the input text). In addition, we provide finetuned models for multiclass/multilabel text classification and token classification tasks.
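As a rough illustration of masked-token prediction, the sketch below uses the Hugging Face transformers fill-mask pipeline. The Hub identifier airesearch/wangchanberta-base-att-spm-uncased and the helper names build_masked_input and predict_mask are assumptions for illustration, not taken from this card. Input text may also need the preprocessing rules mentioned later in this card before it is passed to the model.

```python
from typing import List, Tuple

# Assumed Hub identifier; only the model name itself appears in this card.
MODEL_ID = "airesearch/wangchanberta-base-att-spm-uncased"


def build_masked_input(prefix: str, suffix: str, mask_token: str) -> str:
    """Place a single mask token between the given prefix and suffix."""
    return f"{prefix}{mask_token}{suffix}"


def predict_mask(prefix: str, suffix: str = "", top_k: int = 5) -> List[Tuple[str, float]]:
    """Return the top_k (token, score) candidates for the masked position."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=MODEL_ID)
    # Read the mask token from the tokenizer instead of hard-coding it.
    text = build_masked_input(prefix, suffix, fill.tokenizer.mask_token)
    return [(r["token_str"], r["score"]) for r in fill(text, top_k=top_k)]
```

Calling predict_mask with a Thai sentence prefix downloads the checkpoint on first use and returns the highest-scoring candidate tokens for the masked position.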
Multiclass text classification
4-class text classification task (positive, neutral, negative, and question) based on social media posts and tweets.
User review rating classification task (on a scale from 1 to 5).
generated_reviews_enth : (review_star as label)
Generated user review rating classification task (on a scale from 1 to 5).
Multilabel text classification
Thai topic classification with 12 labels based on a news article corpus from prachathai.com. Details are described on this page.
Token classification
Named-entity recognition tagging with 13 named-entity types, as described on this page.
lst20 : NER and POS tagging
Named-entity recognition tagging with 10 named-entity types and part-of-speech tagging with 16 tags, as described on this page.
The getting-started notebook for the WangchanBERTa model can be found at this Colab notebook.
The wangchanberta-base-att-spm-uncased model was pretrained on an assorted Thai text dataset. The total size of the uncompressed text is 78.5 GB.
Texts are preprocessed with the following rules:
Regarding the vocabulary, we use SentencePiece [Kudo, 2018] to train a unigram model. The tokenizer has a vocabulary size of 25,000 subwords, trained on 15M sentences sampled from the training set.
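The tokenizer setup above could be reproduced roughly as follows with the sentencepiece Python package. This is a sketch under assumptions: the corpus path and model prefix are hypothetical, and using input_sentence_size with shuffle_input_sentence is only one plausible way to sample 15M sentences, not necessarily how the authors did it.

```python
from typing import Any, Dict


def tokenizer_training_args(corpus_path: str, model_prefix: str) -> Dict[str, Any]:
    """Collect SentencePiece trainer options matching the figures in this card."""
    return {
        "input": corpus_path,                # one sentence per line (hypothetical file)
        "model_prefix": model_prefix,        # writes <prefix>.model and <prefix>.vocab
        "model_type": "unigram",             # SentencePiece unigram model
        "vocab_size": 25_000,                # 25,000 subwords
        "input_sentence_size": 15_000_000,   # cap on sentences used (assumed sampling method)
        "shuffle_input_sentence": True,      # draw that sample at random
    }


def train_tokenizer(corpus_path: str, model_prefix: str) -> None:
    """Train the unigram tokenizer; requires the sentencepiece package."""
    import sentencepiece as spm  # lazy import so the args helper has no dependency

    spm.SentencePieceTrainer.train(**tokenizer_training_args(corpus_path, model_prefix))
```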
Each sequence is limited to at most 416 subword tokens.
Regarding the masking procedure, for each sequence, we sampled 15% of the tokens and replace them with
Train/Val/Test splits
After preprocessing and deduplication, we have a training set of 381,034,638 unique, mostly Thai sentences with sequence lengths of 5 to 300 words (78.5 GB). The training set has a total of 16,957,775,412 words as tokenized by dictionary-based maximal matching [Phatthiyaphaibun et al., 2020], 8,680,485,067 subwords as tokenized by the SentencePiece tokenizer, and 53,035,823,287 characters.
The model was trained on 8 V100 GPUs for 500,000 steps with a batch size of 4,096 (32 sequences per device with 16 accumulation steps) and a sequence length of 416 tokens. The optimizer is Adam with a learning rate of $3e-4$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 1e-6$. The learning rate is warmed up for the first 24,000 steps and then linearly decayed to zero. The model checkpoint with the minimum validation loss will be selected as the best checkpoint.
As of Sun 24 Jan 2021, we release the model from the checkpoint at 360,000 steps because pretraining has not yet been completed.
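The batch size and learning-rate schedule described here can be checked with a few lines of Python. This is a minimal sketch, not the training code: the function name learning_rate is hypothetical, and the warmup is assumed to be linear from zero, which the text does not state explicitly.

```python
NUM_GPUS = 8
SEQUENCES_PER_DEVICE = 32
ACCUMULATION_STEPS = 16

# Effective batch size: devices x per-device sequences x accumulation steps.
EFFECTIVE_BATCH_SIZE = NUM_GPUS * SEQUENCES_PER_DEVICE * ACCUMULATION_STEPS


def learning_rate(step: int,
                  peak_lr: float = 3e-4,
                  warmup_steps: int = 24_000,
                  total_steps: int = 500_000) -> float:
    """Warm up to peak_lr (assumed linear), then decay linearly to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)
```

Under these assumptions the rate peaks at step 24,000 and reaches zero at step 500,000, so the released 360,000-step checkpoint was saved while the rate was still decaying.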
BibTeX entry and citation info
@misc{lowphansirikul2021wangchanberta,
  title={WangchanBERTa: Pretraining transformer-based Thai Language Models},
  author={Lalita Lowphansirikul and Charin Polpanumas and Nawat Jantrakulchai and Sarana Nutanong},
  year={2021},
  eprint={2101.09635},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}