Model: sagorsarker/bangla-bert-base
It has been a long journey, and here is our Bangla-BERT! It is now available on the Hugging Face model hub.
Bangla-BERT-Base is a pretrained Bengali language model trained with masked language modeling, as described in the BERT paper and its GitHub repository.
The corpus was downloaded from two main sources:
After downloading these corpora, we preprocessed them into the BERT pretraining format: one sentence per line, with an extra blank line between documents.
```
sentence 1
sentence 2

sentence 1
sentence 2
```
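For illustration, here is a minimal, hedged sketch of that preprocessing step. It naively splits sentences on the Bengali danda ("।"); the actual pipeline may have used a proper sentence tokenizer, and the output path is hypothetical.

```python
# Hedged sketch: write one sentence per line and a blank line between
# documents (the BERT pretraining format described above). Splitting on
# the Bengali danda "।" is a simplification; "pretrain_corpus.txt" is a
# hypothetical output path.
def to_bert_pretrain_format(documents, out_path="pretrain_corpus.txt"):
    with open(out_path, "w", encoding="utf-8") as out:
        for doc in documents:
            for sentence in doc.split("।"):
                sentence = sentence.strip()
                if sentence:
                    out.write(sentence + "।\n")
            out.write("\n")  # blank line marks a document boundary

to_bert_pretrain_format(["আমি বাংলায় গান গাই। আমি ভাত খাই।"])
```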
We used the BNLP package to train a Bengali SentencePiece model with a vocabulary size of 102025, and then converted the output vocab file into the BERT format. Our final vocab file is available at https://github.com/sagorbrur/bangla-bert and also on the Hugging Face model hub.
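As a rough illustration (not the exact training script), the vocabulary could be built with the sentencepiece library that BNLP wraps and then rewritten as BERT's one-token-per-line vocab.txt. The corpus path below is an assumption, and special tokens such as [CLS]/[SEP]/[MASK] would still need to be added by hand.

```python
import sentencepiece as spm

# Train a SentencePiece model with the vocab size mentioned above (102025).
# "bangla_corpus.txt" is a hypothetical path to the preprocessed corpus.
spm.SentencePieceTrainer.train(
    input="bangla_corpus.txt",
    model_prefix="bn_spm",
    vocab_size=102025,
)

# Rewrite the SentencePiece vocab as a BERT-style vocab.txt: "▁"-prefixed
# pieces start a word; everything else becomes a "##" continuation token.
# (BERT's special tokens would also need to be prepended to this file.)
with open("bn_spm.vocab", encoding="utf-8") as fin, \
        open("vocab.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        piece = line.split("\t")[0]
        if piece.startswith("▁"):
            token = piece.lstrip("▁") or "▁"   # word-initial piece
        else:
            token = "##" + piece               # word-continuation piece
        fout.write(token + "\n")
```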
After training for 1 million steps, here are the evaluation results:
```
global_step = 1000000
loss = 2.2406516
masked_lm_accuracy = 0.60641736
masked_lm_loss = 2.201459
next_sentence_accuracy = 0.98625
next_sentence_loss = 0.040997364
perplexity = numpy.exp(2.2406516) = 9.393331287442784
Loss for final step: 2.426227
```
Huge thanks to Nick Doiron for providing evaluation results on the classification tasks. He used the Bengali Classification Benchmark datasets. Compared to Nick's Bengali Electra and multilingual BERT, Bangla-BERT-Base achieves state-of-the-art results. Here is the evaluation script; a hedged fine-tuning sketch also follows the table below.
| Model | Sentiment Analysis | Hate Speech Task | News Topic Task | Average |
|---|---|---|---|---|
| mBERT | 68.15 | 52.32 | 72.27 | 64.25 |
| Bengali Electra | 69.19 | 44.84 | 82.33 | 65.45 |
| Bangla BERT Base | 70.37 | 71.83 | 89.19 | 77.13 |
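For reference, here is a hedged sketch of how such a sequence-classification fine-tuning run could be set up with the transformers Trainer API. This is not Nick's actual evaluation script; the two example texts, labels, output directory, and hyperparameters below are made-up placeholders.

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

model_name = "sagorsarker/bangla-bert-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tiny made-up dataset purely for illustration (1 = positive, 0 = negative).
data = Dataset.from_dict({
    "text": ["খুব ভালো লেগেছে।", "একদম ভালো লাগেনি।"],
    "label": [1, 0],
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64)

data = data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bnbert-clf", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=data,
)
trainer.train()
```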
We also evaluated Bangla-BERT-Base on the Wikiann Bengali NER dataset, alongside three benchmark models (mBERT, XLM-R, and Indic-BERT). After training each model for 5 epochs, Bangla-BERT-Base came in third, with mBERT first and XLM-R second.
| Base Pre-trained Model | F1 Score | Accuracy |
|---|---|---|
| mBERT-uncased | 97.11 | 97.68 |
| XLM-R | 96.22 | 97.03 |
| Indic-BERT | 92.66 | 94.74 |
| Bangla-BERT-Base | 95.57 | 97.49 |
All four models were fine-tuned with the transformers token-classification notebook; a hedged sketch of that setup is shown below. You can find all of the models' evaluation results here.
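The sketch below shows how a Wikiann Bengali token-classification run could look. The real runs followed the Hugging Face token-classification notebook; the output directory and batch size here are only illustrative (5 epochs matches the setup described above).

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification, Trainer, TrainingArguments)

raw = load_dataset("wikiann", "bn")
labels = raw["train"].features["ner_tags"].feature.names

model_name = "sagorsarker/bangla-bert-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(labels))

def tokenize_and_align(batch):
    enc = tokenizer(batch["tokens"], truncation=True, is_split_into_words=True)
    all_labels = []
    for i, tags in enumerate(batch["ner_tags"]):
        word_ids = enc.word_ids(batch_index=i)
        label_ids, prev = [], None
        for w in word_ids:
            # special tokens and sub-word continuations are ignored by the loss
            label_ids.append(tags[w] if w is not None and w != prev else -100)
            prev = w
        all_labels.append(label_ids)
    enc["labels"] = all_labels
    return enc

tokenized = raw.map(tokenize_and_align, batched=True,
                    remove_columns=raw["train"].column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bnbert-wikiann", num_train_epochs=5,
                           per_device_train_batch_size=16),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```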
You can also check the paper list below; those works used this model on their datasets.
NB: If you use this model for any NLP task, please share your evaluation results with us and we will add them here.
Bangla BERT Tokenizer
```python
from transformers import AutoTokenizer, AutoModel

bnbert_tokenizer = AutoTokenizer.from_pretrained("sagorsarker/bangla-bert-base")
text = "আমি বাংলায় গান গাই।"
bnbert_tokenizer.tokenize(text)
# ['আমি', 'বাংলা', '##য', 'গান', 'গাই', '।']
```
MASK Generation
You can use this model directly with a pipeline for masked language modeling:
```python
from transformers import BertForMaskedLM, BertTokenizer, pipeline

model = BertForMaskedLM.from_pretrained("sagorsarker/bangla-bert-base")
tokenizer = BertTokenizer.from_pretrained("sagorsarker/bangla-bert-base")
nlp = pipeline('fill-mask', model=model, tokenizer=tokenizer)

for pred in nlp(f"আমি বাংলায় {nlp.tokenizer.mask_token} গাই।"):
    print(pred)

# {'sequence': '[CLS] আমি বাংলায গান গাই । [SEP]', 'score': 0.13404667377471924, 'token': 2552, 'token_str': 'গান'}
```
If you find this model helpful, please cite:
```bibtex
@misc{Sagor_2020,
  title  = {BanglaBERT: Bengali Mask Language Model for Bengali Language Understanding},
  author = {Sagor Sarker},
  year   = {2020},
  url    = {https://github.com/sagorbrur/bangla-bert}
}
```