Model:
zanelim/singbert-lite-sg
SingBert Lite - Bert for Singlish (SG) and Manglish (MY).
Similar to SingBert, but the lite version: it was initialized from ALBERT base v2, with pre-training further fine-tuned on Singlish and Manglish data.
You can use this model directly with a pipeline for masked language modeling:

>>> from transformers import pipeline
>>> nlp = pipeline('fill-mask', model='zanelim/singbert-lite-sg')
>>> nlp("die [MASK] must try")
[{'sequence': '[CLS] die die must try[SEP]', 'score': 0.7731555700302124, 'token': 1327, 'token_str': '▁die'},
 {'sequence': '[CLS] die also must try[SEP]', 'score': 0.04763784259557724, 'token': 67, 'token_str': '▁also'},
 {'sequence': '[CLS] die still must try[SEP]', 'score': 0.01859409362077713, 'token': 174, 'token_str': '▁still'},
 {'sequence': '[CLS] die u must try[SEP]', 'score': 0.015824034810066223, 'token': 287, 'token_str': '▁u'},
 {'sequence': '[CLS] die is must try[SEP]', 'score': 0.011271446943283081, 'token': 25, 'token_str': '▁is'}]

>>> nlp("dont play [MASK] leh")
[{'sequence': '[CLS] dont play play leh[SEP]', 'score': 0.4365769624710083, 'token': 418, 'token_str': '▁play'},
 {'sequence': '[CLS] dont play punk leh[SEP]', 'score': 0.06880936771631241, 'token': 6769, 'token_str': '▁punk'},
 {'sequence': '[CLS] dont play game leh[SEP]', 'score': 0.051739856600761414, 'token': 250, 'token_str': '▁game'},
 {'sequence': '[CLS] dont play games leh[SEP]', 'score': 0.045703962445259094, 'token': 466, 'token_str': '▁games'},
 {'sequence': '[CLS] dont play around leh[SEP]', 'score': 0.013458190485835075, 'token': 140, 'token_str': '▁around'}]

>>> nlp("catch no [MASK]")
[{'sequence': '[CLS] catch no ball[SEP]', 'score': 0.6197211146354675, 'token': 1592, 'token_str': '▁ball'},
 {'sequence': '[CLS] catch no balls[SEP]', 'score': 0.08441998809576035, 'token': 7152, 'token_str': '▁balls'},
 {'sequence': '[CLS] catch no joke[SEP]', 'score': 0.0676785409450531, 'token': 8186, 'token_str': '▁joke'},
 {'sequence': '[CLS] catch no?[SEP]', 'score': 0.040638409554958344, 'token': 60, 'token_str': '?'},
 {'sequence': '[CLS] catch no one[SEP]', 'score': 0.03546864539384842, 'token': 53, 'token_str': '▁one'}]

>>> nlp("confirm plus [MASK]")
[{'sequence': '[CLS] confirm plus chop[SEP]', 'score': 0.9608421921730042, 'token': 17144, 'token_str': '▁chop'},
 {'sequence': '[CLS] confirm plus guarantee[SEP]', 'score': 0.011784233152866364, 'token': 9120, 'token_str': '▁guarantee'},
 {'sequence': '[CLS] confirm plus confirm[SEP]', 'score': 0.010571340098977089, 'token': 10265, 'token_str': '▁confirm'},
 {'sequence': '[CLS] confirm plus egg[SEP]', 'score': 0.0033525123726576567, 'token': 6387, 'token_str': '▁egg'},
 {'sequence': '[CLS] confirm plus bet[SEP]', 'score': 0.0008760977652855217, 'token': 5676, 'token_str': '▁bet'}]
Here is how to use this model to get the features of a given text in PyTorch:
from transformers import AlbertTokenizer, AlbertModel

tokenizer = AlbertTokenizer.from_pretrained('zanelim/singbert-lite-sg')
model = AlbertModel.from_pretrained("zanelim/singbert-lite-sg")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
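If you need the actual feature tensors, the returned object exposes ALBERT's standard outputs. A minimal sketch continuing from the snippet above, assuming a recent transformers release where models return output objects rather than plain tuples:

# Token-level features: shape (batch_size, sequence_length, hidden_size)
token_features = output.last_hidden_state
# Pooled [CLS]-based vector, commonly used as a sentence-level representation
sentence_features = output.pooler_output
print(token_features.shape, sentence_features.shape)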
and in TensorFlow:
from transformers import AlbertTokenizer, TFAlbertModel

tokenizer = AlbertTokenizer.from_pretrained("zanelim/singbert-lite-sg")
model = TFAlbertModel.from_pretrained("zanelim/singbert-lite-sg")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)

Limitations and bias
This model was fine-tuned on a colloquial Singlish and Manglish corpus, hence it is best applied to downstream tasks involving the main constituent languages: English, Mandarin, and Malay. Also, as the training data comes mainly from forums, be aware of the inherent bias it may carry.
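As an illustration of such a downstream task, the checkpoint can be fine-tuned for text classification. The sketch below is an assumption-laden example, not part of the original release: it uses AlbertForSequenceClassification with a hypothetical binary task and two made-up Singlish inputs, and leaves the actual training loop and dataset to you:

import torch
from transformers import AlbertTokenizer, AlbertForSequenceClassification

tokenizer = AlbertTokenizer.from_pretrained("zanelim/singbert-lite-sg")
# num_labels=2 is a hypothetical binary task (e.g. positive/negative)
model = AlbertForSequenceClassification.from_pretrained("zanelim/singbert-lite-sg", num_labels=2)

# Made-up examples and labels, for illustration only
texts = ["confirm plus chop this one shiok", "catch no ball sia"]
labels = torch.tensor([1, 0])

encoded = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**encoded, labels=labels)
loss = outputs.loss    # backpropagate this inside your own training loop
logits = outputs.logits  # shape (batch_size, num_labels)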
Training data
Colloquial Singlish and Manglish (both are a mixture of English, Mandarin, Tamil, Malay, and other local dialects like Hokkien, Cantonese, or Teochew) corpus. The corpus was collected from the subreddits r/singapore and r/malaysia, and from forums such as hardwarezone.
Training procedure
Initialized with ALBERT base v2 vocab and checkpoints (pre-trained weights).
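A quick sanity check of this initialization (a sketch, under the assumption that the ALBERT base v2 SentencePiece vocab was reused unchanged, as stated above):

from transformers import AlbertTokenizer

singbert_tokenizer = AlbertTokenizer.from_pretrained("zanelim/singbert-lite-sg")
albert_tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")

# Both tokenizers should expose the same SentencePiece vocabulary size
print(singbert_tokenizer.vocab_size == albert_tokenizer.vocab_size)
# The subword pieces for a Singlish phrase come from that shared vocab
print(singbert_tokenizer.tokenize("catch no ball"))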
Pre-training was further fine-tuned on the training data with the following hyperparameters: