模型:
dkleczek/bert-base-polish-uncased-v1
Polish version of BERT language model is here! It is now available in two variants: cased and uncased, both can be downloaded and used via HuggingFace transformers library. I recommend using the cased model, more info on the differences and benchmark results below.
Below is the list of corpora used along with the output of wc command (counting lines, words and characters). These corpora were divided into sentences with srxsegmenter (see references), concatenated and tokenized with HuggingFace BERT Tokenizer.
Tables | Lines | Words | Characters |
---|---|---|---|
Polish subset of Open Subtitles | 236635408 | 1431199601 | 7628097730 |
Polish subset of ParaCrawl | 8470950 | 176670885 | 1163505275 |
Polish Parliamentary Corpus | 9799859 | 121154785 | 938896963 |
Polish Wikipedia - Feb 2020 | 8014206 | 132067986 | 1015849191 |
Total | 262920423 | 1861093257 | 10746349159 |
Tables | Lines | Words | Characters |
---|---|---|---|
Polish subset of Open Subtitles (Deduplicated) | 41998942 | 213590656 | 1424873235 |
Polish subset of ParaCrawl | 8470950 | 176670885 | 1163505275 |
Polish Parliamentary Corpus | 9799859 | 121154785 | 938896963 |
Polish Wikipedia - Feb 2020 | 8014206 | 132067986 | 1015849191 |
Total | 68283960 | 646479197 | 4543124667 |
Polbert is released via HuggingFace Transformers library .
For an example use as language model, see this notebook file.
from transformers import * model = BertForMaskedLM.from_pretrained("dkleczek/bert-base-polish-uncased-v1") tokenizer = BertTokenizer.from_pretrained("dkleczek/bert-base-polish-uncased-v1") nlp = pipeline('fill-mask', model=model, tokenizer=tokenizer) for pred in nlp(f"Adam Mickiewicz wielkim polskim {nlp.tokenizer.mask_token} był."): print(pred) # Output: # {'sequence': '[CLS] adam mickiewicz wielkim polskim poeta był. [SEP]', 'score': 0.47196975350379944, 'token': 26596} # {'sequence': '[CLS] adam mickiewicz wielkim polskim bohaterem był. [SEP]', 'score': 0.09127858281135559, 'token': 10953} # {'sequence': '[CLS] adam mickiewicz wielkim polskim człowiekiem był. [SEP]', 'score': 0.0647173821926117, 'token': 5182} # {'sequence': '[CLS] adam mickiewicz wielkim polskim pisarzem był. [SEP]', 'score': 0.05232388526201248, 'token': 24293} # {'sequence': '[CLS] adam mickiewicz wielkim polskim politykiem był. [SEP]', 'score': 0.04554257541894913, 'token': 44095}
model = BertForMaskedLM.from_pretrained("dkleczek/bert-base-polish-cased-v1") tokenizer = BertTokenizer.from_pretrained("dkleczek/bert-base-polish-cased-v1") nlp = pipeline('fill-mask', model=model, tokenizer=tokenizer) for pred in nlp(f"Adam Mickiewicz wielkim polskim {nlp.tokenizer.mask_token} był."): print(pred) # Output: # {'sequence': '[CLS] Adam Mickiewicz wielkim polskim pisarzem był. [SEP]', 'score': 0.5391148328781128, 'token': 37120} # {'sequence': '[CLS] Adam Mickiewicz wielkim polskim człowiekiem był. [SEP]', 'score': 0.11683262139558792, 'token': 6810} # {'sequence': '[CLS] Adam Mickiewicz wielkim polskim bohaterem był. [SEP]', 'score': 0.06021466106176376, 'token': 17709} # {'sequence': '[CLS] Adam Mickiewicz wielkim polskim mistrzem był. [SEP]', 'score': 0.051870670169591904, 'token': 14652} # {'sequence': '[CLS] Adam Mickiewicz wielkim polskim artystą był. [SEP]', 'score': 0.031787533313035965, 'token': 35680}
See the next section for an example usage of Polbert in downstream tasks.
Thanks to Allegro, we now have the KLEJ benchmark , a set of nine evaluation tasks for the Polish language understanding. The following results are achieved by running standard set of evaluation scripts (no tricks!) utilizing both cased and uncased variants of Polbert.
Model | Average | NKJP-NER | CDSC-E | CDSC-R | CBD | PolEmo2.0-IN | PolEmo2.0-OUT | DYK | PSC | AR |
---|---|---|---|---|---|---|---|---|---|---|
Polbert cased | 81.7 | 93.6 | 93.4 | 93.8 | 52.7 | 87.4 | 71.1 | 59.1 | 98.6 | 85.2 |
Polbert uncased | 81.4 | 90.1 | 93.9 | 93.5 | 55.0 | 88.1 | 68.8 | 59.4 | 98.8 | 85.4 |
Note how the uncased model performs better than cased on some tasks? My guess this is because of the oversampling of Open Subtitles dataset and its similarity to data in some of these tasks. All these benchmark tasks are sequence classification, so the relative strength of the cased model is not so visible here.
The data used to train the model is biased. It may reflect stereotypes related to gender, ethnicity etc. Please be careful when using the model for downstream task to consider these biases and mitigate them.
Darek Kłeczek - contact me on Twitter @dk21