Model:
allegro/herbert-base-cased
HerBERT is a BERT-based language model trained on Polish corpora using Masked Language Modelling (MLM) and a Sentence Structural Objective (SSO), with dynamic masking of whole words. For more details, please refer to: HerBERT: Efficiently Pretrained Transformer-based Language Model for Polish.
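To make "dynamic masking of whole words" concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the HerBERT training code: the `<mask>` string, the `</w>` end-of-word marker, the masking probability, and the example segmentation are all hypothetical, and the usual BERT step of sometimes substituting random or unchanged tokens is left out. The point is only that all subword pieces of a chosen word are masked together, and that the choice is redrawn every time the sequence is seen.

```python
import random

MASK = "<mask>"   # assumed mask token string (hypothetical)
SUFFIX = "</w>"   # assumed end-of-word marker on the last piece of each word


def group_words(tokens):
    """Group subword indices into words; a word ends at a token carrying SUFFIX."""
    words, current = [], []
    for i, tok in enumerate(tokens):
        current.append(i)
        if tok.endswith(SUFFIX):
            words.append(current)
            current = []
    if current:  # trailing pieces without an end-of-word marker
        words.append(current)
    return words


def mask_whole_words(tokens, mask_prob=0.15, rng=random):
    """Return (masked_tokens, labels).

    labels holds the original token at masked positions and None elsewhere.
    """
    masked = list(tokens)
    labels = [None] * len(tokens)
    for word in group_words(tokens):
        # One random draw per word, so its pieces are masked all together or not at all.
        if rng.random() < mask_prob:
            for i in word:
                labels[i] = tokens[i]
                masked[i] = MASK
    return masked, labels


# Hypothetical subword segmentation of a short Polish phrase.
tokens = ["A</w>", "po", "tem</w>", "sze", "dł</w>", "środ", "kiem</w>", "dro", "gi</w>"]
rng = random.Random(0)
for epoch in range(3):
    # "Dynamic": a fresh mask is sampled on every pass over the same sequence.
    masked, _ = mask_whole_words(tokens, mask_prob=0.3, rng=rng)
    print(epoch, masked)
```

Static masking would instead fix the masked positions once during preprocessing and reuse them in every epoch.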
Model training and experiments were conducted with transformers version 2.9.
HerBERT was trained on six different corpora available for the Polish language:
| Corpus | Tokens | Documents | 
|---|---|---|
| 1235321 | 3243M | 7.9M | 
| 1236321 | 2641M | 7.0M | 
| 1237321 | 1357M | 3.9M | 
| 1238321 | 1056M | 1.1M | 
| 1239321 | 260M | 1.4M | 
| 12310321 | 41M | 5.5k | 
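As a quick sanity check on the table, the sketch below sums the token and document counts and reports each row's share of the total. The numbers are copied from the table in row order; the variable names are only illustrative.

```python
# (tokens in millions, documents in millions), copied from the table in row order
corpora = [
    (3243, 7.9),
    (2641, 7.0),
    (1357, 3.9),
    (1056, 1.1),
    (260, 1.4),
    (41, 0.0055),  # 5.5k documents expressed in millions
]

total_tokens = sum(t for t, _ in corpora)
total_docs = sum(d for _, d in corpora)

print(f"total tokens: {total_tokens}M")
print(f"total documents: {total_docs:.1f}M")
for row, (t, d) in enumerate(corpora, start=1):
    share = 100 * t / total_tokens
    # Both values are in millions, so the ratio is tokens per document.
    print(f"row {row}: {share:.1f}% of tokens, ~{t / d:.0f} tokens per document")
```

The token column sums to 8598M, that is, roughly 8.6 billion tokens across the six corpora.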
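The training dataset was tokenized into subwords using character-level byte-pair encoding (CharBPETokenizer) with a vocabulary size of 50k tokens. The tokenizer itself was trained with the tokenizers library.
We kindly encourage you to use the fast version of the tokenizer, namely HerbertTokenizerFast.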
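For readers unfamiliar with character-level byte-pair encoding, the following toy sketch shows the idea in plain Python. It is not the tokenizers library implementation and does not reproduce the HerBERT vocabulary: the `</w>` end-of-word suffix, the tie-breaking rule, the tiny corpus, and the function names are all assumptions made for illustration. Training repeatedly merges the most frequent adjacent symbol pair. A real vocabulary of 50k tokens would come from tens of thousands of such merges over the full corpus.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_char_bpe(words, num_merges, suffix="</w>"):
    """Learn up to `num_merges` merges from an iterable of words (toy version)."""
    # Each word starts as single characters; the last one carries the end-of-word suffix.
    vocab = Counter(tuple(w[:-1]) + (w[-1] + suffix,) for w in words if w)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:  # every word is already a single symbol
            break
        # Most frequent pair; ties are broken lexicographically (an arbitrary choice).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode_word(word, merges, suffix="</w>"):
    """Segment one word by applying the learned merges in training order."""
    symbols = tuple(word[:-1]) + (word[-1] + suffix,)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


corpus = "droga drogi drodze droga pole pola polem".split()
merges = train_char_bpe(corpus, num_merges=6)
print(merges)
print(encode_word("drogi", merges))
```

In practice you would load the released tokenizer rather than train your own; the sketch is only meant to show what the subword units are.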
Example code:
from transformers import AutoTokenizer, AutoModel

# Load the pretrained tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")
model = AutoModel.from_pretrained("allegro/herbert-base-cased")

# Encode one sentence pair as a batch of size 1 and run it through the model
output = model(
    **tokenizer.batch_encode_plus(
        [
            (
                "A potem szedł środkiem drogi w kurzawie, bo zamiatał nogami, ślepy dziad prowadzony przez tłustego kundla na sznurku.",
                "A potem leciał od lasu chłopak z butelką, ale ten ujrzawszy księdza przy drodze okrążył go z dala i biegł na przełaj pól do karczmy."
            )
        ],
        padding='longest',
        add_special_tokens=True,
        return_tensors='pt'
    )
)
License: CC BY 4.0
If you use this model, please cite the following paper:
@inproceedings{mroczkowski-etal-2021-herbert,
    title = "{H}er{BERT}: Efficiently Pretrained Transformer-based Language Model for {P}olish",
    author = "Mroczkowski, Robert  and
      Rybak, Piotr  and
      Wr{\'o}blewska, Alina  and
      Gawlik, Ireneusz",
    booktitle = "Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing",
    month = apr,
    year = "2021",
    address = "Kyiv, Ukraine",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.bsnlp-1.1",
    pages = "1--10",
}
The model was trained by the Machine Learning Research Team at Allegro and the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences.
You can contact us by email at klejbenchmark@allegro.pl