SciBERT

This is the pretrained model introduced in SciBERT: A Pretrained Language Model for Scientific Text. It is a BERT model trained on scientific text.

The training corpus is drawn from Semantic Scholar. It contains 1.14M papers and 3.1B tokens. We trained on the full text of the papers, not just the abstracts.

SciBERT has its own WordPiece vocabulary (scivocab), built from the training corpus. We trained both cased and uncased versions.
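To illustrate why a vocabulary built from the target corpus can help, here is a minimal sketch of greedy longest-match-first WordPiece splitting, the scheme commonly used by BERT-style tokenizers. The function name `wordpiece_tokenize` and both toy vocabularies are hypothetical and invented for this example; they are not taken from scivocab or from any released vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into WordPiece tokens using greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry a "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabularies (assumptions for illustration, not real vocab entries).
general_vocab = {"im", "##mun", "##o", "##glob", "##ulin"}
science_vocab = general_vocab | {"immunoglobulin"}

print(wordpiece_tokenize("immunoglobulin", general_vocab))
print(wordpiece_tokenize("immunoglobulin", science_vocab))
```

With the toy general-purpose vocabulary the term is broken into five fragments, while the toy science vocabulary keeps it as a single token. A domain vocabulary tends to produce shorter sequences and fewer fragmented terms on in-domain text.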

The available models are:

scibert_scivocab_cased
scibert_scivocab_uncased
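The models listed above could be loaded along the following lines. This is a sketch that assumes the Hugging Face `transformers` library is installed and that the weights are hosted under an `allenai` namespace on the model hub; neither is stated in this document. The helpers `hub_id` and `load_scibert` are hypothetical names introduced for this example.

```python
# Model names as listed in this document.
VARIANTS = ("scibert_scivocab_cased", "scibert_scivocab_uncased")


def hub_id(variant, namespace="allenai"):
    """Build a hub identifier; the default namespace is an assumption."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown SciBERT variant: {variant!r}")
    return f"{namespace}/{variant}"


def load_scibert(variant="scibert_scivocab_uncased"):
    """Return (tokenizer, model) for the chosen variant.

    The import is deferred so that this module can be imported without
    transformers installed; calling the function needs the library and
    network access on first use.
    """
    from transformers import AutoModel, AutoTokenizer

    name = hub_id(variant)
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name)
    return tokenizer, model


print(hub_id("scibert_scivocab_cased"))
```

Calling `load_scibert()` downloads the weights the first time it runs. Use the uncased variant when the input text will be lowercased, and the cased variant when capitalization carries meaning, for example in gene or chemical names.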

The original repository can be found here.

If you use these models, please cite the following paper:

@inproceedings{beltagy-etal-2019-scibert,
    title = "SciBERT: A Pretrained Language Model for Scientific Text",
    author = "Beltagy, Iz  and Lo, Kyle  and Cohan, Arman",
    booktitle = "EMNLP",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/D19-1371"
}

Authors:

Allen Institute for AI

Dataset size:

841.27 MB