Model:
dslim/bert-base-NER
bert-base-NER is a fine-tuned BERT model that is ready to use for Named Entity Recognition and achieves state-of-the-art performance on the NER task. It has been trained to recognize four types of entities: location (LOC), organization (ORG), person (PER), and miscellaneous (MISC).
Specifically, this model is a bert-base-cased model fine-tuned on the English version of the standard CoNLL-2003 Named Entity Recognition dataset.
If you would like to use a larger BERT-large model fine-tuned on the same dataset, a bert-large-NER version is also available.
You can use this model for NER via the Transformers pipeline.
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "My name is Wolfgang and I live in Berlin"

ner_results = nlp(example)
print(ner_results)
```
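For the example sentence above, the pipeline returns one dictionary per recognized token, roughly like the following (the scores shown are illustrative, not exact values):

```python
[{'entity': 'B-PER', 'score': 0.999, 'index': 4, 'word': 'Wolfgang', 'start': 11, 'end': 19},
 {'entity': 'B-LOC', 'score': 0.999, 'index': 9, 'word': 'Berlin', 'start': 34, 'end': 40}]
```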
Limitations and bias

This model was trained on entity-annotated news articles from a specific span of time, so it may not generalize well to all use cases across different domains. Furthermore, the model occasionally tags subword tokens as entities, so post-processing of the results may be needed to handle those cases.
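One lightweight way to do that post-processing is the pipeline's built-in aggregation_strategy option, which merges subword pieces back into whole-word entity groups. A minimal sketch:

```python
from transformers import pipeline

# aggregation_strategy="simple" groups subword tokens into whole entities,
# returning one 'entity_group' per span instead of per-token tags.
nlp = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
print(nlp("My name is Wolfgang and I live in Berlin"))
# e.g. [{'entity_group': 'PER', 'word': 'Wolfgang', ...},
#       {'entity_group': 'LOC', 'word': 'Berlin', ...}]
```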
This model was fine-tuned on the English version of the standard CoNLL-2003 Named Entity Recognition dataset.
The training dataset distinguishes between the beginning and the continuation of an entity, so that if there are back-to-back entities of the same type, the model can mark where the second entity begins. As in the dataset, each token is classified as one of the following classes (a small worked example follows the table):
Abbreviation | Description |
---|---|
O | Outside of a named entity |
B-MIS | Beginning of a miscellaneous entity right after another miscellaneous entity |
I-MIS | Miscellaneous entity |
B-PER | Beginning of a person's name right after another person's name |
I-PER | Person's name |
B-ORG | Beginning of an organization right after another organization |
I-ORG | Organization |
B-LOC | Beginning of a location right after another location |
I-LOC | Location |
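To make the scheme concrete, here is a hypothetical sentence tagged under these rules (the sentence and tags are illustrative, not drawn from the dataset):

Token | Tag | Reason |
---|---|---|
Alice | I-PER | first token of the first person entity |
Bob | B-PER | a new person entity starting immediately after another person entity |
visited | O | not part of any entity |
Berlin | I-LOC | a location entity not preceded by another location |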
The dataset was derived from the Reuters corpus, which consists of Reuters news stories. You can read more about how the dataset was created in the CoNLL-2003 paper.
Number of training examples per entity type:

Dataset | LOC | MISC | ORG | PER |
---|---|---|---|---|
Train | 7140 | 3438 | 6321 | 6600 |
Dev | 1837 | 922 | 1341 | 1842 |
Test | 1668 | 702 | 1661 | 1617 |
Number of articles, sentences, and tokens per dataset:

Dataset | Articles | Sentences | Tokens |
---|---|---|---|
Train | 946 | 14,987 | 203,621 |
Dev | 216 | 3,466 | 51,362 |
Test | 231 | 3,684 | 46,435 |
The model was trained on a single NVIDIA V100 GPU, using the hyperparameters recommended in the original BERT paper, and was trained and evaluated on the CoNLL-2003 NER task.
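A minimal sketch of that fine-tuning setup, assuming the conll2003 loader from the Hugging Face datasets library; the learning rate, batch size, and epoch count below are illustrative stand-ins for the BERT-paper hyperparameters, not the exact values used for this checkpoint:

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification,
                          TrainingArguments, Trainer)

dataset = load_dataset("conll2003")
label_list = dataset["train"].features["ner_tags"].feature.names  # O, B-PER, I-PER, ...
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_and_align_labels(examples):
    # Words may split into several word pieces; label only the first piece
    # and mask the rest (and the special tokens) with -100 so the loss skips them.
    enc = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    all_labels = []
    for i, tags in enumerate(examples["ner_tags"]):
        previous, labels = None, []
        for word_id in enc.word_ids(batch_index=i):
            if word_id is None or word_id == previous:
                labels.append(-100)
            else:
                labels.append(tags[word_id])
            previous = word_id
        all_labels.append(labels)
    enc["labels"] = all_labels
    return enc

tokenized = dataset.map(tokenize_and_align_labels, batched=True)
model = AutoModelForTokenClassification.from_pretrained("bert-base-cased",
                                                        num_labels=len(label_list))
args = TrainingArguments(
    output_dir="bert-base-ner-repro",
    learning_rate=5e-5,              # illustrative; the BERT paper sweeps {5e-5, 3e-5, 2e-5}
    per_device_train_batch_size=32,  # illustrative
    num_train_epochs=3,              # illustrative
)
trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized["train"],
                  eval_dataset=tokenized["validation"],
                  data_collator=DataCollatorForTokenClassification(tokenizer))
trainer.train()
```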
metric | dev | test |
---|---|---|
f1 | 95.1 | 91.3 |
precision | 95.0 | 90.7 |
recall | 95.3 | 91.9 |
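These are entity-level precision, recall, and F1 scores of the kind conventionally computed with the seqeval library. A minimal sketch with made-up tag sequences (not the actual evaluation code for this model):

```python
# pip install seqeval
from seqeval.metrics import precision_score, recall_score, f1_score

# Illustrative gold and predicted tag sequences, one inner list per sentence.
y_true = [["O", "I-PER", "O", "O", "I-LOC"], ["I-ORG", "O"]]
y_pred = [["O", "I-PER", "O", "O", "O"], ["I-ORG", "O"]]

# seqeval scores whole entity spans, not individual tokens.
print(precision_score(y_true, y_pred))  # 1.0: both predicted entities are correct
print(recall_score(y_true, y_pred))     # ~0.67: the location entity was missed
print(f1_score(y_true, y_pred))         # harmonic mean of the two
```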
The test metrics are slightly lower than the official Google BERT results, which encoded document context and experimented with a CRF. More on replicating the original results here.
```bibtex
@article{DBLP:journals/corr/abs-1810-04805,
  author        = {Jacob Devlin and Ming{-}Wei Chang and Kenton Lee and Kristina Toutanova},
  title         = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language Understanding},
  journal       = {CoRR},
  volume        = {abs/1810.04805},
  year          = {2018},
  url           = {http://arxiv.org/abs/1810.04805},
  archivePrefix = {arXiv},
  eprint        = {1810.04805},
  timestamp     = {Tue, 30 Oct 2018 20:39:56 +0100},
  biburl        = {https://dblp.org/rec/journals/corr/abs-1810-04805.bib},
  bibsource     = {dblp computer science bibliography, https://dblp.org}
}
```

```bibtex
@inproceedings{tjong-kim-sang-de-meulder-2003-introduction,
  title     = "Introduction to the {C}o{NLL}-2003 Shared Task: Language-Independent Named Entity Recognition",
  author    = "Tjong Kim Sang, Erik F. and De Meulder, Fien",
  booktitle = "Proceedings of the Seventh Conference on Natural Language Learning at {HLT}-{NAACL} 2003",
  year      = "2003",
  url       = "https://www.aclweb.org/anthology/W03-0419",
  pages     = "142--147",
}
```