bert-base-NER

Model description

bert-base-NER is a fine-tuned BERT model that is ready to use for Named Entity Recognition and achieves state-of-the-art performance on the NER task. It has been trained to recognize four types of entities: location (LOC), organization (ORG), person (PER), and miscellaneous (MISC).

Specifically, this model is a bert-base-cased model fine-tuned on the English version of the standard CoNLL-2003 Named Entity Recognition dataset.

If you would like to use a larger BERT-large model fine-tuned on the same dataset, a bert-large-NER version is also available.

Intended uses & limitations

How to use

You can use this model with the Transformers pipeline for NER.

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

# Wrap them in a token-classification (NER) pipeline
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "My name is Wolfgang and I live in Berlin"

ner_results = nlp(example)
print(ner_results)
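
For the example above, this should print one prediction per recognized token. The exact scores vary by library version, but the output looks roughly like this (illustrative):

[{'entity': 'B-PER', 'score': 0.998, 'index': 4, 'word': 'Wolfgang', 'start': 11, 'end': 19},
 {'entity': 'B-LOC', 'score': 0.999, 'index': 9, 'word': 'Berlin', 'start': 34, 'end': 40}]
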
Limitations and bias

This model is limited by its training dataset of entity-annotated news articles from a specific span of time, so it may not generalize well to all use cases in different domains. Furthermore, the model occasionally tags subword tokens as entities, and post-processing of the results may be necessary to handle those cases (see the sketch below).
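
One way to handle the subword issue is to let the pipeline merge subword pieces back into whole entity spans. A minimal sketch, assuming a Transformers version recent enough to support the aggregation_strategy parameter:

from transformers import pipeline

# "simple" merges consecutive tokens of the same entity type into one span;
# grouped results carry an "entity_group" key instead of per-token "entity" labels
nlp = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

for span in nlp("My name is Wolfgang and I live in Berlin"):
    print(span["entity_group"], span["word"], span["start"], span["end"])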

Training data

This model was fine-tuned on the English version of the standard CoNLL-2003 Named Entity Recognition dataset.

The training dataset distinguishes between the beginning and continuation of an entity, so that when there are back-to-back entities of the same type, the model can output where the second entity begins. As in the dataset, each token is classified as one of the following classes (an illustrative tag sequence follows the table):

Abbreviation  Description
O             Outside of a named entity
B-MIS         Beginning of a miscellaneous entity right after another miscellaneous entity
I-MIS         Miscellaneous entity
B-PER         Beginning of a person's name right after another person's name
I-PER         Person's name
B-ORG         Beginning of an organization right after another organization
I-ORG         Organization
B-LOC         Beginning of a location right after another location
I-LOC         Location
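
To make the B-/I- convention concrete, here is a hypothetical tag sequence (not taken from the dataset). "Mary" receives B-PER rather than I-PER because it starts a new person entity immediately after another person entity:

Text:  John   Smith  Mary   Jones  visited  Berlin
Tags:  B-PER  I-PER  B-PER  I-PER  O        B-LOC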

CoNLL-2003 English Dataset Statistics

This dataset was derived from the Reuters corpus, which consists of Reuters news stories. You can read more about how the dataset was created in the CoNLL-2003 paper.

# of training examples per entity type

Dataset  LOC   MISC  ORG   PER
Train    7140  3438  6321  6600
Dev      1837  922   1341  1842
Test     1668  702   1661  1617

# of articles/sentences/tokens per dataset

Dataset  Articles  Sentences  Tokens
Train    946       14,987     203,621
Dev      216       3,466      51,362
Test     231       3,684      46,435

Training procedure

This model was trained on a single NVIDIA V100 GPU with the hyperparameters recommended in the original BERT paper, and was trained and evaluated on the CoNLL-2003 NER task.
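
A minimal fine-tuning sketch in that spirit, assuming the Hugging Face datasets and Trainer APIs and hyperparameters drawn from the ranges recommended in the original BERT paper (batch size 16 to 32, learning rate 2e-5 to 5e-5, 2 to 4 epochs); this is an approximation, not the exact training script:

from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification,
                          TrainingArguments, Trainer)

dataset = load_dataset("conll2003")
label_list = dataset["train"].features["ner_tags"].feature.names

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(label_list))

def tokenize_and_align(examples):
    # Re-tokenize the pre-split words and align the word-level NER tags
    # with the resulting subword tokens
    tokenized = tokenizer(examples["tokens"], truncation=True,
                          is_split_into_words=True)
    labels = []
    for i, tags in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        prev, ids = None, []
        for wid in word_ids:
            if wid is None:
                ids.append(-100)       # special tokens: ignored by the loss
            elif wid != prev:
                ids.append(tags[wid])  # label only the first subword of a word
            else:
                ids.append(-100)       # continuation subwords: ignored
            prev = wid
        labels.append(ids)
    tokenized["labels"] = labels
    return tokenized

tokenized = dataset.map(tokenize_and_align, batched=True)

args = TrainingArguments(
    output_dir="bert-base-ner",
    learning_rate=3e-5,              # within the BERT paper's 2e-5 to 5e-5 range
    per_device_train_batch_size=32,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()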

Eval results

metric     dev   test
f1         95.1  91.3
precision  95.0  90.7
recall     95.3  91.9
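
These are entity-level scores in the CoNLL sense: a prediction counts as correct only when both the entity's boundaries and its type match the gold annotation. A minimal sketch of computing such scores with the seqeval library (the de facto CoNLL scorer); the tag sequences are illustrative:

from seqeval.metrics import f1_score, precision_score, recall_score

# One list of tags per sentence; the gold sequence has two entities (PER, LOC)
y_true = [["B-PER", "I-PER", "O", "B-LOC"]]
# The prediction truncates the PER span, so only LOC is an exact match
y_pred = [["B-PER", "O", "O", "B-LOC"]]

print(precision_score(y_true, y_pred))  # 0.5 (1 of 2 predicted entities correct)
print(recall_score(y_true, y_pred))     # 0.5 (1 of 2 gold entities recovered)
print(f1_score(y_true, y_pred))         # 0.5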

The test metrics are slightly lower than the official Google BERT results, which encoded document context and experimented with a CRF. Read more on how to reproduce the original results here.

BibTeX entry and citation info

@article{DBLP:journals/corr/abs-1810-04805,
  author    = {Jacob Devlin and
               Ming{-}Wei Chang and
               Kenton Lee and
               Kristina Toutanova},
  title     = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language
               Understanding},
  journal   = {CoRR},
  volume    = {abs/1810.04805},
  year      = {2018},
  url       = {http://arxiv.org/abs/1810.04805},
  archivePrefix = {arXiv},
  eprint    = {1810.04805},
  timestamp = {Tue, 30 Oct 2018 20:39:56 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1810-04805.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

@inproceedings{tjong-kim-sang-de-meulder-2003-introduction,
    title = "Introduction to the {C}o{NLL}-2003 Shared Task: Language-Independent Named Entity Recognition",
    author = "Tjong Kim Sang, Erik F.  and
      De Meulder, Fien",
    booktitle = "Proceedings of the Seventh Conference on Natural Language Learning at {HLT}-{NAACL} 2003",
    year = "2003",
    url = "https://www.aclweb.org/anthology/W03-0419",
    pages = "142--147",
}