Model:
dslim/bert-base-NER
bert-base-NER is a fine-tuned BERT model that is ready to use for Named Entity Recognition and achieves state-of-the-art performance on the NER task. It has been trained to recognize four types of entities: location (LOC), organization (ORG), person (PER), and miscellaneous (MISC).
Specifically, this model is a bert-base-cased model fine-tuned on the English version of the standard CoNLL-2003 Named Entity Recognition dataset.
If you would like to use a larger BERT-large model fine-tuned on the same dataset, a bert-large-NER version is also available.
You can use this model for NER via the Transformers pipeline.
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "My name is Wolfgang and I live in Berlin"

ner_results = nlp(example)
print(ner_results)
```
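For the example sentence above, the pipeline returns one dictionary per recognized token, roughly like the following (the scores shown are illustrative, not exact values):

```python
[{'entity': 'B-PER', 'score': 0.999, 'index': 4, 'word': 'Wolfgang', 'start': 11, 'end': 19},
 {'entity': 'B-LOC', 'score': 0.999, 'index': 9, 'word': 'Berlin', 'start': 34, 'end': 40}]
```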
Limitations and bias

This model was trained on entity-annotated news articles from a specific span of time, so it may not generalize well to all use cases across different domains. Furthermore, the model occasionally tags subword tokens as entities, so post-processing of the results may be needed to handle those cases.
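One lightweight way to do that post-processing is the pipeline's built-in aggregation_strategy option, which merges subword pieces back into whole-word entity groups. A minimal sketch:

```python
from transformers import pipeline

# aggregation_strategy="simple" groups subword tokens into whole entities,
# returning one 'entity_group' per span instead of per-token tags.
nlp = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
print(nlp("My name is Wolfgang and I live in Berlin"))
# e.g. [{'entity_group': 'PER', 'word': 'Wolfgang', ...},
#       {'entity_group': 'LOC', 'word': 'Berlin', ...}]
```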
This model was fine-tuned on the English version of the standard CoNLL-2003 Named Entity Recognition dataset.
The training dataset distinguishes between the beginning and the continuation of an entity, so that if there are back-to-back entities of the same type, the model can mark where the second entity begins. As in the dataset, each token is classified as one of the following classes (a small worked example follows the table):
Abbreviation | Description |
---|---|
O | Outside of a named entity |
B-MIS | Beginning of a miscellaneous entity right after another miscellaneous entity |
I-MIS | Miscellaneous entity |
B-PER | Beginning of a person's name right after another person's name |
I-PER | Person's name |
B-ORG | Beginning of an organization right after another organization |
I-ORG | Organization |
B-LOC | Beginning of a location right after another location |
I-LOC | Location |
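To make the scheme concrete, here is a hypothetical sentence tagged under these rules (the sentence and tags are illustrative, not drawn from the dataset):

Token | Tag | Reason |
---|---|---|
Alice | I-PER | first token of the first person entity |
Bob | B-PER | a new person entity starting immediately after another person entity |
visited | O | not part of any entity |
Berlin | I-LOC | a location entity not preceded by another location |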
The dataset was derived from the Reuters corpus, which consists of Reuters news stories. You can read more about how the dataset was created in the CoNLL-2003 paper.
Number of training examples per entity type:

Dataset | LOC | MISC | ORG | PER |
---|---|---|---|---|
Train | 7140 | 3438 | 6321 | 6600 |
Dev | 1837 | 922 | 1341 | 1842 |
Test | 1668 | 702 | 1661 | 1617 |
Number of articles, sentences, and tokens per dataset:

Dataset | Articles | Sentences | Tokens |
---|---|---|---|
Train | 946 | 14,987 | 203,621 |
Dev | 216 | 3,466 | 51,362 |
Test | 231 | 3,684 | 46,435 |
The model was trained on a single NVIDIA V100 GPU, using the hyperparameters recommended in the original BERT paper, and was trained and evaluated on the CoNLL-2003 NER task.
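A minimal sketch of that fine-tuning setup, assuming the conll2003 loader from the Hugging Face datasets library; the learning rate, batch size, and epoch count below are illustrative stand-ins for the BERT-paper hyperparameters, not the exact values used for this checkpoint:

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification,
                          TrainingArguments, Trainer)

dataset = load_dataset("conll2003")
label_list = dataset["train"].features["ner_tags"].feature.names  # O, B-PER, I-PER, ...
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_and_align_labels(examples):
    # Words may split into several word pieces; label only the first piece
    # and mask the rest (and the special tokens) with -100 so the loss skips them.
    enc = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    all_labels = []
    for i, tags in enumerate(examples["ner_tags"]):
        previous, labels = None, []
        for word_id in enc.word_ids(batch_index=i):
            if word_id is None or word_id == previous:
                labels.append(-100)
            else:
                labels.append(tags[word_id])
            previous = word_id
        all_labels.append(labels)
    enc["labels"] = all_labels
    return enc

tokenized = dataset.map(tokenize_and_align_labels, batched=True)
model = AutoModelForTokenClassification.from_pretrained("bert-base-cased",
                                                        num_labels=len(label_list))
args = TrainingArguments(
    output_dir="bert-base-ner-repro",
    learning_rate=5e-5,              # illustrative; the BERT paper sweeps {5e-5, 3e-5, 2e-5}
    per_device_train_batch_size=32,  # illustrative
    num_train_epochs=3,              # illustrative
)
trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized["train"],
                  eval_dataset=tokenized["validation"],
                  data_collator=DataCollatorForTokenClassification(tokenizer))
trainer.train()
```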
metric | dev | test |
---|---|---|
f1 | 95.1 | 91.3 |
precision | 95.0 | 90.7 |
recall | 95.3 | 91.9 |
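These are entity-level precision, recall, and F1 scores of the kind conventionally computed with the seqeval library. A minimal sketch with made-up tag sequences (not the actual evaluation code for this model):

```python
# pip install seqeval
from seqeval.metrics import precision_score, recall_score, f1_score

# Illustrative gold and predicted tag sequences, one inner list per sentence.
y_true = [["O", "I-PER", "O", "O", "I-LOC"], ["I-ORG", "O"]]
y_pred = [["O", "I-PER", "O", "O", "O"], ["I-ORG", "O"]]

# seqeval scores whole entity spans, not individual tokens.
print(precision_score(y_true, y_pred))  # 1.0: both predicted entities are correct
print(recall_score(y_true, y_pred))     # ~0.67: the location entity was missed
print(f1_score(y_true, y_pred))         # harmonic mean of the two
```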
The test metrics are slightly lower than the official Google BERT results, which encoded document context and experimented with a CRF. More on replicating the original results here.
```bibtex
@article{DBLP:journals/corr/abs-1810-04805,
  author        = {Jacob Devlin and Ming{-}Wei Chang and Kenton Lee and Kristina Toutanova},
  title         = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language Understanding},
  journal       = {CoRR},
  volume        = {abs/1810.04805},
  year          = {2018},
  url           = {http://arxiv.org/abs/1810.04805},
  archivePrefix = {arXiv},
  eprint        = {1810.04805},
  timestamp     = {Tue, 30 Oct 2018 20:39:56 +0100},
  biburl        = {https://dblp.org/rec/journals/corr/abs-1810-04805.bib},
  bibsource     = {dblp computer science bibliography, https://dblp.org}
}
```

```bibtex
@inproceedings{tjong-kim-sang-de-meulder-2003-introduction,
  title     = "Introduction to the {C}o{NLL}-2003 Shared Task: Language-Independent Named Entity Recognition",
  author    = "Tjong Kim Sang, Erik F. and De Meulder, Fien",
  booktitle = "Proceedings of the Seventh Conference on Natural Language Learning at {HLT}-{NAACL} 2003",
  year      = "2003",
  url       = "https://www.aclweb.org/anthology/W03-0419",
  pages     = "142--147",
}
```