
ScandiNER - Named Entity Recognition model for Scandinavian languages

This model is a fine-tuned version of NbAiLab/nb-bert-base for named entity recognition in the Scandinavian languages: Danish, Norwegian (both Bokmål and Nynorsk), Swedish, Icelandic and Faroese. It has been fine-tuned on the concatenation of DaNE, NorNE, SUC 3.0, and the Icelandic and Faroese parts of the WikiANN dataset. Since the underlying pretrained model was also trained on English data alongside the Scandinavian languages, it performs reasonably well on English sentences too.

The model will predict the following four entity types:

| Tag  | Name          | Description                                                        |
|------|---------------|--------------------------------------------------------------------|
| PER  | Person        | The name of a person (e.g., Birgitte and Mohammed)                 |
| LOC  | Location      | The name of a location (e.g., Tyskland and Djurgården)             |
| ORG  | Organisation  | The name of an organisation (e.g., Bunnpris and Landsbankinn)      |
| MISC | Miscellaneous | A named entity of a different kind (e.g., Ūjķnustu pund and Mona Lisa) |

Quick Start

You can use this model in a script as follows:

>>> from transformers import pipeline
>>> import pandas as pd
>>> ner = pipeline(task='ner', 
...                model='saattrupdan/nbailab-base-ner-scandi', 
...                aggregation_strategy='first')
>>> result = ner('Borghild kjøper seg inn i Bunnpris')
>>> pd.DataFrame.from_records(result)
  entity_group     score      word  start  end
0          PER  0.981257  Borghild      0    8
1          ORG  0.974099  Bunnpris     26   34
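
The start and end fields in the output are character offsets into the input string, so the entity surface forms can be recovered by slicing, with no extra dependencies. A minimal sketch using the offsets shown above (the record values are copied from the example output, so no model call is needed):

```python
# Recover entity surface forms from the character offsets returned by the
# pipeline above. These records are copied from the example output.
sentence = 'Borghild kjøper seg inn i Bunnpris'
records = [
    {'entity_group': 'PER', 'start': 0, 'end': 8},
    {'entity_group': 'ORG', 'start': 26, 'end': 34},
]

# Slice the original sentence with each record's offsets.
entities = [(r['entity_group'], sentence[r['start']:r['end']]) for r in records]
print(entities)  # → [('PER', 'Borghild'), ('ORG', 'Bunnpris')]
```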

Performance

The table below shows the micro-average F1 NER scores on the Scandinavian NER test datasets, compared with the current state of the art. The models were evaluated on each test set together with 9 bootstrapped versions of it; the means and 95% confidence intervals are shown:

| Model ID | DaNE | NorNE-NB | NorNE-NN | SUC 3.0 | WikiANN-IS | WikiANN-FO | Average |
|---|---|---|---|---|---|---|---|
| saattrupdan/nbailab-base-ner-scandi | 87.44 ± 0.81 | 91.06 ± 0.26 | 90.42 ± 0.61 | 88.37 ± 0.17 | 88.61 ± 0.41 | 90.22 ± 0.46 | 89.08 ± 0.46 |
| chcaa/da_dacy_large_trf | 83.61 ± 1.18 | 78.90 ± 0.49 | 72.62 ± 0.58 | 53.35 ± 0.17 | 50.57 ± 0.46 | 51.72 ± 0.52 | 63.00 ± 0.57 |
| RecordedFuture/Swedish-NER | 64.09 ± 0.97 | 61.74 ± 0.50 | 56.67 ± 0.79 | 66.60 ± 0.27 | 34.54 ± 0.73 | 42.16 ± 0.83 | 53.32 ± 0.69 |
| Maltehb/danish-bert-botxo-ner-dane | 69.25 ± 1.17 | 60.57 ± 0.27 | 35.60 ± 1.19 | 38.37 ± 0.26 | 21.00 ± 0.57 | 27.88 ± 0.48 | 40.92 ± 0.64 |
| Maltehb/-l-ctra-danish-electra-small-uncased-ner-dane | 70.41 ± 1.19 | 48.76 ± 0.70 | 27.58 ± 0.61 | 35.39 ± 0.38 | 26.22 ± 0.52 | 28.30 ± 0.29 | 39.70 ± 0.61 |
| radbrt/nb_nocy_trf | 56.82 ± 1.63 | 68.20 ± 0.75 | 69.22 ± 1.04 | 31.63 ± 0.29 | 20.32 ± 0.45 | 12.91 ± 0.50 | 38.08 ± 0.75 |

On top of its high accuracy, the model is also smaller and faster than the previous state of the art:

| Model ID | Samples/second | Model size |
|---|---|---|
| saattrupdan/nbailab-base-ner-scandi | 4.16 ± 0.18 | 676 MB |
| chcaa/da_dacy_large_trf | 0.65 ± 0.01 | 2,090 MB |

Training Procedure

Training Hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 2e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 32
  • optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 90135.90000000001
  • num_epochs: 1000
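
The total train batch size above follows from the per-device batch size and gradient accumulation: each optimizer step accumulates gradients over several forward passes before updating. A quick check of that arithmetic:

```python
# Effective (total) train batch size = per-device batch size × accumulation steps.
# Values are taken from the hyperparameter list above.
train_batch_size = 8
gradient_accumulation_steps = 4

total_train_batch_size = train_batch_size * gradient_accumulation_steps
print(total_train_batch_size)  # → 32, matching the listed total_train_batch_size
```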

Training Results

| Training Loss | Epoch | Step | Validation Loss | Micro F1 | Micro F1 No Misc |
|---|---|---|---|---|---|
| 0.6682 | 1.0 | 2816 | 0.0872 | 0.6916 | 0.7306 |
| 0.0684 | 2.0 | 5632 | 0.0464 | 0.8167 | 0.8538 |
| 0.0444 | 3.0 | 8448 | 0.0367 | 0.8485 | 0.8783 |
| 0.0349 | 4.0 | 11264 | 0.0316 | 0.8684 | 0.8920 |
| 0.0282 | 5.0 | 14080 | 0.0290 | 0.8820 | 0.9033 |
| 0.0231 | 6.0 | 16896 | 0.0283 | 0.8854 | 0.9060 |
| 0.0189 | 7.0 | 19712 | 0.0253 | 0.8964 | 0.9156 |
| 0.0155 | 8.0 | 22528 | 0.0260 | 0.9016 | 0.9201 |
| 0.0123 | 9.0 | 25344 | 0.0266 | 0.9059 | 0.9233 |
| 0.0098 | 10.0 | 28160 | 0.0280 | 0.9091 | 0.9279 |
| 0.008 | 11.0 | 30976 | 0.0309 | 0.9093 | 0.9287 |
| 0.0065 | 12.0 | 33792 | 0.0313 | 0.9103 | 0.9284 |
| 0.0053 | 13.0 | 36608 | 0.0322 | 0.9078 | 0.9257 |
| 0.0046 | 14.0 | 39424 | 0.0343 | 0.9075 | 0.9256 |
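
Training stops after epoch 14 even though num_epochs was set to 1000, which suggests early stopping on the validation score; the natural checkpoint to keep is then the epoch with the highest validation Micro F1. A small sketch that picks it out of the results above (the `(epoch, micro_f1)` pairs are copied from the table):

```python
# (epoch, validation Micro F1) pairs copied from the training results table.
history = [
    (1, 0.6916), (2, 0.8167), (3, 0.8485), (4, 0.8684), (5, 0.8820),
    (6, 0.8854), (7, 0.8964), (8, 0.9016), (9, 0.9059), (10, 0.9091),
    (11, 0.9093), (12, 0.9103), (13, 0.9078), (14, 0.9075),
]

# Pick the epoch whose checkpoint scored highest on validation Micro F1.
best_epoch, best_f1 = max(history, key=lambda pair: pair[1])
print(best_epoch, best_f1)  # → 12 0.9103
```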

Framework Versions

  • Transformers 4.10.3
  • Pytorch 1.9.0+cu102
  • Datasets 1.12.1
  • Tokenizers 0.10.3