Model:
saattrupdan/nbailab-base-ner-scandi
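This model is a fine-tuned version of NbAiLab/nb-bert-base for Named Entity Recognition in the Scandinavian languages: Danish, Norwegian (both Bokmål and Nynorsk), Swedish, Icelandic and Faroese. It has been fine-tuned on the concatenation of DaNE, NorNE, SUC 3.0 and the Icelandic and Faroese parts of the WikiANN dataset. Because the pretrained model was trained on English data as well as Scandinavian data, it also performs reasonably well on English sentences.
The model predicts the following four entity types: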
Tag | Name | Description |
---|---|---|
PER | Person | The name of a person (e.g., Birgitte and Mohammed ) |
LOC | Location | The name of a location (e.g., Tyskland and Djurgården ) |
ORG | Organisation | The name of an organisation (e.g., Bunnpris and Landsbankinn ) |
MISC | Miscellaneous | A named entity of a different kind (e.g., Ūjķnustu pund and Mona Lisa ) |
You can use this model in your scripts as follows:

    >>> from transformers import pipeline
    >>> import pandas as pd
    >>> ner = pipeline(task='ner',
    ...                model='saattrupdan/nbailab-base-ner-scandi',
    ...                aggregation_strategy='first')
    >>> result = ner('Borghild kjøper seg inn i Bunnpris')
    >>> pd.DataFrame.from_records(result)
      entity_group     score      word  start  end
    0          PER  0.981257  Borghild      0    8
    1          ORG  0.974099  Bunnpris     26   34

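With an aggregation strategy set, the pipeline returns a list of dictionaries, one per entity, as the example output shows. A minimal sketch of post-processing that list is given here; `group_entities` is a hypothetical helper, not part of `transformers`, and the records are copied from the example output rather than produced by running the model.

```python
from collections import defaultdict


def group_entities(result):
    """Group aggregated NER predictions by their entity_group tag.

    `result` is assumed to be a list of dicts with at least the keys
    'entity_group' and 'word', as in the example output.
    """
    grouped = defaultdict(list)
    for entity in result:
        grouped[entity["entity_group"]].append(entity["word"])
    return dict(grouped)


# Records copied from the example output, not a fresh model run.
text = "Borghild kjøper seg inn i Bunnpris"
result = [
    {"entity_group": "PER", "score": 0.981257, "word": "Borghild", "start": 0, "end": 8},
    {"entity_group": "ORG", "score": 0.974099, "word": "Bunnpris", "start": 26, "end": 34},
]

grouped = group_entities(result)
print(grouped)

# The start/end fields are character offsets into the input string.
spans = [text[e["start"]:e["end"]] for e in result]
print(spans)
```

The `start` and `end` offsets make it possible to recover each entity from the original text, which is useful when the `word` field has been normalised by the tokenizer.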
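The table below reports Micro-F1 NER performance on Scandinavian NER test datasets, compared with the current state-of-the-art models. Each model was evaluated on the test set and on 9 bootstrapped versions of it; the values shown are the mean and the 95% confidence interval: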
Model ID | DaNE | NorNE-NB | NorNE-NN | SUC 3.0 | WikiANN-IS | WikiANN-FO | Average |
---|---|---|---|---|---|---|---|
saattrupdan/nbailab-base-ner-scandi | 87.44 ± 0.81 | 91.06 ± 0.26 | 90.42 ± 0.61 | 88.37 ± 0.17 | 88.61 ± 0.41 | 90.22 ± 0.46 | 89.08 ± 0.46 |
chcaa/da_dacy_large_trf | 83.61 ± 1.18 | 78.90 ± 0.49 | 72.62 ± 0.58 | 53.35 ± 0.17 | 50.57 ± 0.46 | 51.72 ± 0.52 | 63.00 ± 0.57 |
RecordedFuture/Swedish-NER | 64.09 ± 0.97 | 61.74 ± 0.50 | 56.67 ± 0.79 | 66.60 ± 0.27 | 34.54 ± 0.73 | 42.16 ± 0.83 | 53.32 ± 0.69 |
Maltehb/danish-bert-botxo-ner-dane | 69.25 ± 1.17 | 60.57 ± 0.27 | 35.60 ± 1.19 | 38.37 ± 0.26 | 21.00 ± 0.57 | 27.88 ± 0.48 | 40.92 ± 0.64 |
Maltehb/-l-ctra-danish-electra-small-uncased-ner-dane | 70.41 ± 1.19 | 48.76 ± 0.70 | 27.58 ± 0.61 | 35.39 ± 0.38 | 26.22 ± 0.52 | 28.30 ± 0.29 | 39.70 ± 0.61 |
radbrt/nb_nocy_trf | 56.82 ± 1.63 | 68.20 ± 0.75 | 69.22 ± 1.04 | 31.63 ± 0.29 | 20.32 ± 0.45 | 12.91 ± 0.50 | 38.08 ± 0.75 |
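The evaluation protocol described above (the test set plus 9 bootstrapped versions, summarised as a mean and a 95% confidence interval) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' evaluation code: the per-sentence counts are made-up toy data, the interval uses a normal approximation (1.96 standard errors), and the function names are hypothetical.

```python
import random
import statistics


def micro_f1(counts):
    """Micro-averaged F1 from (true_pos, false_pos, false_neg) tuples."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_micro_f1(counts, n_bootstraps=9, seed=0):
    """Score the original test set plus n_bootstraps resampled versions."""
    rng = random.Random(seed)
    scores = [micro_f1(counts)]  # the original test set comes first
    for _ in range(n_bootstraps):
        # Resample sentences with replacement, keeping the set size fixed.
        sample = rng.choices(counts, k=len(counts))
        scores.append(micro_f1(sample))
    mean = statistics.mean(scores)
    # Assumption: normal-approximation interval around the mean.
    half_width = 1.96 * statistics.stdev(scores) / len(scores) ** 0.5
    return mean, half_width, scores


# Toy per-sentence entity counts (hypothetical, for illustration only).
toy_counts = [(2, 0, 1), (1, 1, 0), (3, 0, 0), (0, 1, 1)]
mean, half_width, scores = bootstrap_micro_f1(toy_counts)
print(f"{100 * mean:.2f} ± {100 * half_width:.2f}")
```

Resampling whole sentences rather than individual tokens keeps entity spans intact, which matters because NER scores are computed per entity.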
In addition to its high accuracy, it is also smaller and faster than the previous state-of-the-art model:
Model ID | Samples/second | Model size |
---|---|---|
saattrupdan/nbailab-base-ner-scandi | 4.16 ± 0.18 | 676 MB |
chcaa/da_dacy_large_trf | 0.65 ± 0.01 | 2,090 MB |
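To put the two rows in proportion, the ratios can be worked out directly from the table. The short calculation below uses only the point estimates and ignores the reported uncertainties, so the results are approximate.

```python
# Point estimates copied from the table above (uncertainties ignored).
samples_per_second = {
    "saattrupdan/nbailab-base-ner-scandi": 4.16,
    "chcaa/da_dacy_large_trf": 0.65,
}
model_size_mb = {
    "saattrupdan/nbailab-base-ner-scandi": 676,
    "chcaa/da_dacy_large_trf": 2090,
}

ours = "saattrupdan/nbailab-base-ner-scandi"
baseline = "chcaa/da_dacy_large_trf"

speedup = samples_per_second[ours] / samples_per_second[baseline]
size_ratio = model_size_mb[baseline] / model_size_mb[ours]

print(f"{speedup:.1f}x faster, {size_ratio:.1f}x smaller")
# prints "6.4x faster, 3.1x smaller"
```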
The following results were recorded after each training epoch:
Training Loss | Epoch | Step | Validation Loss | Micro F1 | Micro F1 No Misc |
---|---|---|---|---|---|
0.6682 | 1.0 | 2816 | 0.0872 | 0.6916 | 0.7306 |
0.0684 | 2.0 | 5632 | 0.0464 | 0.8167 | 0.8538 |
0.0444 | 3.0 | 8448 | 0.0367 | 0.8485 | 0.8783 |
0.0349 | 4.0 | 11264 | 0.0316 | 0.8684 | 0.8920 |
0.0282 | 5.0 | 14080 | 0.0290 | 0.8820 | 0.9033 |
0.0231 | 6.0 | 16896 | 0.0283 | 0.8854 | 0.9060 |
0.0189 | 7.0 | 19712 | 0.0253 | 0.8964 | 0.9156 |
0.0155 | 8.0 | 22528 | 0.0260 | 0.9016 | 0.9201 |
0.0123 | 9.0 | 25344 | 0.0266 | 0.9059 | 0.9233 |
0.0098 | 10.0 | 28160 | 0.0280 | 0.9091 | 0.9279 |
0.0080 | 11.0 | 30976 | 0.0309 | 0.9093 | 0.9287 |
0.0065 | 12.0 | 33792 | 0.0313 | 0.9103 | 0.9284 |
0.0053 | 13.0 | 36608 | 0.0322 | 0.9078 | 0.9257 |
0.0046 | 14.0 | 39424 | 0.0343 | 0.9075 | 0.9256 |
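One way to read the table is to ask which epoch looks best under each metric. The sketch below copies the validation columns from the table and picks the best epoch per criterion. Which checkpoint was actually released is not stated here, so this is only an illustration of how the numbers can be compared.

```python
# (epoch, validation_loss, micro_f1, micro_f1_no_misc), copied from the table.
history = [
    (1, 0.0872, 0.6916, 0.7306),
    (2, 0.0464, 0.8167, 0.8538),
    (3, 0.0367, 0.8485, 0.8783),
    (4, 0.0316, 0.8684, 0.8920),
    (5, 0.0290, 0.8820, 0.9033),
    (6, 0.0283, 0.8854, 0.9060),
    (7, 0.0253, 0.8964, 0.9156),
    (8, 0.0260, 0.9016, 0.9201),
    (9, 0.0266, 0.9059, 0.9233),
    (10, 0.0280, 0.9091, 0.9279),
    (11, 0.0309, 0.9093, 0.9287),
    (12, 0.0313, 0.9103, 0.9284),
    (13, 0.0322, 0.9078, 0.9257),
    (14, 0.0343, 0.9075, 0.9256),
]

best_by_loss = min(history, key=lambda row: row[1])
best_by_f1 = max(history, key=lambda row: row[2])
best_by_f1_no_misc = max(history, key=lambda row: row[3])

print("lowest validation loss: epoch", best_by_loss[0])
print("highest micro F1: epoch", best_by_f1[0])
print("highest micro F1 (no MISC): epoch", best_by_f1_no_misc[0])
```

Validation loss reaches its minimum at epoch 7, while micro F1 keeps improving until epoch 12 and then declines slightly, so the choice of checkpoint depends on which criterion is used.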