Model:
Babelscape/wikineural-multilingual-ner
This is the model card for the EMNLP 2021 paper WikiNEuRal: Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER. We fine-tuned a multilingual language model (mBERT) for 3 epochs on our WikiNEuRal dataset for Named Entity Recognition (NER). The resulting multilingual NER model supports the 9 languages covered by WikiNEuRal (German, English, Spanish, French, Italian, Dutch, Polish, Portuguese, Russian), and it was trained jointly on all 9 languages.
If you use this model, please cite this work in your paper:
@inproceedings{tedeschi-etal-2021-wikineural-combined,
    title = "{W}iki{NE}u{R}al: {C}ombined Neural and Knowledge-based Silver Data Creation for Multilingual {NER}",
    author = "Tedeschi, Simone and Maiorca, Valentino and Campolungo, Niccol{\`o} and Cecconi, Francesco and Navigli, Roberto",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-emnlp.215",
    pages = "2521--2533",
    abstract = "Multilingual Named Entity Recognition (NER) is a key intermediate task which is needed in many areas of NLP. In this paper, we address the well-known issue of data scarcity in NER, especially relevant when moving to a multilingual scenario, and go beyond current approaches to the creation of multilingual silver data for the task. We exploit the texts of Wikipedia and introduce a new methodology based on the effective combination of knowledge-based approaches and neural models, together with a novel domain adaptation technique, to produce high-quality training corpora for NER. We evaluate our datasets extensively on standard benchmarks for NER, yielding substantial improvements up to 6 span-based F1-score points over previous state-of-the-art systems for data creation.",
}
The repository for the original paper can be found at: https://github.com/Babelscape/wikineural .
You can perform Named Entity Recognition with the Transformers pipeline:

    from transformers import AutoTokenizer, AutoModelForTokenClassification
    from transformers import pipeline

    tokenizer = AutoTokenizer.from_pretrained("Babelscape/wikineural-multilingual-ner")
    model = AutoModelForTokenClassification.from_pretrained("Babelscape/wikineural-multilingual-ner")

    nlp = pipeline("ner", model=model, tokenizer=tokenizer, grouped_entities=True)
    example = "My name is Wolfgang and I live in Berlin"

    ner_results = nlp(example)
    print(ner_results)
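The `grouped_entities=True` flag (superseded in recent Transformers versions by `aggregation_strategy="simple"`) merges token-level B-/I- predictions into whole entity spans. A minimal sketch of that kind of BIO-tag aggregation, using illustrative token/tag pairs rather than real pipeline output:

```python
# Minimal sketch of BIO-tag aggregation, similar in spirit to what
# grouped_entities=True does inside the pipeline. The token/tag pairs
# below are illustrative, not actual model output.

def group_bio_entities(tokens, tags):
    """Merge B-/I- tagged tokens into (entity_type, text) spans."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag starts a new span, closing any open one.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type extends the open span.
            current_tokens.append(token)
        else:
            # An "O" tag (or an inconsistent I- tag) ends the open span.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities

tokens = ["My", "name", "is", "Wolfgang", "and", "I", "live", "in", "Berlin"]
tags = ["O", "O", "O", "B-PER", "O", "O", "O", "O", "B-LOC"]
print(group_bio_entities(tokens, tags))  # → [('PER', 'Wolfgang'), ('LOC', 'Berlin')]
```

In practice you would read the spans directly from `ner_results`; this helper only illustrates how token-level predictions become entity-level ones.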
This model was trained on WikiNEuRal, a state-of-the-art dataset for multilingual NER derived automatically from Wikipedia. Therefore, it may not generalize well to all textual genres (e.g. news). On the other hand, models trained only on news articles (e.g. only on CoNLL03) have been shown to obtain much lower scores on encyclopedic articles. To obtain a more robust system, we recommend training a system on the combination of WikiNEuRal with other datasets (e.g. WikiNEuRal + CoNLL).
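One simple way to build such a combined training set is to concatenate the two corpora and shuffle them, assuming both are already in the same token/BIO-tag format. The mini-corpora and field names below are illustrative, not the real dataset files:

```python
import random

# Illustrative mini-corpora in a shared token/tag format; in practice these
# would be the full WikiNEuRal and CoNLL03 training splits.
wikineural = [
    {"tokens": ["Berlin", "is", "a", "city"], "tags": ["B-LOC", "O", "O", "O"]},
]
conll03 = [
    {"tokens": ["EU", "rejects", "German", "call"], "tags": ["B-ORG", "O", "B-MISC", "O"]},
]

def combine_corpora(*corpora, seed=0):
    """Concatenate NER corpora and shuffle so training batches mix genres."""
    combined = [example for corpus in corpora for example in corpus]
    random.Random(seed).shuffle(combined)  # seeded for reproducibility
    return combined

combined = combine_corpora(wikineural, conll03)
print(len(combined))  # → 2
```

Mixing encyclopedic and news sentences in each batch is what gives the combined model its robustness across genres.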
The contents of this repository are restricted to non-commercial research purposes only, under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). Copyright of the dataset contents and models belongs to the original copyright holders.