Model:
Babelscape/wikineural-multilingual-ner
This is the model card for the EMNLP 2021 paper WikiNEuRal: Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER. We fine-tuned a multilingual language model (mBERT) for 3 epochs on our WikiNEuRal dataset for Named Entity Recognition (NER). The resulting multilingual NER model supports the 9 languages covered by WikiNEuRal (German, English, Spanish, French, Italian, Dutch, Polish, Portuguese, Russian), and it was trained jointly on all 9 languages.
If you use this model, please cite this work in your paper:
@inproceedings{tedeschi-etal-2021-wikineural-combined,
    title = "{W}iki{NE}u{R}al: {C}ombined Neural and Knowledge-based Silver Data Creation for Multilingual {NER}",
    author = "Tedeschi, Simone and Maiorca, Valentino and Campolungo, Niccol{\`o} and Cecconi, Francesco and Navigli, Roberto",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-emnlp.215",
    pages = "2521--2533",
    abstract = "Multilingual Named Entity Recognition (NER) is a key intermediate task which is needed in many areas of NLP. In this paper, we address the well-known issue of data scarcity in NER, especially relevant when moving to a multilingual scenario, and go beyond current approaches to the creation of multilingual silver data for the task. We exploit the texts of Wikipedia and introduce a new methodology based on the effective combination of knowledge-based approaches and neural models, together with a novel domain adaptation technique, to produce high-quality training corpora for NER. We evaluate our datasets extensively on standard benchmarks for NER, yielding substantial improvements up to 6 span-based F1-score points over previous state-of-the-art systems for data creation.",
}
The repository for the original paper can be found at: https://github.com/Babelscape/wikineural .
You can perform Named Entity Recognition with the Transformers pipeline:

    from transformers import AutoTokenizer, AutoModelForTokenClassification
    from transformers import pipeline

    tokenizer = AutoTokenizer.from_pretrained("Babelscape/wikineural-multilingual-ner")
    model = AutoModelForTokenClassification.from_pretrained("Babelscape/wikineural-multilingual-ner")

    nlp = pipeline("ner", model=model, tokenizer=tokenizer, grouped_entities=True)
    example = "My name is Wolfgang and I live in Berlin"

    ner_results = nlp(example)
    print(ner_results)
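The `grouped_entities=True` flag (superseded in recent Transformers versions by `aggregation_strategy="simple"`) merges token-level B-/I- predictions into whole entity spans. A minimal sketch of that kind of BIO-tag aggregation, using illustrative token/tag pairs rather than real pipeline output:

```python
# Minimal sketch of BIO-tag aggregation, similar in spirit to what
# grouped_entities=True does inside the pipeline. The token/tag pairs
# below are illustrative, not actual model output.

def group_bio_entities(tokens, tags):
    """Merge B-/I- tagged tokens into (entity_type, text) spans."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag starts a new span, closing any open one.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type extends the open span.
            current_tokens.append(token)
        else:
            # An "O" tag (or an inconsistent I- tag) ends the open span.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities

tokens = ["My", "name", "is", "Wolfgang", "and", "I", "live", "in", "Berlin"]
tags = ["O", "O", "O", "B-PER", "O", "O", "O", "O", "B-LOC"]
print(group_bio_entities(tokens, tags))  # → [('PER', 'Wolfgang'), ('LOC', 'Berlin')]
```

In practice you would read the spans directly from `ner_results`; this helper only illustrates how token-level predictions become entity-level ones.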
This model was trained on WikiNEuRal, a state-of-the-art dataset for multilingual NER derived automatically from Wikipedia. Therefore, it may not generalize well to all textual genres (e.g. news). On the other hand, models trained only on news articles (e.g. only on CoNLL03) have been shown to obtain much lower scores on encyclopedic articles. To obtain a more robust system, we recommend training a system on the combination of WikiNEuRal with other datasets (e.g. WikiNEuRal + CoNLL).
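One simple way to build such a combined training set is to concatenate the two corpora and shuffle them, assuming both are already in the same token/BIO-tag format. The mini-corpora and field names below are illustrative, not the real dataset files:

```python
import random

# Illustrative mini-corpora in a shared token/tag format; in practice these
# would be the full WikiNEuRal and CoNLL03 training splits.
wikineural = [
    {"tokens": ["Berlin", "is", "a", "city"], "tags": ["B-LOC", "O", "O", "O"]},
]
conll03 = [
    {"tokens": ["EU", "rejects", "German", "call"], "tags": ["B-ORG", "O", "B-MISC", "O"]},
]

def combine_corpora(*corpora, seed=0):
    """Concatenate NER corpora and shuffle so training batches mix genres."""
    combined = [example for corpus in corpora for example in corpus]
    random.Random(seed).shuffle(combined)  # seeded for reproducibility
    return combined

combined = combine_corpora(wikineural, conll03)
print(len(combined))  # → 2
```

Mixing encyclopedic and news sentences in each batch is what gives the combined model its robustness across genres.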
The contents of this repository are restricted to non-commercial research purposes only, under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). Copyright of the dataset contents and models belongs to the original copyright holders.