Model:
alabnii/jmedroberta-base-sentencepiece
This is a Japanese RoBERTa base model pre-trained on academic articles in medical sciences collected by Japan Science and Technology Agency (JST).
This model is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).
Reference
Ja:
@InProceedings{sugimoto_nlp2023_jmedroberta,
author = "杉本海人 and 壹岐太一 and 知田悠生 and 金沢輝一 and 相澤彰子",
title = "J{M}ed{R}o{BERT}a: 日本語の医学論文にもとづいた事前学習済み言語モデルの構築と評価",
booktitle = "言語処理学会第29回年次大会",
year = "2023",
url = "https://www.anlp.jp/proceedings/annual_meeting/2023/pdf_dir/P3-1.pdf"
}
En:
@InProceedings{sugimoto_nlp2023_jmedroberta,
author = "Sugimoto, Kaito and Iki, Taichi and Chida, Yuki and Kanazawa, Teruhito and Aizawa, Akiko",
title = "J{M}ed{R}o{BERT}a: a Japanese Pre-trained Language Model on Academic Articles in Medical Sciences (in Japanese)",
booktitle = "Proceedings of the 29th Annual Meeting of the Association for Natural Language Processing",
year = "2023",
url = "https://www.anlp.jp/proceedings/annual_meeting/2023/pdf_dir/P3-1.pdf"
}
Input text must be converted to full-width characters (全角) in advance.
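The full-width requirement above can be sketched with the standard library alone. This is a minimal illustration, not the authors' preprocessing: `to_fullwidth` and its `keep` parameter are hypothetical names, and it assumes that only printable ASCII characters need converting. Spaces and half-width katakana are left untouched here, and the `[MASK]` token is preserved so that it can still be recognized by the tokenizer.

```python
# Hypothetical helper, not part of the released model code.
# Printable ASCII (U+0021..U+007E) maps to the full-width forms
# (U+FF01..U+FF5E) by adding a fixed offset of 0xFEE0.
ASCII_TO_FULLWIDTH = {code: code + 0xFEE0 for code in range(0x21, 0x7F)}


def to_fullwidth(text: str, keep: str = "[MASK]") -> str:
    """Convert printable ASCII to full-width forms, leaving `keep` untouched."""
    parts = text.split(keep)
    return keep.join(part.translate(ASCII_TO_FULLWIDTH) for part in parts)


print(to_fullwidth("COVID-19の患者は[MASK]と診断された。"))
```

If your data contains half-width katakana or ASCII spaces, a dedicated library may be a better fit than this sketch.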
You can use this model for masked language modeling as follows:
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer
model = AutoModelForMaskedLM.from_pretrained("alabnii/jmedroberta-base-sentencepiece")
model.eval()
tokenizer = AutoTokenizer.from_pretrained("alabnii/jmedroberta-base-sentencepiece")
texts = ['この患者は[MASK]と診断された。']
inputs = tokenizer(texts, return_tensors='pt')
with torch.no_grad():  # inference only, so gradients are not needed
    outputs = model(**inputs)
# Drop the special tokens at both ends, then take the most likely token at each position.
tokenizer.convert_ids_to_tokens(outputs.logits[0][1:-1].argmax(dim=-1).tolist())
# ['▁この', '患者は', 'AML', '▁', 'と診断された', '。']
Alternatively, you can use the fill-mask pipeline.
from transformers import pipeline
fill = pipeline("fill-mask", model="alabnii/jmedroberta-base-sentencepiece", top_k=10)
fill("この患者は[MASK]と診断された。")
#[{'score': 0.04239409416913986,
# 'token': 7698,
# 'token_str': 'AML',
# 'sequence': 'この患者はAML と診断された。'},
# {'score': 0.03562006726861,
# 'token': 3298,
# 'token_str': 'SLE',
# 'sequence': 'この患者はSLE と診断された。'},
# {'score': 0.025064188987016678,
# 'token': 10303,
# 'token_str': 'MDS',
# 'sequence': 'この患者はMDS と診断された。'},
# ...
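The pipeline returns a list of dictionaries with `score`, `token`, `token_str`, and `sequence` keys, as the output above shows. A minimal sketch of post-processing such a list follows. It reuses the three entries printed above as literal data, so it runs without downloading the model. The helper name `candidates_above` and the 0.03 threshold are assumptions chosen for illustration.

```python
# Literal copy of the three entries shown above, so no model download is needed.
results = [
    {"score": 0.04239409416913986, "token": 7698, "token_str": "AML",
     "sequence": "この患者はAML と診断された。"},
    {"score": 0.03562006726861, "token": 3298, "token_str": "SLE",
     "sequence": "この患者はSLE と診断された。"},
    {"score": 0.025064188987016678, "token": 10303, "token_str": "MDS",
     "sequence": "この患者はMDS と診断された。"},
]


def candidates_above(results, min_score):
    """Return token strings whose score is at least `min_score` (hypothetical helper)."""
    return [r["token_str"] for r in results if r["score"] >= min_score]


print(candidates_above(results, 0.03))  # ['AML', 'SLE']
```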
You can fine-tune this model on downstream tasks.
See also sample Colab notebooks: https://colab.research.google.com/drive/1BUD3DKOUMqcwIO3X5bYUOsR_wDzgOJcd?usp=sharing
Each sentence is tokenized by SentencePiece (Unigram).
The vocabulary consists of 30000 tokens induced by SentencePiece (Unigram).
The following hyperparameters were used during pre-training:
As the config file suggests, our model is based on HuggingFace's BertForMaskedLM class. However, we consider our model as RoBERTa for the following reasons:
This work was supported by Japan Science and Technology Agency (JST) AIP Trilateral AI Research (Grant Number: JPMJCR20G9), and Joint Usage/Research Center for Interdisciplinary Large-scale Information Infrastructures (JHPCN) (Project ID: jh221004), in Japan. In this research work, we used "mdx: a platform for the data-driven future".