A BERT model pre-trained on PubMed abstracts
Please see https://github.com/ncbi-nlp/bluebert for more details.
We provide the preprocessed PubMed texts that were used to pre-train the BlueBERT models. The corpus contains approximately 4,000M words extracted from the ASCII version of PubMed.
Pre-training was initialized from the pre-trained model at https://huggingface.co/bert-base-uncased.
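As a minimal sketch, the model can be loaded with the Hugging Face `transformers` library; the checkpoint identifier below is an assumption for illustration and should be replaced with the name under which this BlueBERT model is actually published on the Hub:

```python
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint identifier; substitute the actual BlueBERT model name.
model_name = "bionlp/bluebert_pubmed_uncased_L-12_H-768_A-12"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Encode a PubMed-style sentence and inspect the contextual embeddings.
inputs = tokenizer("bluebert is pre-trained on pubmed abstracts.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```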
The preprocessing applied to the corpus is shown in the code snippet below.
```python
import re

from nltk.tokenize import TreebankWordTokenizer

value = value.lower()                         # lowercase the text
value = re.sub(r'[\r\n]+', ' ', value)        # collapse line breaks into spaces
value = re.sub(r'[^\x00-\x7F]+', ' ', value)  # replace non-ASCII characters with spaces
tokenized = TreebankWordTokenizer().tokenize(value)
sentence = ' '.join(tokenized)
sentence = re.sub(r"\s's\b", "'s", sentence)  # re-attach possessive 's split off by the tokenizer
```
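For illustration, the steps above can be wrapped into a helper function (a hypothetical `preprocess`, not part of the original repository) and applied to raw text:

```python
import re

from nltk.tokenize import TreebankWordTokenizer


def preprocess(value: str) -> str:
    """Apply the BlueBERT corpus preprocessing to one text segment."""
    value = value.lower()
    value = re.sub(r'[\r\n]+', ' ', value)
    value = re.sub(r'[^\x00-\x7F]+', ' ', value)
    tokenized = TreebankWordTokenizer().tokenize(value)
    sentence = ' '.join(tokenized)
    return re.sub(r"\s's\b", "'s", sentence)


print(preprocess("The patient's EKG\nshowed atrial fibrillation."))
# the patient's ekg showed atrial fibrillation .
```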
If you use BlueBERT in your work, please cite:

```bibtex
@InProceedings{peng2019transfer,
  author    = {Yifan Peng and Shankai Yan and Zhiyong Lu},
  title     = {Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets},
  booktitle = {Proceedings of the 2019 Workshop on Biomedical Natural Language Processing (BioNLP 2019)},
  year      = {2019},
  pages     = {58--65},
}
```