This dataset contains 422,070 short, computer-generated definitions for SnomedCT concepts, covering domains such as diseases, procedures, drugs, and anatomy. To generate them, we prompted the OpenAI Turbo model, a variant of GPT 3.5, with a high-quality verbalization of the SnomedCT relationships of the concept to be defined.
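As a rough illustration of what "verbalizing" a concept's relationships into a prompt can look like, here is a minimal sketch. It is not the pipeline used to build this dataset: the function names `verbalize` and `build_prompt`, the prompt wording, and the example relationships are all assumptions made up for illustration, and no API call is made.

```python
def verbalize(concept, relationships):
    """Turn (relation, target) pairs into one plain-English sentence."""
    facts = [f"{relation} {target}" for relation, target in relationships]
    return f"{concept} " + "; ".join(facts) + "."


def build_prompt(concept, relationships):
    """Assemble a definition-generation prompt (hypothetical wording)."""
    return (
        f"Known facts: {verbalize(concept, relationships)}\n"
        f'Write a short definition of "{concept}".'
    )


# Illustrative relationships only; not copied from SnomedCT.
example_relationships = [
    ("is a", "pneumonia"),
    ("has finding site", "lung structure"),
    ("has causative agent", "bacterium"),
]

prompt = build_prompt("Bacterial pneumonia", example_relationships)
print(prompt)
```

The resulting string would then be sent to a language model of your choice; the quality of the generated definition depends heavily on how faithfully the relationships are verbalized.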
IMPORTANT: A quality-control review found that the majority of the definitions are factual, insightful, and fluent. However, about 30% of the definitions generated by this procedure do not meet the high standards required for presentation to users, or for use by machine learning models in scenarios requiring reasoning. Even so, more than 95% of the definitions appear useful for biomedical model pre-training. We therefore release this dataset for building retrieval-based systems and for evaluating large biomedical language models on the definition-generation task (and eventually for low-rank finetuning of existing language models).
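To make the retrieval use case concrete, the sketch below ranks definitions against a free-text query using simple token overlap (Jaccard similarity). It is only a toy baseline under stated assumptions: the `records` list holds invented placeholder pairs rather than actual dataset entries, the `(concept, definition)` layout is hypothetical, and a real system would more likely use a dense or BM25 retriever.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search(query, records, top_k=3):
    """Rank (concept, definition) pairs by Jaccard overlap with the query."""
    query_tokens = tokenize(query)
    scored = []
    for concept, definition in records:
        doc_tokens = tokenize(concept + " " + definition)
        union = query_tokens | doc_tokens
        score = len(query_tokens & doc_tokens) / len(union) if union else 0.0
        scored.append((score, concept))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [concept for score, concept in scored[:top_k] if score > 0]


# Placeholder records for illustration; not taken from the dataset.
records = [
    ("Bacterial pneumonia", "An infection of the lung caused by bacteria."),
    ("Appendectomy", "A surgical procedure that removes the appendix."),
    ("Femur", "The long bone of the thigh."),
]

results = search("lung infection", records)
print(results)
```

Because roughly 30% of the definitions are imperfect, a retrieval system built this way may be better suited to candidate generation than to showing definitions directly to users.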
Use of this work is subject to both the SnomedCT and OpenAI API license agreements. We strongly recommend checking those licenses before making use of this dataset.
If you use this dataset, please cite the following work (to appear at BioNLP 2023):
@misc{remy-and-demeester-2023-glossary,
    title = "Automatic Glossary of Clinical Terminology: a Large-Scale Dictionary of Biomedical Definitions Generated from Ontological Knowledge",
    author = "Remy, François and Demeester, Thomas",
    year = 2023
}