⚠️ WARNING: THIS VERSION OF THE DATASET IS MODIFIED IN FORMAT AND CONTENT FROM THE ORIGINAL DATASET AVAILABLE HERE. NESTED ENTITIES HAVE BEEN REMOVED AND THIS DATASET ONLY RETAINS THE LARGEST OF NESTED ENTITIES. OVERALL, THIS CORRESPONDS TO 80% OF THE ENTITIES ANNOTATED IN THE ORIGINAL DATASET. ⚠️
The QUAERO French Medical Corpus was initially developed as a resource for named entity recognition and normalization [1]. It was then improved with the purpose of creating a gold standard set of normalized entities for French biomedical text, which was used in the CLEF eHealth evaluation lab [2][3].
A selection of MEDLINE titles and EMEA documents was manually annotated. The annotation process was guided by concepts in the Unified Medical Language System (UMLS):
Ten types of clinical entities, as defined by the following UMLS Semantic Groups (Bodenreider and McCray 2003), were annotated: Anatomy (ANAT), Chemicals and Drugs (CHEM), Devices (DEVI), Disorders (DISO), Geographic Areas (GEOG), Living Beings (LIVB), Objects (OBJC), Phenomena (PHEN), Physiology (PHYS), Procedures (PROC).
The annotations were made in a comprehensive fashion, so that nested entities were marked, and entities could be mapped to more than one UMLS concept. In particular:
(a) If a mention can refer to more than one Semantic Group, all the relevant Semantic Groups should be annotated. For instance, the mention “récidive” (recurrence) in the phrase “prévention des récidives” (recurrence prevention) should be annotated with the category “DISORDER” (CUI C2825055) and the category “PHENOMENON” (CUI C0034897).
(b) If a mention can refer to more than one UMLS concept within the same Semantic Group, all the relevant concepts should be annotated. For instance, the mention “maniaques” (obsessive) in the phrase “patients maniaques” (obsessive patients) should be annotated with CUIs C0564408 and C0338831 (category “DISORDER”).
(c) Entities whose span overlaps with that of another entity should still be annotated. For instance, in the phrase “infarctus du myocarde” (myocardial infarction), the mention “myocarde” (myocardium) should be annotated with category “ANATOMY” (CUI C0027061) and the mention “infarctus du myocarde” should be annotated with category “DISORDER” (CUI C0027051).
For more details, please refer to the official webpage.
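To illustrate how nested annotations such as “myocarde” inside “infarctus du myocarde” can be reduced to their largest entity, here is a minimal Python sketch. It is not the script used to build this version of the dataset: the `Span` type, the `keep_outermost` function, and the character offsets are assumptions made for the example, and the way ties or partial overlaps were resolved in practice is not documented here.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    """Hypothetical entity span: character offsets plus a Semantic Group label."""
    start: int  # inclusive
    end: int    # exclusive
    label: str


def keep_outermost(spans: List[Span]) -> List[Span]:
    """Keep only spans that are not contained in an already kept, longer span.

    Assumption: spans with identical extent are treated as nested, so only the
    first one encountered is kept. Partially overlapping spans are all kept.
    """
    kept: List[Span] = []
    # Visit longest spans first so a container is always seen before its content.
    for span in sorted(spans, key=lambda s: (s.start - s.end, s.start)):
        contained = any(k.start <= span.start and span.end <= k.end for k in kept)
        if not contained:
            kept.append(span)
    return sorted(kept, key=lambda s: s.start)


text = "infarctus du myocarde"
spans = [
    Span(13, 21, "ANAT"),  # "myocarde"
    Span(0, 21, "DISO"),   # "infarctus du myocarde"
]
flat = keep_outermost(spans)
```

With these two spans, only the DISO annotation covering the whole phrase remains, which matches the behaviour described for nested entities in this version of the dataset.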
In this version of the dataset, each word of a sentence has an associated ner_tag corresponding to the type of clinical entity. The mapping is:
0: "O", 1: "ANAT", 2: "LIVB", 3: "DISO", 4: "PROC", 5: "CHEM", 6: "GEOG", 7: "PHYS", 8: "PHEN", 9: "OBJC", 10: "DEVI"
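The mapping can be used to turn integer tags back into readable labels. The following Python sketch copies the mapping above; the `decode_tags` helper and the example tokens and tags are hypothetical and only illustrate the word-level tagging scheme, they are not taken from the dataset.

```python
# Tag id -> entity type, copied from the mapping above.
ID2LABEL = {
    0: "O", 1: "ANAT", 2: "LIVB", 3: "DISO", 4: "PROC", 5: "CHEM",
    6: "GEOG", 7: "PHYS", 8: "PHEN", 9: "OBJC", 10: "DEVI",
}
# Reverse lookup, useful when preparing labels for a model.
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}


def decode_tags(tokens, ner_tags):
    """Pair each word with the label name of its integer tag."""
    return [(token, ID2LABEL[tag]) for token, tag in zip(tokens, ner_tags)]


# Hypothetical example row (not an actual record from the dataset).
tokens = ["infarctus", "du", "myocarde"]
ner_tags = [3, 3, 3]
decoded = decode_tags(tokens, ner_tags)
```

Because nested entities were removed, each word carries a single tag, so one list of integers per sentence is enough to represent the annotation.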
[1] Névéol A, Grouin C, Leixa J, Rosset S, Zweigenbaum P. The QUAERO French Medical Corpus: A Ressource for Medical Entity Recognition and Normalization. Fourth Workshop on Building and Evaluating Ressources for Health and Biomedical Text Processing - BioTxtM2014. 2014:24-30
[2] Névéol A, Grouin C, Tannier X, Hamon T, Kelly L, Goeuriot L, Zweigenbaum P. Task 1b of the CLEF eHealth Evaluation Lab 2015: Clinical Named Entity Recognition. CLEF 2015 Evaluation Labs and Workshop: Online Working Notes, CEUR-WS. September 2015.
[3] Névéol A, Cohen KB, Grouin C, Hamon T, Lavergne T, Kelly L, Goeuriot L, Rey G, Robert A, Tannier X, Zweigenbaum P. Clinical Information Extraction at the CLEF eHealth Evaluation lab 2016. CLEF 2016, Online Working Notes, CEUR-WS 1609. 2016:28-42.