Dataset:
PlanTL-GOB-ES/pharmaconer
Manually classified collection of Spanish clinical case studies.
Manually classified collection of clinical case studies derived from the Spanish Clinical Case Corpus (SPACCC), an open access electronic library that gathers Spanish medical publications from SciELO.
The PharmaCoNER corpus contains a total of 396,988 words and 1,000 clinical cases that have been randomly sampled into 3 subsets. The training set contains 500 clinical cases, while the development and test sets contain 250 clinical cases each. In terms of training examples, this translates to 8,129, 3,787 and 3,952 annotated sentences in the training, development and test sets, respectively. The original dataset is distributed in Brat format.
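For readers working from the original Brat distribution, the following is a minimal sketch of reading one text-bound entity line from a Brat standoff `.ann` file. It assumes the standard Brat layout (annotation ID, then type and character offsets, then the covered text, separated by tabs). The sample line is hypothetical and only illustrates the shape of the data; `parse_brat_entity` and `BratEntity` are names introduced here, not part of any corpus tooling.

```python
from typing import List, NamedTuple, Tuple


class BratEntity(NamedTuple):
    ann_id: str                   # e.g. "T1"
    label: str                    # entity type, e.g. "NORMALIZABLES"
    spans: List[Tuple[int, int]]  # one (start, end) pair per fragment
    text: str                     # surface text covered by the annotation


def parse_brat_entity(line: str) -> BratEntity:
    """Parse one text-bound ("T") line of a Brat .ann file."""
    ann_id, type_and_offsets, text = line.rstrip("\n").split("\t", 2)
    label, offsets = type_and_offsets.split(" ", 1)
    # Discontinuous annotations separate their fragments with ";".
    spans = [
        tuple(int(x) for x in fragment.split())
        for fragment in offsets.split(";")
    ]
    return BratEntity(ann_id, label, spans, text)


# Hypothetical line, written for illustration only.
example = "T1\tNORMALIZABLES 223 231\tatenolol"
entity = parse_brat_entity(example)
print(entity.label, entity.spans, entity.text)
```

Lines that do not start with `T` (relations, notes, attributes) would need separate handling and are ignored in this sketch.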
The annotation of the entire set of entity mentions was carried out by domain experts. It includes the following 4 entity types: NORMALIZABLES, NO_NORMALIZABLES, PROTEINAS and UNCLEAR.
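When training a token classifier on these entity types, a label inventory is usually needed. The sketch below builds one under the assumption that tags follow a BIO scheme with `B-` and `I-` prefixes. The `I-` prefix is an assumption, and `build_label_maps` is a hypothetical helper name.

```python
ENTITY_TYPES = ["NORMALIZABLES", "NO_NORMALIZABLES", "PROTEINAS", "UNCLEAR"]


def build_label_maps(entity_types):
    """Return (label2id, id2label) for an assumed BIO tagging scheme."""
    labels = ["O"]  # "O" marks tokens outside any entity
    for entity_type in entity_types:
        labels.extend([f"B-{entity_type}", f"I-{entity_type}"])
    label2id = {label: index for index, label in enumerate(labels)}
    id2label = {index: label for label, index in label2id.items()}
    return label2id, id2label


label2id, id2label = build_label_maps(ENTITY_TYPES)
print(len(label2id))  # 1 "O" label + 2 labels per entity type
```

With four entity types this yields nine labels. If the files turn out to use a different scheme, only the prefix list needs to change.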
This dataset was designed for the PharmaCoNER task, sponsored by Plan-TL.
For further information, please visit the official website.
Named Entity Recognition (NER)
Three four-column files, one for each split.
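Every file has four columns: the token, the identifier of the source document, the character span of the token (start and end offsets) and the entity tag.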
La S0004-06142006000900008-1 123_125 O
paciente S0004-06142006000900008-1 126_134 O
tenía S0004-06142006000900008-1 135_140 O
antecedentes S0004-06142006000900008-1 141_153 O
de S0004-06142006000900008-1 154_156 O
hipotiroidismo S0004-06142006000900008-1 157_171 O
, S0004-06142006000900008-1 171_172 O
hipertensión S0004-06142006000900008-1 173_185 O
arterial S0004-06142006000900008-1 186_194 O
en S0004-06142006000900008-1 195_197 O
tratamiento S0004-06142006000900008-1 198_209 O
habitual S0004-06142006000900008-1 210_218 O
con S0004-06142006000900008-1 219-222 O
atenolol S0004-06142006000900008-1 223_231 B-NORMALIZABLES
y S0004-06142006000900008-1 232_233 O
enalapril S0004-06142006000900008-1 234_243 B-NORMALIZABLES
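The four-column layout in the example above can be read with a few lines of Python. This is a minimal sketch, assuming one token per line in the actual files, whitespace-separated fields, and tokens that contain no internal whitespace. The span separator is `_` in most of the example but `-` in one entry, so the parser accepts both. `Token`, `parse_line` and `extract_entities` are hypothetical names introduced for illustration.

```python
import re
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    form: str    # surface form of the token
    doc_id: str  # identifier of the source clinical case
    start: int   # character offset where the token starts
    end: int     # character offset where the token ends
    tag: str     # "O" or a prefixed entity tag such as "B-NORMALIZABLES"


def parse_line(line: str) -> Token:
    form, doc_id, span, tag = line.split()
    start, end = (int(part) for part in re.split(r"[_-]", span))
    return Token(form, doc_id, start, end, tag)


def extract_entities(tokens: List[Token]) -> List[Tuple[str, int, int, str]]:
    """Group B-/I- tagged tokens into (label, start, end, text) tuples."""
    entities, current = [], None
    for token in tokens:
        if token.tag.startswith("B-"):
            if current:
                entities.append(tuple(current))
            current = [token.tag[2:], token.start, token.end, token.form]
        elif token.tag.startswith("I-") and current and token.tag[2:] == current[0]:
            current[2] = token.end
            current[3] += " " + token.form  # approximate: joins tokens with a space
        else:
            if current:
                entities.append(tuple(current))
            current = None
    if current:
        entities.append(tuple(current))
    return entities


sample = """con S0004-06142006000900008-1 219-222 O
atenolol S0004-06142006000900008-1 223_231 B-NORMALIZABLES
y S0004-06142006000900008-1 232_233 O
enalapril S0004-06142006000900008-1 234_243 B-NORMALIZABLES"""

tokens = [parse_line(line) for line in sample.splitlines() if line.strip()]
print(extract_entities(tokens))
```

The sample rows are taken from the example above. Sentence boundaries are not handled here, since the card does not state how the files mark them.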
Split | Size (sentences) |
---|---|
train | 8,129 |
dev | 3,787 |
test | 3,952 |
For compatibility with similar datasets in other languages, we followed existing curation guidelines as closely as possible.
Manually classified collection of clinical case report sections. The clinical cases were not restricted to a single medical discipline; they cover a variety of disciplines, including oncology, urology, cardiology, pneumology and infectious diseases. This is key to covering a diverse set of chemicals and drugs.
Who are the source language producers? Humans; there is no machine-generated data.
The annotation process of the PharmaCoNER corpus was inspired by the annotation schemes and corpora used for the BioCreative CHEMDNER and GPRO tracks. The guidelines used for those tracks were translated into Spanish and adapted to the characteristics and needs of clinically oriented documents by modifying the annotation criteria and rules to cover medical information needs. This adaptation was carried out in collaboration with practicing physicians and medicinal chemistry experts. The adaptation, translation and refinement of the guidelines were done on a sample set of the SPACCC corpus, as part of an iterative process of annotation consistency analysis through inter-annotator agreement (IAA) studies, which continued until a high annotation quality in terms of IAA was reached.
Who are the annotators? Practicing physicians and medicinal chemistry experts.
No personal or sensitive information is included.
This corpus contributes to the development of medical language models in Spanish.
Text Mining Unit (TeMU) at the Barcelona Supercomputing Center (bsc-temu@bsc.es).
For further information, send an email to plantl-gob-es@bsc.es.
This work was funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL.
This work is licensed under a CC Attribution 4.0 International License.
Copyright by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) (2022)
@inproceedings{,
    title = "PharmaCoNER: Pharmacological Substances, Compounds and proteins Named Entity Recognition track",
    author = "Gonzalez-Agirre, Aitor and Marimon, Montserrat and Intxaurrondo, Ander and Rabal, Obdulia and Villegas, Marta and Krallinger, Martin",
    booktitle = "Proceedings of The 5th Workshop on BioNLP Open Shared Tasks",
    month = nov,
    year = "2019",
    address = "Hong Kong, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/D19-5701",
    doi = "10.18653/v1/D19-5701",
    pages = "1--10",
}