Dataset:
EMBO/SourceData
The largest annotated biomedical corpus for machine learning and AI in the publishing context.
SourceData is the largest annotated biomedical dataset for NER and NEL. It is unique in its focus on the core of scientific evidence: figure captions. It is also unique in its real-world configuration, since it does not present isolated sentences taken out of their broader context. It offers fully annotated figure captions that can be further enriched with context from the full text, abstract, or title of the paper. The goal is to extract the nature of the experiments described in them. SourceData is also unique in labelling the causal relationships between the biological entities involved in experiments, assigning an experimental role to each biomedical entity in the corpus.
SourceData consistently annotates nine different biological entity types (genes, proteins, cells, tissues, subcellular components, species, small molecules, and diseases). It is the first dataset to annotate experimental assays and the roles played in them by the biological entities. Each entity is linked to its corresponding ontology, allowing for entity disambiguation and NEL.
@misc{embo_2023,
  author    = {Abreu-Vicente, J. and Lemberger, T.},
  title     = {The SourceData dataset},
  year      = 2023,
  url       = {https://huggingface.co/datasets/EMBO/SourceData},
  doi       = {10.57967/hf/0495},
  publisher = {Hugging Face}
}

@article{Liechti2017,
  author  = {Liechti, Robin and George, Nancy and Götz, Lou and El-Gebali, Sara and Chasapi, Anastasia and Crespo, Isaac and Xenarios, Ioannis and Lemberger, Thomas},
  title   = {SourceData - a semantic platform for curating and searching figures},
  journal = {Nature Methods},
  year    = {2017},
  volume  = {14},
  number  = {11},
  doi     = {10.1038/nmeth.4471},
  url     = {https://doi.org/10.1038/nmeth.4471},
  eprint  = {https://www.biorxiv.org/content/early/2016/06/20/058529.full.pdf}
}
from datasets import load_dataset

# Load NER
ds = load_dataset("EMBO/SourceData", "NER", version="1.0.0")

# Load PANELIZATION
ds = load_dataset("EMBO/SourceData", "PANELIZATION", version="1.0.0")

# Load GENEPROD ROLES
ds = load_dataset("EMBO/SourceData", "ROLES_GP", version="1.0.0")

# Load SMALL MOLECULE ROLES
ds = load_dataset("EMBO/SourceData", "ROLES_SM", version="1.0.0")

# Load MULTI ROLES
ds = load_dataset("EMBO/SourceData", "ROLES_MULTI", version="1.0.0")
Tags are provided as IOB2-style tags.

PANELIZATION: figure captions (or figure legends) are usually composed of segments that each refer to one of several 'panels' of the full figure. Panels tend to represent results obtained with a coherent method and depict data points that can be meaningfully compared to each other. PANELIZATION provides the start of these segments (B-PANEL_START) and allows training models to recognize the boundary between consecutive panel legends.

NER: biological and chemical entities are labeled; specifically, the entity types listed above (genes and proteins, cells, tissues, subcellular components, species, small molecules, diseases, and experimental assays) are tagged.
In the case of experimental roles, labels are generated separately for GENEPROD (ROLES_GP) and SMALL_MOL (ROLES_SM); the ROLES_MULTI configuration covers both entity types at the same time.
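Because the label set differs between configurations, it can help to inspect the label names programmatically. The snippet below is a minimal sketch, assuming the labels column is a Sequence of ClassLabel (the usual encoding for token-classification datasets on the Hub); verify the exact names against the loaded features.

from datasets import load_dataset

# Sketch: list the IOB2 label names of a given configuration.
# Assumes "labels" is a Sequence(ClassLabel) feature; check ds["train"].features to confirm.
ds = load_dataset("EMBO/SourceData", "NER", version="1.0.0")
print(ds["train"].features["labels"].feature.names)

ds_roles = load_dataset("EMBO/SourceData", "ROLES_GP", version="1.0.0")
print(ds_roles["train"].features["labels"].feature.names)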
The text in the dataset is English.
DatasetDict({
    train: Dataset({
        features: ['words', 'labels', 'tag_mask', 'text'],
        num_rows: 55250
    })
    test: Dataset({
        features: ['words', 'labels', 'tag_mask', 'text'],
        num_rows: 6844
    })
    validation: Dataset({
        features: ['words', 'labels', 'tag_mask', 'text'],
        num_rows: 7951
    })
})
DatasetDict({
    train: Dataset({
        features: ['words', 'labels', 'tag_mask'],
        num_rows: 14655
    })
    test: Dataset({
        features: ['words', 'labels', 'tag_mask'],
        num_rows: 1871
    })
    validation: Dataset({
        features: ['words', 'labels', 'tag_mask'],
        num_rows: 2088
    })
})
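A quick way to reproduce these summaries and look at an individual example is sketched below; the comments on field semantics are assumptions based on the feature names and should be checked against the data.

from datasets import load_dataset

ds = load_dataset("EMBO/SourceData", "NER", version="1.0.0")
print(ds)  # prints a DatasetDict summary like the one shown above

example = ds["train"][0]
# 'words' holds the pre-tokenized caption text and 'labels' the word-level tag ids;
# 'tag_mask' is assumed to flag the positions relevant for the tagging task.
print(example["words"][:10])
print(example["labels"][:10])
print(example["tag_mask"][:10])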
The dataset was built to train models for the automatic extraction of a knowledge graph from the scientific literature. It can be used to train models for text segmentation, named entity recognition and semantic role labeling.
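As an illustration of the NER use case, the following is a minimal sketch (not the authors' training pipeline) of preparing the NER configuration for token classification with Hugging Face transformers. The checkpoint name is only a placeholder, and the word-level integer labels are assumed to follow the Sequence(ClassLabel) encoding noted above.

from datasets import load_dataset
from transformers import AutoTokenizer

ds = load_dataset("EMBO/SourceData", "NER", version="1.0.0")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # placeholder checkpoint

def tokenize_and_align(example):
    # Tokenize the pre-split words and propagate the word-level IOB2 label ids
    # to the first sub-token of each word; all other sub-tokens get -100 so
    # they are ignored by the loss.
    tokenized = tokenizer(example["words"], is_split_into_words=True, truncation=True)
    aligned_labels = []
    previous_word_id = None
    for word_id in tokenized.word_ids():
        if word_id is None or word_id == previous_word_id:
            aligned_labels.append(-100)
        else:
            aligned_labels.append(example["labels"][word_id])
        previous_word_id = word_id
    tokenized["labels"] = aligned_labels
    return tokenized

tokenized_ds = ds.map(tokenize_and_align, remove_columns=ds["train"].column_names)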
Figure legends were annotated according to the SourceData framework described in Liechti et al. 2017 (Nature Methods, 2017, https://doi.org/10.1038/nmeth.4471). The curation tool at https://curation.sourcedata.io was used to segment figure legends into panel legends, tag entities, assign experimental roles and normalize with standard identifiers (not available in this dataset). The source data was downloaded from the SourceData API (https://api.sourcedata.io) on 21 Jan 2021.
Who are the source language producers? The examples are extracted from the figure legends of scientific papers in cell and molecular biology.
The annotations were produced manually by expert curators from the SourceData project (https://sourcedata.embo.org).
Who are the annotators? Curators of the SourceData project.
None known.
Not applicable.
The examples are heavily biased towards cell and molecular biology and are enriched in examples from papers published in EMBO Press journals (https://embopress.org).
Disease annotations have only recently been added to the dataset. Although they appear, their number is very low and they are not consistently tagged throughout the entire dataset. We therefore recommend using the disease annotations only after filtering for the examples that contain them.
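A minimal sketch of such filtering is shown below, assuming the labels column is a Sequence(ClassLabel) whose names include IOB2 disease tags (e.g. B-DISEASE / I-DISEASE); verify the actual label names before relying on this.

from datasets import load_dataset

ds = load_dataset("EMBO/SourceData", "NER", version="1.0.0")

# Assumption: label names containing "DISEASE" mark disease mentions; check
# ds["train"].features["labels"].feature.names to confirm.
label_names = ds["train"].features["labels"].feature.names
disease_ids = {i for i, name in enumerate(label_names) if "DISEASE" in name}

# Keep only the examples that contain at least one disease tag.
with_diseases = ds["train"].filter(
    lambda example: any(label in disease_ids for label in example["labels"])
)
print(len(with_diseases))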
[More Information Needed]
Thomas Lemberger, EMBO. Jorge Abreu Vicente, EMBO
CC BY 4.0
We are currently working on a paper to present the dataset. It is expected to be ready by spring 2023. In the meantime, the following paper should be cited.
@article{Liechti2017,
  author  = {Liechti, Robin and George, Nancy and Götz, Lou and El-Gebali, Sara and Chasapi, Anastasia and Crespo, Isaac and Xenarios, Ioannis and Lemberger, Thomas},
  title   = {SourceData - a semantic platform for curating and searching figures},
  journal = {Nature Methods},
  year    = {2017},
  volume  = {14},
  number  = {11},
  doi     = {10.1038/nmeth.4471},
  url     = {https://doi.org/10.1038/nmeth.4471},
  eprint  = {https://www.biorxiv.org/content/early/2016/06/20/058529.full.pdf}
}
Thanks to @tlemberger and @drAbreu for adding this dataset.