Dataset:

EMBO/SourceData

Language:

en

Size:

10K<n<100K

DOI:

10.57967/hf/0495

License:

cc-by-4.0

SourceData Dataset

The largest annotated biomedical corpus for machine learning and AI in the publishing context.

SourceData is the largest annotated biomedical dataset for NER and NEL. It is unique in its focus on the core of scientific evidence: figure captions. It is also unique in its real-world configuration, since it does not present isolated sentences stripped of their wider context. It offers fully annotated figure captions that can be further enriched in context using full text, abstracts, or titles. The goal is to extract the nature of the experiments described in them. SourceData is also unique in labelling the causal relationship between the biological entities present in experiments, assigning an experimental role to each biomedical entity in the corpus.

SourceData consistently annotates nine different biological entities (genes, proteins, cells, tissues, subcellular components, species, small molecules, and diseases). It is the first dataset to annotate experimental assays and the roles that the biological entities play in them. Each entity is linked to its corresponding ontology, allowing for entity disambiguation and NEL.

Cite our work

@misc {embo_2023,
    author       = { Abreu-Vicente, J. and Lemberger, T. },
    title        = { The SourceData dataset},
    year         = 2023,
    url          = { https://huggingface.co/datasets/EMBO/SourceData },
    doi          = { 10.57967/hf/0495 },
    publisher    = { Hugging Face }
}

@article {Liechti2017,
     author = {Liechti, Robin and George, Nancy and Götz, Lou and El-Gebali, Sara and Chasapi, Anastasia and Crespo, Isaac and Xenarios, Ioannis and Lemberger, Thomas},
     title = {SourceData - a semantic platform for curating and searching figures},
     year = {2017},
     volume = {14},
     number = {11},
     doi = {10.1038/nmeth.4471},
     URL = {https://doi.org/10.1038/nmeth.4471},
     eprint = {https://www.biorxiv.org/content/early/2016/06/20/058529.full.pdf},
     journal = {Nature Methods}
}

Dataset usage

  from datasets import load_dataset
  # Load NER
  ds = load_dataset("EMBO/SourceData", "NER", version="1.0.0")
  # Load PANELIZATION
  ds = load_dataset("EMBO/SourceData", "PANELIZATION", version="1.0.0")
  # Load GENEPROD ROLES
  ds = load_dataset("EMBO/SourceData", "ROLES_GP", version="1.0.0")
  # Load SMALL MOLECULE ROLES
  ds = load_dataset("EMBO/SourceData", "ROLES_SM", version="1.0.0")
  # Load MULTI ROLES
  ds = load_dataset("EMBO/SourceData", "ROLES_MULTI", version="1.0.0")

Supported Tasks and Leaderboards

Tags are provided as IOB2-style tags.

PANELIZATION : figure captions (or figure legends) are usually composed of segments that each refer to one of several 'panels' of the full figure. Panels tend to represent results obtained with a coherent method and depict data points that can be meaningfully compared to each other. PANELIZATION provides the start (B-PANEL_START) of these segments and allows training for recognition of the boundary between consecutive panel legends.

NER : biological and chemical entities are labeled. Specifically, the following entities are tagged:

  • SMALL_MOLECULE : small molecules
  • GENEPROD : gene products (genes and proteins)
  • SUBCELLULAR : subcellular components
  • CELL_LINE : cell lines
  • CELL_TYPE : cell types
  • TISSUE : tissues and organs
  • ORGANISM : species
  • DISEASE : diseases (see limitations)
  • EXP_ASSAY : experimental assays

ROLES : the role of entities with regard to the causal hypotheses tested in the reported results. The tags are:

  • CONTROLLED_VAR : entities that are associated with experimental variables and that are subjected to controlled and targeted perturbations.
  • MEASURED_VAR : entities that are associated with the variables measured and are the object of the measurements.

Experimental roles are generated separately for GENEPROD (ROLES_GP) and SMALL_MOLECULE (ROLES_SM); ROLES_MULTI covers both at the same time.
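As an illustration of how the IOB2-style tags described above can be read, the following minimal sketch groups tagged words into labeled spans. The function name `iob2_to_spans` and the example caption are hypothetical and not part of the dataset; the sketch also assumes the tags are available as strings rather than integer ids.

```python
from typing import List, Tuple


def iob2_to_spans(words: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB2-tagged words into (label, text) spans (illustrative helper)."""
    spans: List[Tuple[str, str]] = []
    current_label = None
    current_words: List[str] = []

    def flush() -> None:
        # Store the span collected so far, if any.
        if current_label is not None:
            spans.append((current_label, " ".join(current_words)))

    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new span.
            flush()
            current_label, current_words = tag[2:], [word]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # An I- tag continues the span of the same type.
            current_words.append(word)
        else:
            # "O", or an I- tag that does not continue the open span,
            # closes the open span; the word itself is treated as outside.
            flush()
            current_label, current_words = None, []
    flush()
    return spans


# Made-up example caption, not taken from the dataset.
example_words = ["Western", "blot", "of", "p53", "in", "HeLa", "cells"]
example_tags = ["B-EXP_ASSAY", "I-EXP_ASSAY", "O", "B-GENEPROD", "O", "B-CELL_LINE", "O"]
example_spans = iob2_to_spans(example_words, example_tags)
print(example_spans)
```

The same helper applies unchanged to the role tags (B-CONTROLLED_VAR, B-MEASURED_VAR and their I- counterparts), since they follow the same tagging scheme.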

Languages

The text in the dataset is English.

Dataset Structure

Data Instances

Data Fields

  • words : list of strings; the text tokenized into words.
  • panel_id : ID of the panel to which the example belongs in the SourceData database.
  • label_ids :
    • entity_types : list of strings for the IOB2 tags for entity type; possible value in ["O", "I-SMALL_MOLECULE", "B-SMALL_MOLECULE", "I-GENEPROD", "B-GENEPROD", "I-SUBCELLULAR", "B-SUBCELLULAR", "I-CELL_LINE", "B-CELL_LINE", "I-CELL_TYPE", "B-CELL_TYPE", "I-TISSUE", "B-TISSUE", "I-ORGANISM", "B-ORGANISM", "I-EXP_ASSAY", "B-EXP_ASSAY"]
    • roles : list of strings for the IOB2 tags for experimental roles; values in ["O", "I-CONTROLLED_VAR", "B-CONTROLLED_VAR", "I-MEASURED_VAR", "B-MEASURED_VAR"]
    • panel_start : list of strings for IOB2 tags ["O", "B-PANEL_START"]
    • multi roles : there are two different label sets. labels is the same as in roles. is_category tags GENEPROD and SMALL_MOLECULE.
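To make the panel_start field more concrete, here is a small sketch of how a tokenized caption could be cut into panel legends at each B-PANEL_START tag. The helper `split_into_panels` and the sample caption are hypothetical, and the sketch assumes the tags are given as strings.

```python
from typing import List


def split_into_panels(words: List[str], panel_tags: List[str]) -> List[List[str]]:
    """Split a tokenized caption into panels at each B-PANEL_START tag."""
    panels: List[List[str]] = []
    for word, tag in zip(words, panel_tags):
        # Open a new panel at every B-PANEL_START. Words that appear before
        # the first B-PANEL_START are collected in a leading segment.
        if tag == "B-PANEL_START" or not panels:
            panels.append([])
        panels[-1].append(word)
    return panels


# Made-up caption with two panels, not taken from the dataset.
caption_words = ["(A)", "Immunoblot", "of", "p53", ".", "(B)", "Quantification", "of", "(A)", "."]
caption_tags = ["B-PANEL_START", "O", "O", "O", "O", "B-PANEL_START", "O", "O", "O", "O"]
panels = split_into_panels(caption_words, caption_tags)
for panel in panels:
    print(" ".join(panel))
```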

Data Splits

  • NER and ROLES
  DatasetDict({
      train: Dataset({
          features: ['words', 'labels', 'tag_mask', 'text'],
          num_rows: 55250
      })
      test: Dataset({
          features: ['words', 'labels', 'tag_mask', 'text'],
          num_rows: 6844
      })
      validation: Dataset({
          features: ['words', 'labels', 'tag_mask', 'text'],
          num_rows: 7951
      })
  })
  • PANELIZATION
  DatasetDict({
      train: Dataset({
          features: ['words', 'labels', 'tag_mask'],
          num_rows: 14655
      })
      test: Dataset({
          features: ['words', 'labels', 'tag_mask'],
          num_rows: 1871
      })
      validation: Dataset({
          features: ['words', 'labels', 'tag_mask'],
          num_rows: 2088
      })
  })
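For a quick sense of the split proportions, the row counts listed above can be turned into fractions. This is only arithmetic on the numbers printed in this section; the variable and function names are illustrative.

```python
# Row counts copied from the DatasetDict summaries above.
split_sizes = {
    "NER and ROLES": {"train": 55250, "test": 6844, "validation": 7951},
    "PANELIZATION": {"train": 14655, "test": 1871, "validation": 2088},
}


def split_fractions(sizes):
    """Return each split's share of the total number of rows."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


for task, sizes in split_sizes.items():
    fractions = split_fractions(sizes)
    summary = ", ".join(f"{name}: {frac:.1%}" for name, frac in fractions.items())
    print(f"{task} ({sum(sizes.values())} rows) -> {summary}")
```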

Dataset Creation

Curation Rationale

The dataset was built to train models for the automatic extraction of a knowledge graph from the scientific literature. The dataset can be used to train models for text segmentation, named entity recognition and semantic role labeling.

Source Data

Initial Data Collection and Normalization

Figure legends were annotated according to the SourceData framework described in Liechti et al. 2017 (Nature Methods, 2017, https://doi.org/10.1038/nmeth.4471 ). The curation tool at https://curation.sourcedata.io was used to segment figure legends into panel legends, tag entities, assign experimental roles and normalize with standard identifiers (not available in this dataset). The source data was downloaded from the SourceData API ( https://api.sourcedata.io ) on 21 Jan 2021.

Who are the source language producers?

The examples are extracted from the figure legends from scientific papers in cell and molecular biology.

Annotations

Annotation process

The annotations were produced manually by expert curators from the SourceData project ( https://sourcedata.embo.org ).

Who are the annotators?

Curators of the SourceData project.

Personal and Sensitive Information

None known.

Considerations for Using the Data

Social Impact of Dataset

Not applicable.

Discussion of Biases

The examples are heavily biased towards cell and molecular biology and are enriched in examples from papers published in EMBO Press journals ( https://embopress.org ).

The annotation of diseases was added to the dataset only recently. Although disease tags do appear, their number is very low and they are not consistently applied across the entire dataset. We recommend using the disease annotations only after filtering for the examples that contain them.
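The filtering recommended above could look like the following sketch. It assumes that each example exposes its tags as a list of strings under a `labels` key and that disease mentions are tagged B-DISEASE / I-DISEASE in the same IOB2 style as the other entity types; both are assumptions, and integer label ids would first need to be mapped back to their names. The helper `has_disease` and the sample records are hypothetical.

```python
from typing import Dict, List

# Assumed tag names, following the IOB2 pattern of the other entity types.
DISEASE_TAGS = {"B-DISEASE", "I-DISEASE"}


def has_disease(tags: List[str]) -> bool:
    """Return True if at least one word in the example carries a disease tag."""
    return any(tag in DISEASE_TAGS for tag in tags)


# Made-up records standing in for dataset rows.
examples: List[Dict[str, List[str]]] = [
    {"words": ["Tumors", "from", "melanoma", "patients"],
     "labels": ["O", "O", "B-DISEASE", "O"]},
    {"words": ["Immunoblot", "of", "p53"],
     "labels": ["B-EXP_ASSAY", "O", "B-GENEPROD"]},
]

disease_examples = [ex for ex in examples if has_disease(ex["labels"])]
print(len(disease_examples), "of", len(examples), "examples contain a disease tag")
```

With a dataset loaded through the `datasets` library, the same predicate could be passed to `Dataset.filter`, for instance `ds["train"].filter(lambda ex: has_disease(ex["labels"]))`, provided the labels are stored as strings.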

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

Thomas Lemberger, EMBO. Jorge Abreu Vicente, EMBO

Licensing Information

CC BY 4.0

Citation Information

We are currently working on a paper to present the dataset. It is expected to be ready by spring 2023. In the meantime, please cite the following paper.

  @article {Liechti2017,
      author = {Liechti, Robin and George, Nancy and Götz, Lou and El-Gebali, Sara and Chasapi, Anastasia and Crespo, Isaac and Xenarios, Ioannis and Lemberger, Thomas},
      title = {SourceData - a semantic platform for curating and searching figures},
      year = {2017},
      volume = {14},
      number = {11},
      doi = {10.1038/nmeth.4471},
      URL = {https://doi.org/10.1038/nmeth.4471},
      eprint = {https://www.biorxiv.org/content/early/2016/06/20/058529.full.pdf},
      journal = {Nature Methods}
  }

Contributions

Thanks to @tlemberger and @drAbreu for adding this dataset.