Datasets:

qanastek/Biosses-BLUE

Languages:

en

Multilinguality:

monolingual

Size categories:

n<1K

Language creators:

found

Annotations creators:

expert-generated

Source datasets:

original

License:

gpl-3.0

Dataset Card for BIOSSES

Dataset Summary

BIOSSES is a benchmark dataset for biomedical sentence similarity estimation. The dataset comprises 100 sentence pairs, in which each sentence was selected from the TAC (Text Analysis Conference) Biomedical Summarization Track Training Dataset containing articles from the biomedical domain. The sentence pairs in BIOSSES were selected from citing sentences, i.e. sentences that have a citation to a reference article.

Each sentence pair was evaluated by five human experts, who judged its similarity and assigned a score ranging from 0 (no relation) to 4 (equivalent). In the original paper, the mean of the scores assigned by the five annotators was taken as the gold standard, and the Pearson correlation between the gold standard scores and the scores estimated by the models was used as the evaluation metric. The strength of the correlation can be assessed using the general guideline proposed by Evans (1996):

  • very strong: 0.80–1.00
  • strong: 0.60–0.79
  • moderate: 0.40–0.59
  • weak: 0.20–0.39
  • very weak: 0.00–0.19
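The evaluation described above can be sketched in plain Python. This is a minimal illustration, not the authors' code: the function names `gold_standard`, `pearson_r`, and `evans_strength` and the toy scores are assumptions made for the example. Applying the bands to the absolute value of r, and treating each lower bound as inclusive, are also assumptions, since the guideline only lists non-negative ranges rounded to two decimals.

```python
from math import sqrt
from statistics import mean


def gold_standard(annotator_scores):
    """Average the per-annotator scores (0-4) of each sentence pair."""
    return [mean(scores) for scores in annotator_scores]


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def evans_strength(r):
    """Map |r| to the verbal bands of the Evans (1996) guideline."""
    magnitude = abs(r)  # assumption: sign is ignored when naming the band
    if magnitude >= 0.80:
        return "very strong"
    if magnitude >= 0.60:
        return "strong"
    if magnitude >= 0.40:
        return "moderate"
    if magnitude >= 0.20:
        return "weak"
    return "very weak"


# Toy data (hypothetical, not taken from BIOSSES): five annotator scores per pair.
annotations = [
    [0, 0, 1, 0, 0],
    [2, 3, 2, 2, 3],
    [4, 4, 3, 4, 4],
    [1, 2, 1, 1, 1],
]
gold = gold_standard(annotations)
predicted = [0.5, 2.0, 3.5, 1.5]  # hypothetical model outputs

r = pearson_r(gold, predicted)
print(f"r = {r:.3f} ({evans_strength(r)})")
```

With only 100 pairs in total, a correlation computed on a single split rests on very few points, so small differences between models should be read with caution.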

Data Splits (From BLUE Benchmark)

name     Train  Dev  Test
biosses  64     16   20

Supported Tasks and Leaderboards

Biomedical Semantic Similarity Scoring.

Languages

English.

Dataset Structure

Data Instances

Each instance consists of two sentences (sentence1 and sentence2) and their similarity score (the mean of the scores assigned by the five human annotators).

{
    "id": "0",
    "sentence1": "Centrosomes increase both in size and in microtubule-nucleating capacity just before mitotic entry.", 
    "sentence2": "Functional studies showed that, when introduced into cell lines, miR-146a was found to promote cell proliferation in cervical cancer cells, which suggests that miR-146a works as an oncogenic miRNA in these cancers.",
    "score": 0.0
}

Data Fields

  • id : string
  • sentence1 : string
  • sentence2 : string
  • score : float ranging from 0 (no relation) to 4 (equivalent)
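The fields listed above can be checked with a small validator. The sketch below is illustrative only: the class name `BiossesPair`, the helper `parse_pair`, and the placeholder sentences are hypothetical, and treating `id` as a string follows the example instance rather than any documented schema.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class BiossesPair:
    id: str
    sentence1: str
    sentence2: str
    score: float  # mean annotator score, expected in [0, 4]


def parse_pair(raw: str) -> BiossesPair:
    """Parse one JSON record and check it against the expected fields."""
    data = json.loads(raw)
    pair = BiossesPair(
        id=str(data["id"]),
        sentence1=data["sentence1"],
        sentence2=data["sentence2"],
        score=float(data["score"]),
    )
    if not (pair.sentence1 and pair.sentence2):
        raise ValueError("both sentences must be non-empty")
    if not 0.0 <= pair.score <= 4.0:
        raise ValueError(f"score {pair.score} is outside the 0-4 range")
    return pair


# Placeholder record (hypothetical sentences, same keys as the example instance).
record = (
    '{"id": "0", "sentence1": "First sentence.", '
    '"sentence2": "Second sentence.", "score": 0.0}'
)
pair = parse_pair(record)
print(pair.id, pair.score)
```

Because the gold score is a mean of five integer ratings, non-integer values such as 2.4 are expected and should not be rejected.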

Dataset Creation

Curation Rationale

Source Data

The TAC (Text Analysis Conference) Biomedical Summarization Track Training Dataset.

Annotations

Annotation process

Each sentence pair was evaluated by five human experts, who judged its similarity and assigned a score ranging from 0 (no relation) to 4 (equivalent). The score range was defined following the guidelines of SemEval 2012 Task 6 on STS (Agirre et al., 2012). In addition to the annotation instructions, the annotators were given example sentences from the biomedical literature for each similarity degree.

The table below shows the Pearson correlation between each annotator's scores and the average scores of the remaining four annotators. The annotators' scores are strongly associated with one another. The lowest correlation is 0.902, which can be considered an upper bound for an algorithmic measure evaluated on this dataset.

Annotator    Correlation r
Annotator A  0.952
Annotator B  0.958
Annotator C  0.917
Annotator D  0.902
Annotator E  0.941

Additional Information

Dataset Curators

  • Gizem Soğancıoğlu, gizemsogancioglu@gmail.com
  • Hakime Öztürk, hakime.ozturk@boun.edu.tr
  • Arzucan Özgür, gizemsogancioglu@gmail.com

Bogazici University, Istanbul, Turkey

Licensing Information

BIOSSES is made available under the terms of the GNU General Public License v3.0.

Citation Information

@article{10.1093/bioinformatics/btx238,
    author = {Soğancıoğlu, Gizem and Öztürk, Hakime and Özgür, Arzucan},
    title = "{BIOSSES: a semantic sentence similarity estimation system for the biomedical domain}",
    journal = {Bioinformatics},
    volume = {33},
    number = {14},
    pages = {i49-i58},
    year = {2017},
    month = {07},
    abstract = "{The amount of information available in textual format is rapidly increasing in the biomedical domain. Therefore, natural language processing (NLP) applications are becoming increasingly important to facilitate the retrieval and analysis of these data. Computing the semantic similarity between sentences is an important component in many NLP tasks including text retrieval and summarization. A number of approaches have been proposed for semantic sentence similarity estimation for generic English. However, our experiments showed that such approaches do not effectively cover biomedical knowledge and produce poor results for biomedical text. We propose several approaches for sentence-level semantic similarity computation in the biomedical domain, including string similarity measures and measures based on the distributed vector representations of sentences learned in an unsupervised manner from a large biomedical corpus. In addition, ontology-based approaches are presented that utilize general and domain-specific ontologies. Finally, a supervised regression based model is developed that effectively combines the different similarity computation metrics. A benchmark data set consisting of 100 sentence pairs from the biomedical literature is manually annotated by five human experts and used for evaluating the proposed methods. The experiments showed that the supervised semantic sentence similarity computation approach obtained the best performance (0.836 correlation with gold standard human annotations) and improved over the state-of-the-art domain-independent systems up to 42.6\% in terms of the Pearson correlation metric. A web-based system for biomedical semantic sentence similarity computation, the source code, and the annotated benchmark data set are available at: http://tabilab.cmpe.boun.edu.tr/BIOSSES/.}",
    issn = {1367-4803},
    doi = {10.1093/bioinformatics/btx238},
    url = {https://doi.org/10.1093/bioinformatics/btx238},
    eprint = {https://academic.oup.com/bioinformatics/article-pdf/33/14/i49/25157316/btx238.pdf},
}

Contributions

Thanks to @qanastek for adding this dataset.