Datasets:

biosses

Languages:

en

Multilinguality:

monolingual

Size Categories:

n<1K

Language Creators:

found

Annotations Creators:

expert-generated

Source Datasets:

original

License:

gpl-3.0

Dataset Card for BIOSSES

Dataset Summary

BIOSSES is a benchmark dataset for biomedical sentence similarity estimation. The dataset comprises 100 sentence pairs, in which each sentence was selected from the TAC (Text Analysis Conference) Biomedical Summarization Track Training Dataset containing articles from the biomedical domain. The sentence pairs in BIOSSES were selected from citing sentences, i.e. sentences that have a citation to a reference article.

The sentence pairs were evaluated by five human experts, who judged their similarity and gave scores ranging from 0 (no relation) to 4 (equivalent). In the original paper, the mean of the scores assigned by the five annotators was taken as the gold standard, and the Pearson correlation between the gold standard scores and the scores estimated by the models was used as the evaluation metric. The strength of a correlation can be assessed with the general guideline proposed by Evans (1996):

  • very strong: 0.80–1.00
  • strong: 0.60–0.79
  • moderate: 0.40–0.59
  • weak: 0.20–0.39
  • very weak: 0.00–0.19
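The evaluation described above can be sketched in a few lines of Python. This is a minimal illustration, not the original authors' code: the function names `pearson_r` and `evans_strength` are hypothetical, the sample scores are made up, and applying the guideline to the absolute value of r is an assumption, since the list above only covers non-negative values.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def evans_strength(r):
    """Map a correlation to the verbal label from the guideline above.

    Assumption: the label is chosen from |r|, so negative correlations
    are labelled by their magnitude.
    """
    magnitude = abs(r)
    if magnitude >= 0.80:
        return "very strong"
    if magnitude >= 0.60:
        return "strong"
    if magnitude >= 0.40:
        return "moderate"
    if magnitude >= 0.20:
        return "weak"
    return "very weak"


# Made-up illustrative values, not taken from BIOSSES.
gold = [0.0, 1.2, 2.2, 3.4, 4.0]
predicted = [0.3, 1.0, 2.5, 3.0, 3.9]
r = pearson_r(gold, predicted)
print(f"r = {r:.3f} ({evans_strength(r)})")
```

In practice a library routine such as `scipy.stats.pearsonr` would normally replace the hand-written `pearson_r`; the explicit version is shown only to keep the sketch dependency-free.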

Supported Tasks and Leaderboards

Biomedical Semantic Similarity Scoring.

Languages

English.

Dataset Structure

Data Instances

Each instance consists of two sentences (sentence 1 and sentence 2) and their similarity score (the mean of the scores assigned by the five human annotators).

{'sentence 1': 'Here, looking for agents that could specifically kill KRAS mutant cells, they found that knockdown of GATA2 was synthetically lethal with KRAS mutation',
 'sentence 2': 'Not surprisingly, GATA2 knockdown in KRAS mutant cells resulted in a striking reduction of active GTP-bound RHO proteins, including the downstream ROCK kinase',
 'score': 2.2}

Data Fields

  • sentence 1 : string
  • sentence 2 : string
  • score : float ranging from 0 (no relation) to 4 (equivalent)
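As a rough sketch of how these fields might be handled in code, the snippet below wraps one record in a small validated type and derives a gold score as the mean of five annotator scores, as the summary describes. The class `SentencePair` and the helpers `from_record` and `gold_score` are hypothetical names introduced for illustration; the dictionary keys follow the field names listed above, and the sample values are made up.

```python
from dataclasses import dataclass
from statistics import mean

MIN_SCORE, MAX_SCORE = 0.0, 4.0  # 0 = no relation, 4 = equivalent


@dataclass(frozen=True)
class SentencePair:
    """One instance: two sentences and their similarity score."""
    sentence_1: str
    sentence_2: str
    score: float

    def __post_init__(self):
        # Reject scores outside the documented 0-4 range.
        if not MIN_SCORE <= self.score <= MAX_SCORE:
            raise ValueError(f"score {self.score} is outside [0, 4]")


def from_record(record):
    """Build a SentencePair from a dict keyed like the data fields above."""
    return SentencePair(
        sentence_1=record["sentence 1"],
        sentence_2=record["sentence 2"],
        score=float(record["score"]),
    )


def gold_score(annotator_scores):
    """Mean of the five annotator scores, used as the gold standard."""
    if len(annotator_scores) != 5:
        raise ValueError("expected exactly five annotator scores")
    return mean(annotator_scores)


# Made-up example record, not taken from BIOSSES.
pair = from_record({
    "sentence 1": "Protein A binds protein B.",
    "sentence 2": "Protein B is bound by protein A.",
    "score": gold_score([4, 4, 3, 4, 4]),
})
print(pair.score)
```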

Data Splits

No data splits provided.

Dataset Creation

Curation Rationale

Source Data

The TAC (Text Analysis Conference) Biomedical Summarization Track Training Dataset.

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

The sentence pairs were evaluated by five human experts, who judged their similarity and gave scores ranging from 0 (no relation) to 4 (equivalent). The score range was defined following the guidelines of SemEval 2012 Task 6 on STS (Agirre et al., 2012). In addition to the annotation instructions, the annotators were given example sentences from the biomedical literature for each degree of similarity.

The table below shows the Pearson correlation between each annotator's scores and the average scores of the remaining four annotators. The annotators' scores are strongly associated. The lowest correlation is 0.902, which can be considered an upper bound for an algorithmic measure evaluated on this dataset.

              Correlation r
Annotator A   0.952
Annotator B   0.958
Annotator C   0.917
Annotator D   0.902
Annotator E   0.941
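
The agreement figures in the table are leave-one-out correlations: each annotator is compared with the mean of the others. A minimal sketch of that computation follows, assuming the per-annotator scores are available as lists aligned by sentence pair. The function names are hypothetical, and the three annotators and their scores are made up for illustration; they are not the BIOSSES annotations.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def leave_one_out_correlations(scores_by_annotator):
    """Correlate each annotator with the mean of all other annotators.

    scores_by_annotator maps an annotator name to a list of scores,
    one per sentence pair, in the same order for every annotator.
    """
    results = {}
    for name, own_scores in scores_by_annotator.items():
        others = [s for other, s in scores_by_annotator.items() if other != name]
        # Per-pair mean over the remaining annotators.
        mean_of_others = [sum(column) / len(column) for column in zip(*others)]
        results[name] = pearson_r(own_scores, mean_of_others)
    return results


# Made-up scores for three hypothetical annotators over five sentence pairs.
example_scores = {
    "A": [0, 1, 2, 3, 4],
    "B": [0, 2, 2, 3, 4],
    "C": [1, 1, 2, 4, 4],
}
agreement = leave_one_out_correlations(example_scores)
for name, r in agreement.items():
    print(f"Annotator {name}: r = {r:.3f}")
```
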
Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

  • Gizem Soğancıoğlu, gizemsogancioglu@gmail.com
  • Hakime Öztürk, hakime.ozturk@boun.edu.tr
  • Arzucan Özgür, gizemsogancioglu@gmail.com

Bogazici University, Istanbul, Turkey

Licensing Information

BIOSSES is made available under the terms of the GNU General Public License v3.0.

Citation Information

@article{souganciouglu2017biosses,
  title={BIOSSES: a semantic sentence similarity estimation system for the biomedical domain},
  author={So{\u{g}}anc{\i}o{\u{g}}lu, Gizem and {\"O}zt{\"u}rk, Hakime and {\"O}zg{\"u}r, Arzucan},
  journal={Bioinformatics},
  volume={33},
  number={14},
  pages={i49--i58},
  year={2017},
  publisher={Oxford University Press}
}

Contributions

Thanks to @bwang482 for adding this dataset.