Datasets:
ncbi_disease
Tasks:
token-classification
Languages:
en
Multilinguality:
monolingual
Size categories:
1K<n<10K
Language creators:
expert-generated
Annotations creators:
expert-generated
Source datasets:
original
Licenses:
license:unknown

This dataset contains the disease name and concept annotations of the NCBI disease corpus, a collection of 793 PubMed abstracts fully annotated at the mention and concept level to serve as a research resource for the biomedical natural language processing community.
Named Entity Recognition: Leaderboard
The text in the dataset is in English. The associated BCP-47 code is en.
Instances of the dataset contain an array of tokens, ner_tags and an id. An example of an instance of the dataset:
{ 'tokens': ['Identification', 'of', 'APC2', ',', 'a', 'homologue', 'of', 'the', 'adenomatous', 'polyposis', 'coli', 'tumour', 'suppressor', '.'], 'ner_tags': [0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 0, 0], 'id': '0' }
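The integer ner_tags in the instance above can be turned back into disease mentions. The sketch below is illustrative only: it assumes the tag ids follow a BIO scheme with 0 = O, 1 = B-Disease and 2 = I-Disease, a mapping that is not stated in this section, and the helper name extract_mentions is hypothetical.

```python
# Assumed label mapping (not stated in this section): 0 = O, 1 = B-Disease, 2 = I-Disease.
LABELS = ["O", "B-Disease", "I-Disease"]


def extract_mentions(tokens, ner_tags):
    """Group consecutive B-/I- tagged tokens into disease mention strings."""
    mentions = []
    current = []
    for token, tag in zip(tokens, ner_tags):
        label = LABELS[tag]
        if label == "B-Disease":
            # A new mention starts; close any mention that is still open.
            if current:
                mentions.append(" ".join(current))
            current = [token]
        elif label == "I-Disease" and current:
            # Continue the open mention.
            current.append(token)
        else:
            # Outside a mention (or a stray I- tag): close any open mention.
            if current:
                mentions.append(" ".join(current))
            current = []
    if current:
        mentions.append(" ".join(current))
    return mentions


example = {
    "tokens": ["Identification", "of", "APC2", ",", "a", "homologue", "of", "the",
               "adenomatous", "polyposis", "coli", "tumour", "suppressor", "."],
    "ner_tags": [0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 0, 0],
    "id": "0",
}

mentions = extract_mentions(example["tokens"], example["ner_tags"])
print(mentions)
```

Under the assumed mapping, the single mention recovered from this instance spans the four tokens tagged 1, 2, 2, 2.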
The data is split into a train (5433 instances), validation (924 instances) and test set (941 instances).
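As a quick check of how the instances are distributed, the split sizes quoted above can be converted into proportions. This is plain arithmetic on the stated counts; the variable names are only illustrative.

```python
# Split sizes as stated in the dataset card.
split_sizes = {"train": 5433, "validation": 924, "test": 941}

total = sum(split_sizes.values())
fractions = {name: size / total for name, size in split_sizes.items()}

for name, size in split_sizes.items():
    print(f"{name}: {size} instances ({fractions[name]:.1%} of {total})")
```

The training split holds roughly three quarters of the instances, with the remainder divided almost evenly between validation and test.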
The goal of the dataset is to improve the state of the art in disease name recognition and normalization research by providing a high-quality gold standard, thus enabling the development of machine-learning-based approaches for such tasks.
The dataset consists of abstracts extracted from PubMed.
Who are the source language producers?
The source language producers are the authors of publication abstracts hosted in PubMed.
Each PubMed abstract was manually annotated by two annotators with disease mentions and their corresponding concepts in Medical Subject Headings (MeSH®) or Online Mendelian Inheritance in Man (OMIM®). Manual curation was performed using PubTator, which allowed the use of pre-annotations as a pre-step to manual annotation. Fourteen annotators were randomly paired, and differing annotations were discussed to reach a consensus in two annotation phases. Finally, all results were checked against annotations of the rest of the corpus to ensure corpus-wide consistency.
Who are the annotators?
The annotator group consisted of 14 people with backgrounds in biomedical informatics research and experience in biomedical text corpus annotation.
[N/A]
Information encoded in natural language in biomedical literature publications is only useful if efficient and reliable ways of accessing and analyzing that information are available. Natural language processing and text mining tools are therefore essential for extracting valuable information. This dataset provides an annotated corpus that can be used to develop highly effective tools to automatically detect central biomedical concepts such as diseases.
To avoid annotator bias, pairs of annotators were chosen randomly for each set, so that each pair of annotators overlapped for at most two sets.
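The pairing constraint described above can be illustrated with a small sketch. This is not the procedure the corpus authors used: the sampling strategy, the function name assign_pairs, the annotator names and the number of sets (40) are all hypothetical. Only the 14 annotators and the limit of two sets per pair come from the text.

```python
import random
from collections import Counter
from itertools import combinations


def assign_pairs(annotators, num_sets, max_sets_per_pair=2, seed=0):
    """Pick a random annotator pair per set, reusing each pair at most max_sets_per_pair times."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    all_pairs = list(combinations(annotators, 2))
    usage = Counter()
    assignment = []
    for _ in range(num_sets):
        # Only pairs that have not yet reached the reuse limit are eligible.
        available = [pair for pair in all_pairs if usage[pair] < max_sets_per_pair]
        if not available:
            raise ValueError("not enough distinct pairs for the requested number of sets")
        pair = rng.choice(available)
        usage[pair] += 1
        assignment.append(pair)
    return assignment


# 14 annotators, as stated in the card; the number of sets is a made-up value.
annotators = [f"annotator_{i:02d}" for i in range(1, 15)]
assignment = assign_pairs(annotators, num_sets=40)

for set_index, pair in enumerate(assignment[:5], start=1):
    print(f"set {set_index}: {pair[0]} + {pair[1]}")
```

With 14 annotators there are 91 distinct pairs, so under this sketch the limit of two sets per pair allows at most 182 sets before the function raises an error.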
A handful of disease concepts were discovered that were not included in MEDIC. For those, we decided to include the appropriate OMIM identifiers.
In addition, certain disease mentions were found to not be easily represented using the standard categorizations.
Also, each PMID document was pre-annotated using the Inference Method developed for disease name normalization, which properly handles abbreviation recognition, robust string matching, etc. As such, human annotators were given the pre-annotated documents as a starting point and allowed to see each pre-annotation with a computed confidence.
Rezarta Islamaj Doğan, Robert Leaman, Zhiyong Lu
PUBLIC DOMAIN NOTICE

This work is a "United States Government Work" under the terms of the United States Copyright Act. It was written as part of the authors' official duties as a United States Government employee and thus cannot be copyrighted within the United States. The data is freely available to the public for use. The National Library of Medicine and the U.S. Government have not placed any restriction on its use or reproduction.

Although all reasonable efforts have been taken to ensure the accuracy and reliability of the data and its source code, the NLM and the U.S. Government do not and cannot warrant the performance or results that may be obtained by using it. The NLM and the U.S. Government disclaim all warranties, express or implied, including warranties of performance, merchantability or fitness for any particular purpose.

Please cite the authors in any work or product based on this material:

An improved corpus of disease mentions in PubMed citations
http://aclweb.org/anthology-new/W/W12/W12-2411.pdf

NCBI Disease Corpus: A Resource for Disease Name Recognition and Normalization
http://www.ncbi.nlm.nih.gov/pubmed/24393765

Disease Name Normalization with Pairwise Learning to Rank
http://www.ncbi.nlm.nih.gov/pubmed/23969135
@article{dougan2014ncbi,
  title={NCBI disease corpus: a resource for disease name recognition and concept normalization},
  author={Do{\u{g}}an, Rezarta Islamaj and Leaman, Robert and Lu, Zhiyong},
  journal={Journal of biomedical informatics},
  volume={47},
  pages={1--10},
  year={2014},
  publisher={Elsevier}
}
Thanks to @edugp for adding this dataset.