Dataset:

zhengyun21/PMC-Patients

Dataset Card for PMC-Patients

Dataset Summary

PMC-Patients is a first-of-its-kind dataset consisting of 167k patient summaries extracted from case reports in PubMed Central (PMC), together with 3.1M patient-article relevance annotations and 293k patient-patient similarity annotations defined by the PubMed citation graph.

Supported Tasks and Leaderboards

This repository contains only the patient summaries and their relational annotations. For the ReCDS benchmark, refer to this dataset.

Based on PMC-Patients, we define two tasks to benchmark Retrieval-based Clinical Decision Support (ReCDS) systems: Patient-to-Article Retrieval (PAR) and Patient-to-Patient Retrieval (PPR). For details, please refer to our paper and leaderboard.

Languages

English (en).

Dataset Structure

PMC-Patients_full.json

This file contains all information about the patient summaries in PMC-Patients. It is a list of dicts with the following keys:

  • patient_id : string. A sequential ID for each patient, starting from 0.
  • patient_uid : string. Unique ID for each patient, in the format PMID-x, where PMID is the PubMed Identifier of the patient's source article and x is the index of the patient within that article.
  • PMID : string. PMID of the source article.
  • file_path : string. File path of the XML file of the source article.
  • title : string. Source article title.
  • patient : string. Patient summary.
  • age : list of tuples. Each entry has the format (value, unit), where value is a float and unit is one of 'year', 'month', 'week', 'day', or 'hour'. For example, [[1.0, 'year'], [2.0, 'month']] indicates that the patient is a one-year-and-two-month-old infant.
  • gender : 'M' or 'F'. Male or Female.
  • similar_patients : list of string. patient_uid of the similar patients.
  • relevant_articles : list of string. PMID of the relevant articles.
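
As a rough illustration of how these fields might be consumed, the sketch below loads the JSON list and derives two convenience values: the age as a single number of years, and the (PMID, index) pair encoded in patient_uid. The helper names, the unit conversion factors, and the sample record are assumptions made for this example and are not part of the dataset or its official tooling.

```python
import json

# Approximate conversion factors to years. These are an assumption of this
# sketch; the dataset card does not define how units should be combined.
UNIT_IN_YEARS = {
    "year": 1.0,
    "month": 1.0 / 12,
    "week": 7.0 / 365.25,
    "day": 1.0 / 365.25,
    "hour": 1.0 / (365.25 * 24),
}


def age_in_years(age):
    """Collapse a list of (value, unit) entries into one float, in years."""
    return sum(float(value) * UNIT_IN_YEARS[unit] for value, unit in age)


def split_patient_uid(patient_uid):
    """Split 'PMID-x' into the PMID string and the integer index x."""
    pmid, _, index = patient_uid.rpartition("-")
    return pmid, int(index)


def load_patients(path):
    """Read the JSON file, which is described as a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Hypothetical record with invented values, shaped like the keys listed above.
example = {
    "patient_id": "0",
    "patient_uid": "1234567-1",
    "PMID": "1234567",
    "age": [[1.0, "year"], [2.0, "month"]],
    "gender": "F",
    "similar_patients": ["7654321-2"],
    "relevant_articles": ["7654321"],
}

print(round(age_in_years(example["age"]), 3))     # 1.167
print(split_patient_uid(example["patient_uid"]))  # ('1234567', 1)
```

Calling load_patients with the path to a local copy of the JSON file should return the full list, after which the same helpers can be applied to each record.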

Dataset Creation

If you are interested in how PMC-Patients was collected or in reproducing our baselines, please refer to this repository.

Citation Information

If you find PMC-Patients helpful in your research, please cite our work:

@misc{zhao2023pmcpatients,
      title={PMC-Patients: A Large-scale Dataset of Patient Summaries and Relations for Benchmarking Retrieval-based Clinical Decision Support Systems}, 
      author={Zhengyun Zhao and Qiao Jin and Fangyuan Chen and Tuorui Peng and Sheng Yu},
      year={2023},
      eprint={2202.13876},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}