Dataset:
gretelai/symptom_to_diagnosis
This dataset contains 1,065 English-language symptom descriptions written in natural language, each labeled with one of 22 diagnoses. It focuses on fine-grained, single-domain diagnosis.
Each row contains the following fields: input_text (the symptom description) and output_text (the diagnosis label).
Example:
{ "output_text": "drug reaction", "input_text": "I've been having headaches and migraines, and I can't sleep. My whole body shakes and twitches. Sometimes I feel lightheaded." }
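As a minimal sketch of how a row like the one above could be read, the snippet below parses a single record with Python's standard json module. It assumes each row is stored as one JSON object per line (as the .jsonl file names suggest); the helper name parse_record is hypothetical and not part of the dataset.

```python
import json


def parse_record(line: str) -> tuple[str, str]:
    """Parse one JSONL line into (symptom description, diagnosis label).

    Hypothetical helper: assumes the two fields shown in the example row.
    """
    record = json.loads(line)
    return record["input_text"], record["output_text"]


# The example row from this card, written as a single JSONL line.
example_line = (
    '{"output_text": "drug reaction", '
    '"input_text": "I\'ve been having headaches and migraines, and I can\'t sleep. '
    'My whole body shakes and twitches. Sometimes I feel lightheaded."}'
)

symptoms, diagnosis = parse_record(example_line)
print(diagnosis)  # drug reaction
```

Returning a plain tuple keeps the sketch dependency-free; a real pipeline might prefer a dataclass or a dataframe.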
This table contains the count of each diagnosis in the train and test splits.
Index | Diagnosis | train.jsonl | test.jsonl |
---|---|---|---|
0 | drug reaction | 40 | 8 |
1 | allergy | 40 | 10 |
2 | chicken pox | 40 | 10 |
3 | diabetes | 40 | 10 |
4 | psoriasis | 40 | 10 |
5 | hypertension | 40 | 10 |
6 | cervical spondylosis | 40 | 10 |
7 | bronchial asthma | 40 | 10 |
8 | varicose veins | 40 | 10 |
9 | malaria | 40 | 10 |
10 | dengue | 40 | 10 |
11 | arthritis | 40 | 10 |
12 | impetigo | 40 | 10 |
13 | fungal infection | 39 | 9 |
14 | common cold | 39 | 10 |
15 | gastroesophageal reflux disease | 39 | 10 |
16 | urinary tract infection | 39 | 9 |
17 | typhoid | 38 | 9 |
18 | pneumonia | 37 | 10 |
19 | peptic ulcer disease | 37 | 10 |
20 | jaundice | 33 | 7 |
21 | migraine | 32 | 10 |
The data is split into 80% train (853 examples, 167 KB) and 20% test (212 examples, 42 KB).
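The per-diagnosis counts and split proportions can be checked with a short script. The following is a sketch under stated assumptions: the splits are JSON Lines files with an output_text field per row, and the function names count_diagnoses and split_fractions are hypothetical.

```python
import json
from collections import Counter
from typing import Iterable


def count_diagnoses(lines: Iterable[str]) -> Counter:
    """Count rows per diagnosis label, skipping blank lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        counts[json.loads(line)["output_text"]] += 1
    return counts


def split_fractions(n_train: int, n_test: int) -> tuple[float, float]:
    """Return the (train, test) share of the combined example count."""
    total = n_train + n_test
    return n_train / total, n_test / total


# Example counts taken from the split sizes stated in this card.
train_frac, test_frac = split_fractions(853, 212)
print(f"{train_frac:.3f} {test_frac:.3f}")  # 0.801 0.199
```

Because count_diagnoses accepts any iterable of lines, an open file handle can be passed directly, for example `with open("train.jsonl") as f: count_diagnoses(f)`, where the local path is an assumption.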
The data was filtered to remove unwanted categories, and the descriptions were revised using an LLM so that the language is more consistent with how a patient would describe symptoms to a doctor.
This dataset was adapted from the Symptom2Disease dataset on Kaggle.
The symptoms in this dataset were modified from their original format using an LLM and do not contain personal data.
This dataset is licensed under Apache 2.0 and is free to use.