Dataset: medical_dialog
Task: question answering
Sub-task: closed-domain-qa
Multilinguality: monolingual
Size: 1M<n<10M
Annotations creators: found
Source datasets: original
Preprint: arxiv:2004.03329
License: unknown
The MedDialog dataset (Chinese) contains conversations (in Chinese) between doctors and patients. It has 1.1 million dialogues and 4 million utterances. The data is continuously growing and more dialogues will be added. The raw dialogues are from haodf.com. All copyrights of the data belong to haodf.com.
The MedDialog dataset (English) contains conversations (in English) between doctors and patients. It has 0.26 million dialogues. The data is continuously growing and more dialogues will be added. The raw dialogues are from healthcaremagic.com and icliniq.com. All copyrights of the data belong to healthcaremagic.com and icliniq.com.
Directions for using the pre-trained BERT model with PyTorch are available on the homepage.
Closed-domain QA
Monolingual. The datasets are in English (EN) and Chinese (ZH).
There are 4 configurations: en and zh (raw data), and processed.en and processed.zh (processed data).
en
Each consultation consists of the below:
The dataset is built from icliniq.com, healthcaremagic.com, and healthtap.com, and all copyrights of the data belong to these websites.
zh
Each consultation consists of the below:
The dataset is built from Haodf.com and all copyrights of the data belong to Haodf.com.
One example for Chinese is:
{'dialogue_id': 2, 'dialogue_turns': [{'speaker': '病人', 'utterance': '孩子哭闹时,鸡鸡旁边会肿起,情绪平静时肿块会消失,去一个私人诊所看过,说是疝气.如果确定是疝气,是不是一定要手术治疗?我孩子只有1岁10月,自愈的可能性大吗?如果一定要手术,这么小的孩子风险大吗?术后的恢复困难吗?谢谢.'}, {'speaker': '医生', 'utterance': '南方医的B超说得不清楚,可能是鞘膜积液,可到我医院复查一个B超。'}], 'dialogue_url': 'https://www.haodf.com/doctorteam/flow_team_6477251152.htm', 'file_name': '2020.txt'}
processed.en
{ 'description': 'throat a bit sore and want to get a good imune booster, especially in light of the virus. please advise. have not been in contact with nyone with the virus.', 'utterances': [ 'patient: throat a bit sore and want to get a good imune booster, especially in light of the virus. please advise. have not been in contact with nyone with the virus.', "doctor: during this pandemic. throat pain can be from a strep throat infection (antibiotics needed), a cold or influenza or other virus, or from some other cause such as allergies or irritants. usually, a person sees the doctor (call first) if the sore throat is bothersome, recurrent, or doesn't go away quickly. covid-19 infections tend to have cough, whereas strep throat usually lacks cough but has more throat pain. (3/21/20)" ] }
processed.zh
{ 'utterances': [ '病人:强制性脊柱炎,晚上睡觉翻身时腰骶骨区域疼痛,其他身体任何部位均不疼痛。', '医生:应该没有问题,但最好把图像上传看看。' ] }
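The processed examples above store each turn as one string prefixed with the speaker label. As a minimal sketch (not part of the dataset's own tooling), the strings could be split back into speaker and text and paired into question/answer tuples. The helper names `split_utterance` and `to_qa_pairs` are hypothetical, and the code assumes the label is everything before the first colon (ASCII or full-width).

```python
from typing import List, Tuple

# Speaker labels as they appear in the processed examples.
PATIENT_LABELS = ("patient", "病人")
DOCTOR_LABELS = ("doctor", "医生")


def split_utterance(utterance: str) -> Tuple[str, str]:
    """Split 'speaker: text' at the first ASCII or full-width colon."""
    for i, ch in enumerate(utterance):
        if ch in (":", "："):
            return utterance[:i].strip(), utterance[i + 1:].strip()
    # No colon found: treat the whole string as text with no speaker.
    return "", utterance.strip()


def to_qa_pairs(utterances: List[str]) -> List[Tuple[str, str]]:
    """Pair each patient turn with the next doctor turn that follows it."""
    pairs = []
    pending_question = None
    for utterance in utterances:
        speaker, text = split_utterance(utterance)
        if speaker in PATIENT_LABELS:
            pending_question = text
        elif speaker in DOCTOR_LABELS and pending_question is not None:
            pairs.append((pending_question, text))
            pending_question = None
    return pairs


# The processed.zh example shown above.
example = {
    "utterances": [
        "病人:强制性脊柱炎,晚上睡觉翻身时腰骶骨区域疼痛,其他身体任何部位均不疼痛。",
        "医生:应该没有问题,但最好把图像上传看看。",
    ]
}

qa_pairs = to_qa_pairs(example["utterances"])
```

Splitting only at the first colon keeps any later colons inside the utterance text intact. Consecutive patient turns are not merged in this sketch; only the most recent one is paired with the doctor's reply.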
For generating the QA, only the fields below have been considered:
These are arranged as below in the prepared dataset. Each item will be represented with these parameters.
There are no data splits on the original raw data. The "train" split for each language contains:
For processed configurations, data is split into train, validation and test, with the following number of examples:
configuration | train | validation | test |
---|---|---|---|
processed.en | 482 | 60 | 61 |
processed.zh | 2725989 | 340748 | 340754 |
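To make the split proportions concrete, the counts in the table can be turned into fractions of each configuration's total. This is only an arithmetic sketch over the numbers listed above; `split_sizes` and `split_fractions` are hypothetical names introduced for illustration.

```python
# Example counts copied from the table above.
split_sizes = {
    "processed.en": {"train": 482, "validation": 60, "test": 61},
    "processed.zh": {"train": 2725989, "validation": 340748, "test": 340754},
}


def split_fractions(sizes):
    """Return each split's share of the configuration's total examples."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


for config, sizes in split_sizes.items():
    fractions = split_fractions(sizes)
    summary = ", ".join(f"{name}: {frac:.1%}" for name, frac in fractions.items())
    print(f"{config} (total {sum(sizes.values())}): {summary}")
```

Both processed configurations come out at roughly an 80/10/10 train/validation/test split.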
Medical dialogue systems are promising in assisting in telemedicine to increase access to healthcare services, improve the quality of patient care, and reduce medical costs.
Who are the source language producers?
[More Information Needed]
Who are the annotators?
[More Information Needed]
Unknown.
@article{chen2020meddiag,
  title={MedDialog: a large-scale medical dialogue dataset},
  author={Chen, Shu and Ju, Zeqian and Dong, Xiangyu and Fang, Hongchao and Wang, Sicheng and Yang, Yue and Zeng, Jiaqi and Zhang, Ruisi and Zhang, Ruoyu and Zhou, Meng and Zhu, Penghui and Xie, Pengtao},
  journal={arXiv preprint arXiv:2004.03329},
  year={2020}
}
Thanks to @vrindaprabhu for adding this dataset.