Dataset: medical_questions_pairs
Task: text classification
Language: en
Multilinguality: monolingual
Size: 1K<n<10K
Language creators: other
Annotation creators: expert-generated
Source datasets: original
Preprint: arxiv:2008.13546
License: unknown

This dataset consists of 3048 similar and dissimilar medical question pairs hand-generated and labeled by Curai's doctors. The doctors were given a list of 1524 patient-asked questions randomly sampled from the publicly available crawl of HealthTap. Each question yields one similar and one different pair, produced by following two instructions provided to the labelers.
The first instruction generates a positive question pair (similar) and the second generates a negative question pair (different). With the above instructions, the task was intentionally framed such that positive question pairs can look very different by superficial metrics, and negative question pairs can conversely look very similar. This ensures that the task is not trivial.
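To make the point about superficial metrics concrete, the sketch below computes token-level Jaccard overlap, which is only one possible surface metric (the card does not name a specific one). The three questions are invented for illustration and are not rows from the dataset, and `jaccard` is a hypothetical helper.

```python
import re


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two strings: |A & B| / |A | B|."""
    tokens_a = set(re.findall(r"\w+", a.lower()))
    tokens_b = set(re.findall(r"\w+", b.lower()))
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Invented examples, NOT rows from the dataset.
original = "Can I take ibuprofen for a headache?"
# Same intent, reworded: would be labeled similar (1).
similar = "Is ibuprofen ok to use when my head hurts?"
# Shares most words but needs a different answer: would be labeled different (0).
different = "Can I take ibuprofen for a headache while pregnant?"

print(f"similar pair overlap:   {jaccard(original, similar):.2f}")
print(f"different pair overlap: {jaccard(original, different):.2f}")
```

In this toy case the pair that would be labeled different shares more tokens with the original than the pair that would be labeled similar, which is the situation the labeling instructions were designed to produce.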
The text in the dataset is in English.
The dataset contains four fields: dr_id, question_1, question_2, and label. Eleven doctors took part in the task, so dr_id ranges from 1 to 11. The label is 1 if the question pair is similar and 0 otherwise.
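The field description can be written down as a small schema check. This is a minimal sketch, assuming rows arrive as plain dictionaries; `QuestionPair` and `validate_pair` are hypothetical names, and the example row is a placeholder rather than real data.

```python
from typing import TypedDict


class QuestionPair(TypedDict):
    dr_id: int        # 1 to 11, one id per doctor
    question_1: str
    question_2: str
    label: int        # 1 = similar, 0 = otherwise


def validate_pair(row: dict) -> QuestionPair:
    """Check a row against the field description and return it as a QuestionPair."""
    if not isinstance(row["dr_id"], int) or not 1 <= row["dr_id"] <= 11:
        raise ValueError(f"dr_id out of range: {row['dr_id']!r}")
    if row["label"] not in (0, 1):
        raise ValueError(f"label must be 0 or 1, got {row['label']!r}")
    for key in ("question_1", "question_2"):
        if not isinstance(row[key], str) or not row[key].strip():
            raise ValueError(f"{key} must be a non-empty string")
    return QuestionPair(
        dr_id=row["dr_id"],
        question_1=row["question_1"],
        question_2=row["question_2"],
        label=row["label"],
    )


# Placeholder row for illustration only, not taken from the dataset.
example = validate_pair(
    {"dr_id": 3, "question_1": "placeholder A", "question_2": "placeholder B", "label": 1}
)
```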
The dataset currently consists of a single split (train), but it can be split further as required.
| | train |
|---|---|
| Non-similar question pairs | 1524 |
| Similar question pairs | 1524 |
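Since only a train split is provided, a held-out set has to be carved out by the user. The following is a minimal sketch of one way to do that with the standard library, keeping the 50/50 label balance in both parts. `stratified_split` is a hypothetical helper, the 80/20 ratio is an assumption, and the rows are synthetic stand-ins that only mimic the card's field names.

```python
import random
from collections import defaultdict


def stratified_split(rows, test_fraction=0.2, seed=0):
    """Split rows into (train, test), preserving the label ratio in both parts."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row["label"]].append(row)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Synthetic stand-in rows; the question text is a placeholder, not dataset content.
rows = [
    {"dr_id": 1 + i % 11, "question_1": f"q{i}", "question_2": f"q{i} variant", "label": i % 2}
    for i in range(100)
]
train, test = stratified_split(rows, test_fraction=0.2, seed=0)
print(len(train), len(test))
```

This sketch splits row by row. Because every source question contributes both a similar and a different pair, the two pairs derived from the same question can land on opposite sides of the split; grouping by the source question before splitting may be preferable when that overlap matters.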
1524 patient-asked questions randomly sampled from the publicly available crawl of HealthTap
Initial Data Collection and Normalization
[More Information Needed]
Who are the source language producers?
[More Information Needed]
Annotation process
The doctors were given a list of 1524 patient-asked questions randomly sampled from the publicly available crawl of HealthTap. Each question yields one similar and one different pair, produced by following two instructions provided to the labelers.
The first instruction generates a positive question pair (similar) and the second generates a negative question pair (different). With the above instructions, the task was intentionally framed such that positive question pairs can look very different by superficial metrics, and negative question pairs can conversely look very similar. This ensures that the task is not trivial.
Who are the annotators?
Curai's doctors
@misc{mccreery2020effective,
  title={Effective Transfer Learning for Identifying Similar Questions: Matching User Questions to COVID-19 FAQs},
  author={Clara H. McCreery and Namit Katariya and Anitha Kannan and Manish Chablani and Xavier Amatriain},
  year={2020},
  eprint={2008.13546},
  archivePrefix={arXiv},
  primaryClass={cs.IR}
}
Thanks to @tuner007 for adding this dataset.