This repository contains a dataset of 3048 similar and dissimilar medical question pairs hand-generated and labeled by Curai's doctors. The dataset is described in detail in our paper.
We present our doctors with a list of 1524 patient-asked questions randomly sampled from the publicly available crawl of HealthTap. Each question results in one similar and one different pair through the following instructions provided to the labelers:
The first instruction generates a positive question pair (similar) and the second generates a negative question pair (different). With the above instructions, we intentionally frame the task such that positive question pairs can look very different by superficial metrics, and negative question pairs can conversely look very similar. This ensures that the task is not trivial.
The dataset is formatted as `dr_id, question_1, question_2, label`. We used 11 different doctors for this task, so `dr_id` ranges from 1 to 11. The label is 1 if the question pair is similar and 0 otherwise.
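As a minimal sketch of how the format above could be read and sanity-checked, the snippet below parses rows with Python's standard `csv` module. It assumes the data is a comma-separated file with a header row naming the four columns; the helper `load_pairs` and the sample rows are hypothetical and are not taken from the dataset.

```python
import csv
import io

# Column layout described above.
EXPECTED_COLUMNS = ["dr_id", "question_1", "question_2", "label"]

# Hypothetical sample rows for illustration only; not real dataset entries.
SAMPLE_CSV = """dr_id,question_1,question_2,label
1,Are fibroadenomas malignant?,Can a fibroadenoma be cancerous?,1
1,Are fibroadenomas malignant?,Are fibroadenomas painful?,0
"""


def load_pairs(handle, num_doctors=11):
    """Read question pairs and check dr_id and label ranges.

    Assumes a header row is present (an assumption, not confirmed here).
    """
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    pairs = []
    for row in reader:
        dr_id = int(row["dr_id"])
        label = int(row["label"])
        if not 1 <= dr_id <= num_doctors:
            raise ValueError(f"dr_id out of range: {dr_id}")
        if label not in (0, 1):
            raise ValueError(f"label must be 0 or 1, got {label}")
        pairs.append((dr_id, row["question_1"], row["question_2"], label))
    return pairs


pairs = load_pairs(io.StringIO(SAMPLE_CSV))
similar = [p for p in pairs if p[3] == 1]
different = [p for p in pairs if p[3] == 0]
```

To use this on the actual file, pass an open file handle instead of the `io.StringIO` wrapper, for example `load_pairs(open(path, newline=""))`, where `path` points to wherever the dataset is stored locally.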
The final dataset contains 4567 unique questions. The minimum, maximum, median, and average number of tokens in these questions are 4, 81, 20, and 22.675, respectively, which shows reasonable variance in question length. The shortest question is "Are fibroadenomas malignant?"
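Length statistics of this kind could be computed roughly as follows. The tokenizer used for the reported numbers is not stated, so the regex below (words plus standalone punctuation marks) is an assumption; it does count 4 tokens for the shortest question quoted above. The function names and the second sample question are hypothetical.

```python
import re
import statistics


def tokenize(question):
    # Assumed tokenizer: runs of word characters, plus each punctuation
    # mark as its own token. The tokenizer actually used is not specified.
    return re.findall(r"\w+|[^\w\s]", question)


def length_stats(questions):
    """Token-count statistics over the unique questions."""
    lengths = [len(tokenize(q)) for q in set(questions)]
    return {
        "unique": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "median": statistics.median(lengths),
        "mean": statistics.mean(lengths),
    }


# Illustrative input: the first question is quoted above, the second is made up.
questions = [
    "Are fibroadenomas malignant?",
    "Are fibroadenomas malignant?",
    "Can stress cause a headache that lasts for weeks?",
]
stats = length_stats(questions)
```

On the full dataset, `questions` would be the concatenation of the `question_1` and `question_2` columns, with duplicates removed by the `set` call.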
An off-the-shelf medical entity recognizer finds around 1000 unique medical entities in the questions. Some of the top entity mentions were: physician, pregnancy, pain, lasting weeks, menstruation, emotional state, cancer, visual function, headache, bleeding, fever, sexual intercourse