RadQA: A Question Answering Dataset to Improve Comprehension of Radiology Reports

We present a radiology question answering dataset, RadQA, with 3074 questions posed against radiology reports and annotated by physicians with their corresponding answer spans (a total of 6148 question-answer evidence pairs). The questions are manually created from the clinical referral section of the reports, which reflects the actual information needs of ordering physicians, eliminates bias from seeing the answer context, and organically yields unanswerable questions. The answer spans are marked within the Findings and Impressions sections of a report. The dataset aims to satisfy complex clinical requirements by including complete (yet concise) answer phrases, not just entities, that can span multiple lines. We conduct a thorough analysis of the proposed dataset by examining the broad categories of annotation disagreement (providing insights into the errors humans make) and the reasoning required to answer a question (uncovering the heavy reliance on medical knowledge). Advanced transformer language models achieve a best F1 score of 63.55 on the test set, whereas the best human performance is 90.31 (with an average of 84.52). This demonstrates the challenging nature of RadQA, which leaves ample scope for future research on methods.
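For reference, the reported Answer F1 is the standard SQuAD-style token-overlap measure between a predicted span and a gold span. Below is a minimal sketch of that computation, assuming the usual SQuAD normalization (lowercasing, stripping punctuation and articles); the exact normalization used for RadQA is an assumption here.

```python
# Sketch of SQuAD-style token-overlap "Answer F1" for extractive QA.
# Normalization details are an assumption following the common SQuAD script.
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, remove punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def answer_f1(prediction: str, gold: str) -> float:
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    # Unanswerable questions: score 1.0 only if both prediction and gold are empty.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Partial overlap earns partial credit:
print(answer_f1("small right pleural effusion", "right pleural effusion"))  # ~0.86
```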


Datasets


Introduced in the Paper:

RadQA

Used in the Paper:

SQuAD, MIMIC-III, emrQA

Results from the Paper


Task: Reading Comprehension
Dataset: RadQA
Model: BERT pretrained on MIMIC-III
Metric: Answer F1
Metric Value: 63.55
Global Rank: #1
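As a usage illustration, the sketch below runs extractive QA over a report excerpt with the Hugging Face question-answering pipeline, in the spirit of the benchmark entry above. The checkpoint name and report text are placeholders, not artifacts released with the paper; substitute a BERT model pretrained on MIMIC-III and fine-tuned on RadQA.

```python
# Minimal extractive QA sketch; the model path is a hypothetical placeholder.
from transformers import pipeline

qa = pipeline("question-answering", model="path/to/mimic-bert-finetuned-radqa")

findings = (
    "FINDINGS: There is a small right pleural effusion. "
    "No evidence of pneumothorax. The cardiac silhouette is normal in size."
)
result = qa(question="Is there a pleural effusion?", context=findings)
print(result["answer"], result["score"])
```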

Methods


No methods listed for this paper.