Using Interactive Feedback to Improve the Accuracy and Explainability of Question Answering Systems Post-Deployment

Most research on question answering focuses on the pre-deployment stage, i.e., building an accurate model for deployment. In this paper, we ask the question: Can we improve QA systems further \emph{post-}deployment based on user interactions? We focus on two kinds of improvements: 1) improving the QA system's performance itself, and 2) providing the model with the ability to explain the correctness or incorrectness of an answer. We collect a retrieval-based QA dataset, FeedbackQA, which contains interactive feedback from users. We collect this dataset by deploying a base QA system to crowdworkers, who then engage with the system and provide feedback on the quality of its answers. The feedback contains both structured ratings and unstructured natural language explanations. We train a neural model with this feedback data that can generate explanations and re-score answer candidates. We show that feedback data not only improves the accuracy of the deployed QA system but also that of other, stronger non-deployed systems. The generated explanations also help users make informed decisions about the correctness of answers. Project page: https://mcgill-nlp.github.io/feedbackqa/
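The re-scoring step described above can be pictured as a feedback-trained reranker applied to retrieved answer candidates. The sketch below is a minimal, hypothetical illustration rather than the paper's released code: it assumes a cross-encoder fine-tuned to predict feedback ratings for a (question, candidate answer) pair and re-ranks candidates by their expected rating. The encoder name, the 4-level rating scale, and the expected-rating scoring rule are illustrative assumptions.

```python
# Hypothetical sketch: re-scoring retrieved answer candidates with a
# reranker fine-tuned on structured feedback ratings.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "bert-base-uncased"  # placeholder; any encoder checkpoint works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# Assumption: the classification head has been fine-tuned to predict
# feedback ratings (e.g. bad / could be improved / acceptable / excellent).
reranker = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=4)
reranker.eval()

def rerank(question: str, candidates: list[str]) -> list[tuple[str, float]]:
    """Score each candidate by its expected feedback rating and sort descending."""
    scored = []
    for answer in candidates:
        inputs = tokenizer(question, answer, truncation=True, return_tensors="pt")
        with torch.no_grad():
            logits = reranker(**inputs).logits.squeeze(0)
        probs = torch.softmax(logits, dim=-1)
        # Expected rating: weight each rating level (0..3) by its probability.
        expected_rating = (probs * torch.arange(probs.numel(), dtype=torch.float)).sum()
        scored.append((answer, expected_rating.item()))
    return sorted(scored, key=lambda x: x[1], reverse=True)
```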


Datasets

FeedbackQA

Results from the Paper


Task            Dataset     Model                         Metric     Value   Global Rank
Overall - Test  FeedbackQA  BERT RQA + CombinedReranker   Accuracy   67.97   #1
Overall - Test  FeedbackQA  BERT RQA + FeedbackReranker   Accuracy   66.59   #2
Overall - Test  FeedbackQA  BERT RQA + VanillaReranker    Accuracy   65.98   #3
Overall - Test  FeedbackQA  BERT RQA                      Accuracy   64.75   #4
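
The "CombinedReranker" rows suggest mixing the base retriever's score with the feedback reranker's score. Below is a minimal sketch of one such combination rule; the linear interpolation, the weight value, and the absence of score normalization are assumptions for illustration, not the paper's exact formulation.

```python
# Hypothetical sketch: combine the base QA system's retrieval score with a
# reranker score via simple linear interpolation.
def combined_score(base_score: float, reranker_score: float, alpha: float = 0.5) -> float:
    """Interpolate between the base retriever score and the reranker score."""
    return alpha * base_score + (1.0 - alpha) * reranker_score

# Usage: rank candidates by the combined score.
# Each tuple is (answer text, base retriever score, reranker score) -- made-up values.
candidates = [("answer A", 0.82, 2.7), ("answer B", 0.75, 3.1)]
ranked = sorted(candidates, key=lambda c: combined_score(c[1], c[2]), reverse=True)
print([c[0] for c in ranked])
```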

Methods