To explain the predicted answers and evaluate the reasoning abilities of models, several studies have utilized underlying reasoning (UR) tasks in multi-hop question answering (QA) datasets.
Other results reveal that our probing questions help improve model performance on the main QA task (e.g., by +10.3 F1) and that our dataset can be used for data augmentation to improve model robustness.
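As a minimal sketch of this augmentation idea (not the paper's exact procedure), the snippet below adds each probing sub-question as an extra training example alongside the original multi-hop question; the field names (`question`, `answer`, `context`, `probing_questions`) are assumptions for illustration.

```python
# Hedged sketch: augment a multi-hop QA training set with probing
# (sub-question) examples. All field names are hypothetical.

def augment_with_probing(examples):
    """Return the original examples plus one example per probing question."""
    augmented = list(examples)
    for ex in examples:
        for pq in ex.get("probing_questions", []):
            augmented.append({
                "question": pq["question"],  # single-hop probing question
                "answer": pq["answer"],      # its gold answer
                "context": ex["context"],    # reuse the original context
            })
    return augmented

if __name__ == "__main__":
    data = [{
        "question": "Where was the director of Inception born?",
        "answer": "London",
        "context": "...",
        "probing_questions": [
            {"question": "Who directed Inception?", "answer": "Christopher Nolan"},
            {"question": "Where was Christopher Nolan born?", "answer": "London"},
        ],
    }]
    print(len(augment_with_probing(data)))  # 3: original + 2 probing examples
```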
The issue of shortcut learning is widely known in NLP and has been an important research focus in recent years.
Evidence information has two benefits: (i) it provides a comprehensive explanation for predictions, and (ii) it supports evaluation of a model's reasoning skills.