ExpMRC: Explainability Evaluation for Machine Reading Comprehension

10 May 2021 · Yiming Cui, Ting Liu, Wanxiang Che, Zhigang Chen, Shijin Wang

Achieving human-level performance on some Machine Reading Comprehension (MRC) datasets is no longer challenging with the help of powerful Pre-trained Language Models (PLMs). However, to further improve an MRC system's reliability, especially for real-life applications, it is necessary to provide both the answer prediction and its explanation. In this paper, we propose a new benchmark called ExpMRC for evaluating the explainability of MRC systems. ExpMRC contains four subsets, including SQuAD, CMRC 2018, RACE$^+$, and C$^3$, with additional annotations of the answer's evidence. MRC systems are required to give not only the correct answer but also its explanation. We use state-of-the-art pre-trained language models to build baseline systems and adopt various unsupervised approaches to extract evidence without a human-annotated training set. The experimental results show that these models are still far from human performance, suggesting that ExpMRC is challenging. Resources will be available through https://github.com/ymcui/expmrc
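ExpMRC scores a prediction with token-level F1 against the reference, reported separately for the answer and the evidence, together with a combined overall F1. Below is a minimal sketch of SQuAD-style token-level F1 and one plausible per-instance combination (the product of answer and evidence F1, averaged over the set); the exact combination rule and tokenization should be checked against the official evaluation script in the repository, and whitespace splitting is an English-only simplification (the Chinese subsets would need character-level tokens).

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """SQuAD-style token-level F1 between two strings."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def expmrc_scores(examples):
    """examples: iterable of dicts with predicted/gold answer and evidence.
    Returns averaged answer, evidence, and overall F1 (as percentages).
    NOTE: combining answer and evidence F1 by a per-instance product is an
    assumption for illustration; the official script may use a different rule."""
    ans, evi, overall = [], [], []
    for ex in examples:
        a = token_f1(ex["pred_answer"], ex["gold_answer"])
        e = token_f1(ex["pred_evidence"], ex["gold_evidence"])
        ans.append(a)
        evi.append(e)
        overall.append(a * e)
    n = len(ans)
    return {
        "answer_f1": 100.0 * sum(ans) / n,
        "evidence_f1": 100.0 * sum(evi) / n,
        "overall_f1": 100.0 * sum(overall) / n,
    }
```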


Datasets

Introduced in the paper: ExpMRC

Used in the paper: SQuAD, HotpotQA, RACE, CMRC, CMRC 2018, C3
Results on the ExpMRC test sets. All values are F1 percentages; the global rank for each metric is given in parentheses.

| Task | Dataset | Model | Answer F1 | Evidence F1 | Overall F1 |
|---|---|---|---|---|---|
| Multi-Choice MRC | ExpMRC - C3 (test) | Human Performance | 94.3 (#1) | 97.7 (#1) | 90.0 (#1) |
| Multi-Choice MRC | ExpMRC - C3 (test) | MacBERT-large + Pseudo-data (single model) | 74.4 (#2) | 59.9 (#2) | 47.3 (#2) |
| Multi-Choice MRC | ExpMRC - C3 (test) | MacBERT-large + MSS w/ Ques. (single model) | 72.0 (#3) | 58.4 (#3) | 46.0 (#3) |
| Multi-Choice MRC | ExpMRC - C3 (test) | MacBERT-base + MSS w/ Ques. (single model) | 66.8 (#5) | 57.4 (#5) | 42.3 (#4) |
| Multi-Choice MRC | ExpMRC - C3 (test) | MacBERT-base + Pseudo-data (single model) | 69.0 (#4) | 57.5 (#4) | 40.6 (#5) |
| Span-Extraction MRC | ExpMRC - CMRC (test) | Human Performance | 97.9 (#1) | 94.6 (#1) | 92.6 (#1) |
| Span-Extraction MRC | ExpMRC - CMRC (test) | MacBERT-large + PA Sent. (single model) | 88.6 (#2) | 70.6 (#3) | 63.3 (#2) |
| Span-Extraction MRC | ExpMRC - CMRC (test) | MacBERT-large + MSS w/ Ques. (single model) | 88.6 (#2) | 71.0 (#2) | 63.2 (#3) |
| Span-Extraction MRC | ExpMRC - CMRC (test) | MacBERT-base + MSS w/ Ques. (single model) | 84.4 (#4) | 69.8 (#4) | 59.9 (#4) |
| Span-Extraction MRC | ExpMRC - CMRC (test) | MacBERT-base + PA Sent. (single model) | 84.4 (#4) | 69.1 (#5) | 59.8 (#5) |
| Multi-Choice MRC | ExpMRC - RACE+ (test) | Human Performance | 93.6 (#1) | 90.5 (#1) | 84.4 (#1) |
| Multi-Choice MRC | ExpMRC - RACE+ (test) | BERT-large + MSS w/ Ques. (single model) | 68.1 (#3) | 42.5 (#3) | 31.3 (#2) |
| Multi-Choice MRC | ExpMRC - RACE+ (test) | BERT-large + Pseudo-data (single model) | 70.4 (#2) | 41.3 (#5) | 30.8 (#3) |
| Multi-Choice MRC | ExpMRC - RACE+ (test) | BERT-base + MSS w/ Ques. (single model) | 59.8 (#5) | 41.8 (#4) | 27.3 (#4) |
| Multi-Choice MRC | ExpMRC - RACE+ (test) | BERT-base + Pseudo-data (single model) | 60.1 (#4) | 43.5 (#2) | 27.1 (#5) |
| Span-Extraction MRC | ExpMRC - SQuAD (test) | Human Performance | 91.3 (#3) | - | 84.7 (#1) |
| Span-Extraction MRC | ExpMRC - SQuAD (test) | BERT-large + PA Sent. (single model) | 92.3 (#1) | 89.6 (#1) | 83.6 (#2) |
| Span-Extraction MRC | ExpMRC - SQuAD (test) | BERT-large + MSS (single model) | 92.3 (#1) | 85.7 (#3) | 80.4 (#3) |
| Span-Extraction MRC | ExpMRC - SQuAD (test) | BERT-base + PA Sent. (single model) | 87.1 (#4) | 89.1 (#2) | 79.6 (#4) |
| Span-Extraction MRC | ExpMRC - SQuAD (test) | BERT-base + MSS (single model) | 87.1 (#4) | 85.4 (#4) | 76.1 (#5) |
