CLEVR-X: A Visual Reasoning Dataset for Natural Language Explanations
Providing explanations in the context of Visual Question Answering (VQA) presents a fundamental problem in machine learning. To obtain detailed insights into the process of generating natural language explanations for VQA, we introduce the large-scale CLEVR-X dataset that extends the CLEVR dataset with natural language explanations. For each image-question pair in the CLEVR dataset, CLEVR-X contains multiple structured textual explanations which are derived from the original scene graphs. By construction, the CLEVR-X explanations are correct and describe the reasoning and visual information that is necessary to answer a given question. We conducted a user study to confirm that the ground-truth explanations in our proposed dataset are indeed complete and relevant. We present baseline results for generating natural language explanations in the context of VQA using two state-of-the-art frameworks on the CLEVR-X dataset. Furthermore, we provide a detailed analysis of the explanation generation quality for different question and answer types. Additionally, we study the influence of using different numbers of ground-truth explanations on the convergence of natural language generation (NLG) metrics. The CLEVR-X dataset is publicly available at \url{https://explainableml.github.io/CLEVR-X/}.
Datasets

Introduced in the paper: CLEVR-X
Used in the paper: Visual Question Answering, CLEVR, Visual Question Answering v2.0, SNLI-VE, VQA-E, e-SNLI-VE

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Explanation Generation | CLEVR-X | PJ-X | B4 (BLEU-4) | 87.4 | # 1 |
| Explanation Generation | CLEVR-X | PJ-X | M (METEOR) | 58.9 | # 1 |
| Explanation Generation | CLEVR-X | PJ-X | RL (ROUGE-L) | 93.4 | # 1 |
| Explanation Generation | CLEVR-X | PJ-X | C (CIDEr) | 639.8 | # 1 |
| Explanation Generation | CLEVR-X | PJ-X | Acc (Accuracy) | 63.0 | # 2 |
| Explanation Generation | CLEVR-X | FM | B4 (BLEU-4) | 78.8 | # 2 |
| Explanation Generation | CLEVR-X | FM | M (METEOR) | 52.5 | # 2 |
| Explanation Generation | CLEVR-X | FM | RL (ROUGE-L) | 85.8 | # 2 |
| Explanation Generation | CLEVR-X | FM | C (CIDEr) | 566.8 | # 2 |
| Explanation Generation | CLEVR-X | FM | Acc (Accuracy) | 80.3 | # 1 |
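The B4, M, RL, and C scores above are standard natural language generation metrics (BLEU-4, METEOR, ROUGE-L, CIDEr), evaluated against the multiple ground-truth explanations that CLEVR-X provides per image-question pair. As an illustration of how multi-reference evaluation works, here is a minimal pure-Python sketch of multi-reference BLEU: n-gram counts are clipped against the maximum count of each n-gram over all references, so a hypothesis only needs to match well against one of them. This is a simplified sketch, not the paper's evaluation code; the `bleu` helper and the example sentences are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis, references, max_n=4):
    """Multi-reference BLEU (simplified sentence-level sketch).

    Each hypothesis n-gram count is clipped by the maximum count of that
    n-gram over all references; the brevity penalty uses the reference
    length closest to the hypothesis length.
    """
    hyp = hypothesis.split()
    refs = [r.split() for r in references]
    log_prec = 0.0
    for n in range(1, max_n + 1):
        hyp_ngrams = ngrams(hyp, n)
        if not hyp_ngrams:
            return 0.0  # hypothesis too short for this n-gram order
        # clip each hypothesis n-gram by its max count in any reference
        max_ref = Counter()
        for r in refs:
            for g, c in ngrams(r, n).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in hyp_ngrams.items())
        if clipped == 0:
            return 0.0
        log_prec += math.log(clipped / sum(hyp_ngrams.values()))
    # brevity penalty against the closest-length reference
    ref_len = min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
    bp = 1.0 if len(hyp) >= ref_len else math.exp(1 - ref_len / len(hyp))
    return bp * math.exp(log_prec / max_n)


# An exact match with any single reference scores 1.0:
# bleu("the large gray metal cube", ["the large gray metal cube"]) -> 1.0
```

Because the hypothesis is scored against the best-matching parts of all references at once, adding further ground-truth explanations typically gives partial matches more credit, which is one reason the number of references influences how NLG metrics behave.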