Few-Shot Multimodal Explanation for Visual Question Answering

ACM MM 2024  ·  Dizhan Xue, Shengsheng Qian, Changsheng Xu

A key objective in eXplainable Artificial Intelligence (XAI) is to create intelligent systems capable of reasoning over and explaining real-world data to facilitate reliable decision-making. Recent studies have acknowledged the importance of providing user-friendly and verifiable explanations to build trustworthy Visual Question Answering (VQA) systems. This paper aims to advance explainable VQA from both the data and the method perspective. First, we propose a new Standard Multimodal Explanation (SME) dataset and a new Few-Shot Multimodal Explanation for VQA (FS-MEVQA) task, which aims to generate a multimodal explanation of the underlying reasoning process for solving visual questions with few training samples. Our SME dataset includes 1,028,230 samples composed of questions, images, answers, and multimodal explanations, which can facilitate research in both traditional MEVQA and FS-MEVQA. To the best of our knowledge, this is the first large-scale dataset with joint language-vision explanations based on standard English and additional visual grounding tokens. Second, we propose a training-free Multimodal Explaining Agent (MEAgent) method based on an LLM agent with multimodal open-world tools to infer answers and generate multimodal explanations for visual questions. Our MEAgent can learn multimodal explanation from merely N (=16) training samples and leverage open-world abilities to perform FS-MEVQA on test samples. Comprehensive experimental results on our SME dataset, evaluated by language quality metrics, a visual detection metric, and visual attribution metrics, indicate the superiority of our method for FS-MEVQA. Our code and data are available at https://github.com/LivXue/FS-MEVQA.
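The paper does not include implementation details on this page, but the training-free, few-shot setup described above can be sketched as a prompt-assembly step: a handful of (question, answer, explanation) exemplars are concatenated with tool observations for the current image before querying an LLM. The `Exemplar` class, the `[OBJ*]` grounding-token format, and the observation string below are illustrative assumptions, not the authors' actual interface.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Exemplar:
    """One few-shot demonstration (hypothetical schema, not the paper's)."""
    question: str
    answer: str
    explanation: str  # explanation text with visual grounding tokens, e.g. "[OBJ0]"


def build_fewshot_prompt(exemplars: List[Exemplar],
                         question: str,
                         tool_obs: str) -> str:
    """Assemble a few-shot prompt for a training-free agent.

    Each exemplar demonstrates an answer plus a grounded multimodal
    explanation; `tool_obs` stands in for outputs of open-world tools
    (e.g. an object detector or captioner) run on the current image.
    """
    parts = ["Answer the visual question and explain your reasoning, "
             "grounding mentioned objects with [OBJ*] tokens.\n"]
    for ex in exemplars:
        parts.append(f"Q: {ex.question}\n"
                     f"A: {ex.answer}\n"
                     f"Explanation: {ex.explanation}\n")
    parts.append(f"Image observations: {tool_obs}\n"
                 f"Q: {question}\nA:")
    return "\n".join(parts)


# Usage: with N=16 exemplars in practice; one shown here for brevity.
demo = Exemplar("What color is the car?", "red",
                "The car [OBJ0] in the foreground is painted red.")
prompt = build_fewshot_prompt([demo],
                              "What is on the table?",
                              "detected: table, cup, book")
```

The resulting `prompt` would then be sent to an LLM; because no parameters are updated, the method stays training-free as the abstract states.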


Datasets


Introduced in the Paper:

SME

Used in the Paper:

Visual Question Answering
VQA-E

Results from the Paper


Task: FS-MEVQA  ·  Dataset: SME  ·  Model: MEAgent

Metric                  Value    Global Rank
BLEU-4                  67.91    #1
METEOR                  50.55    #1
ROUGE-L                 79.41    #1
CIDEr                   510.44   #1
SPICE                   64.09    #1
Detection               29.09    #1
ACC                     51.45    #1
#Learning Samples (N)   16       #1

Methods


No methods listed for this paper.