FS-MEVQA

7 papers with code • 1 benchmark • 1 dataset

The Few-Shot Multimodal Explanation for Visual Question Answering (FS-MEVQA) task aims to generate multimodal explanations of the reasoning behind answers to visual questions, learning from only a few training samples.
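The sketch below is a minimal, hypothetical illustration of what a few-shot MEVQA setup could look like: a small support set of annotated samples assembled into a text prompt for a vision-language model. The field names (image_path, grounded_boxes, etc.) and the prompt format are illustrative assumptions, not the SME dataset schema or any listed paper's method.

# Illustrative sketch only: hypothetical FS-MEVQA sample and few-shot prompt.
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class MEVQASample:
    image_path: str                       # image being questioned
    question: str                         # natural-language visual question
    answer: str                           # ground-truth answer
    explanation: str                      # textual explanation of the reasoning
    grounded_boxes: List[Tuple[int, int, int, int]] = field(default_factory=list)
    # bounding boxes linking explanation phrases to image regions (assumed format)


def build_few_shot_prompt(support: List[MEVQASample], query_question: str) -> str:
    """Concatenate a handful of support samples into a text prompt, then
    append the unanswered query question for the model to complete."""
    parts = []
    for s in support:
        parts.append(
            f"Question: {s.question}\nAnswer: {s.answer}\nExplanation: {s.explanation}"
        )
    parts.append(f"Question: {query_question}\nAnswer:")
    return "\n\n".join(parts)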

Datasets

SME (Standard Multimodal Explanation)

Most implemented papers

GPT-4 Technical Report

openai/evals Preprint 2023

We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs.

CogVLM: Visual Expert for Pretrained Language Models

thudm/cogvlm 6 Nov 2023

We introduce CogVLM, a powerful open-source visual language foundation model.

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

qwenlm/qwen-vl 24 Aug 2023

In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both texts and images.

REX: Reasoning-aware and Grounded Explanation

szzexpoi/rex CVPR 2022

Finally, with our new data and method, we perform extensive analyses to study the effectiveness of our explanation under different settings, including multi-task learning and transfer learning.

Variational Causal Inference Network for Explanatory Visual Question Answering

LivXue/VCIN ICCV 2023

To address these issues, we propose a Variational Causal Inference Network (VCIN) that establishes the causal correlation between predicted answers and explanations, and captures cross-modal relationships to generate rational explanations.

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

dlvuldet/primevul 8 Mar 2024

In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio.

Few-Shot Multimodal Explanation for Visual Question Answering

LivXue/FS-MEVQA ACM MM 2024

First, we propose a new Standard Multimodal Explanation (SME) dataset and a new Few-Shot Multimodal Explanation for VQA (FS-MEVQA) task, which aims to generate the multimodal explanation of the underlying reasoning process for solving visual questions with few training samples.