Visual Question Answering (VQA)
757 papers with code • 62 benchmarks • 112 datasets
Visual Question Answering (VQA) is a computer vision task in which a machine answers natural-language questions about an image, requiring it to jointly understand visual content and language.
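VQA systems are typically scored with the accuracy metric from visualqa.org, which credits an answer in proportion to how many of the ten human annotators gave it. Below is a minimal sketch of the commonly quoted per-answer form, min(#matches / 3, 1); the official evaluation additionally averages this over annotator subsets and normalizes answer strings, which is omitted here.

```python
def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    """Simplified VQA accuracy: an answer counts as fully correct if at
    least 3 of the (usually 10) human annotators gave it."""
    matches = sum(ans == prediction for ans in human_answers)
    return min(matches / 3.0, 1.0)

# A prediction matching 4 of 10 annotators scores 1.0; matching only 1
# scores 1/3; matching none scores 0.0.
score = vqa_accuracy("2", ["2"] * 4 + ["3"] * 6)
```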
Image Source: visualqa.org
Latest papers
ViTextVQA: A Large-Scale Visual Question Answering Dataset for Evaluating Vietnamese Text Comprehension in Images
Visual Question Answering (VQA) is a complicated task that requires the capability of simultaneously processing natural language and images.
MoE-TinyMed: Mixture of Experts for Tiny Medical Large Vision-Language Models
Mixture of Expert Tuning (MoE-Tuning) has effectively enhanced the performance of general MLLMs with fewer parameters, yet its application in resource-limited medical settings has not been fully explored.
Enhancing Visual Question Answering through Question-Driven Image Captions as Prompts
This study explores the impact of incorporating image captioning as an intermediary process within the VQA pipeline.
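The caption-as-intermediary idea can be sketched as a two-stage pipeline: first describe the image in text, then let a language model answer from that description. The sketch below uses hypothetical stub functions (`generate_caption`, `language_model`) as stand-ins for a real captioner and LLM; it only illustrates the data flow, not the paper's specific method.

```python
def generate_caption(image_path: str) -> str:
    # Stand-in for a real image captioner; a real system would
    # produce a description of the image at image_path.
    return "a brown dog sitting on a red couch"

def language_model(prompt: str) -> str:
    # Stand-in for a real LLM answering from textual context only.
    return "on a couch" if "Where" in prompt else "unknown"

def caption_then_answer(image_path: str, question: str) -> str:
    # The caption is injected into the prompt as intermediate context,
    # turning the visual question into a text-only one.
    caption = generate_caption(image_path)
    prompt = f"Caption: {caption}\nQuestion: {question}\nAnswer:"
    return language_model(prompt)
```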
OmniFusion Technical Report
We propose an OmniFusion model based on a pretrained LLM and adapters for visual modality.
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
However, existing LLM-based large multimodal models (e.g., Video-LLaMA, VideoChat) can only take in a limited number of frames for short video understanding.
Joint Visual and Text Prompting for Improved Object-Centric Perception with Multimodal Large Language Models
In this paper, we present a novel approach, Joint Visual and Text Prompting (VTPrompt), that employs fine-grained visual information to enhance the capability of MLLMs in VQA, especially for object-oriented perception.
Evaluating Text-to-Visual Generation with Image-to-Text Generation
For instance, the widely-used CLIPScore measures the alignment between a (generated) image and text prompt, but it fails to produce reliable scores for complex prompts involving compositions of objects, attributes, and relations.
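CLIPScore, as introduced by Hessel et al. (2021), is a rescaled cosine similarity between CLIP's image and text embeddings, clipped at zero. A minimal sketch of the formula, assuming precomputed embedding vectors and the paper's rescaling weight w = 2.5:

```python
import numpy as np

def clip_score(image_emb: np.ndarray, text_emb: np.ndarray,
               w: float = 2.5) -> float:
    """CLIPScore: max(w * cos(image_emb, text_emb), 0), computed on
    precomputed CLIP embeddings of the image and the text prompt."""
    cos = float(np.dot(image_emb, text_emb)
                / (np.linalg.norm(image_emb) * np.linalg.norm(text_emb)))
    return max(w * cos, 0.0)
```

Because the score is a single global similarity, it cannot tell apart prompts that share the same objects but differ in attributes or relations, which is the failure mode the excerpt describes.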
Unsolvable Problem Detection: Evaluating Trustworthiness of Vision Language Models
This paper introduces a novel and significant challenge for Vision Language Models (VLMs), termed Unsolvable Problem Detection (UPD).
A Gaze-grounded Visual Question Answering Dataset for Clarifying Ambiguous Japanese Questions
Such ambiguities in questions are often clarified by the contexts in conversational situations, such as joint attention with a user or user gaze information.
Intrinsic Subgraph Generation for Interpretable Graph based Visual Question Answering
In this work, we introduce an interpretable approach for graph-based VQA and demonstrate competitive performance on the GQA dataset.