Visual Question Answering (VQA)

758 papers with code • 62 benchmarks • 112 datasets

Visual Question Answering (VQA) is a computer vision task in which a system is given an image and a natural-language question about it and must produce an accurate answer. The goal is to teach machines to understand the content of an image well enough to answer questions about it in natural language.

Image Source: visualqa.org

Libraries

Use these libraries to find Visual Question Answering (VQA) models and implementations
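As a concrete illustration of the task defined above, here is a minimal sketch of running an off-the-shelf VQA model with the Hugging Face Transformers library. The "visual-question-answering" pipeline and the ViLT checkpoint dandelin/vilt-b32-finetuned-vqa are assumptions used for illustration (they are not taken from this page's library list), and the image path is hypothetical.

```python
# Minimal VQA inference sketch using the Hugging Face Transformers pipeline.
# Assumptions: the "visual-question-answering" pipeline task and the ViLT
# checkpoint "dandelin/vilt-b32-finetuned-vqa"; "example.jpg" is a
# hypothetical local image file.
from transformers import pipeline

vqa = pipeline(
    task="visual-question-answering",
    model="dandelin/vilt-b32-finetuned-vqa",
)

# The pipeline takes an image (path, URL, or PIL.Image) and a question,
# and returns candidate answers with confidence scores.
predictions = vqa(
    image="example.jpg",
    question="What is the man holding?",
    top_k=3,
)

for p in predictions:
    print(f"{p['answer']}: {p['score']:.3f}")
```

Classification-style VQA models such as ViLT score answers from a fixed answer vocabulary, so the pipeline returns the top-k candidates with their scores rather than free-form generated text.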

Latest papers with no code

Quantifying and Mitigating Unimodal Biases in Multimodal Large Language Models: A Causal Perspective

no code yet • 27 Mar 2024

Within our framework, we devise a causal graph to elucidate the predictions of MLLMs on VQA problems, and assess the causal effect of biases through an in-depth causal analysis.

Visual Hallucination: Definition, Quantification, and Prescriptive Remediations

no code yet • 26 Mar 2024

The troubling rise of hallucination presents perhaps the most significant impediment to the advancement of responsible AI.

Synthesize Step-by-Step: Tools, Templates and LLMs as Data Generators for Reasoning-Based Chart VQA

no code yet • 25 Mar 2024

In particular, our approach improves the accuracy of the previous state-of-the-art approach from 38% to 54% on the human-written questions in the ChartQA dataset, which require strong reasoning.

AGFSync: Leveraging AI-Generated Feedback for Preference Optimization in Text-to-Image Generation

no code yet • 20 Mar 2024

Text-to-Image (T2I) diffusion models have achieved remarkable success in image generation.

Multi-Modal Hallucination Control by Visual Information Grounding

no code yet • 20 Mar 2024

In particular, we show that as more tokens are generated, the reliance on the visual prompt decreases, and this behavior strongly correlates with the emergence of hallucinations.

WoLF: Wide-scope Large Language Model Framework for CXR Understanding

no code yet • 19 Mar 2024

Previous methods use only CXR reports, which are insufficient for comprehensive Visual Question Answering (VQA), especially when additional health-related data such as medication history and prior diagnoses are needed.

FlexCap: Generating Rich, Localized, and Flexible Captions in Images

no code yet • 18 Mar 2024

The model, FlexCap, is trained to produce length-conditioned captions for input bounding boxes, and this allows control over the information density of its output, with descriptions ranging from concise object labels to detailed captions.

Few-Shot VQA with Frozen LLMs: A Tale of Two Approaches

no code yet • 17 Mar 2024

Two approaches have emerged for providing images as input to large language models (LLMs).

Mitigating Dialogue Hallucination for Large Multi-modal Models via Adversarial Instruction Tuning

no code yet • 15 Mar 2024

To measure this precisely, we first present an evaluation benchmark that extends popular multi-modal benchmark datasets with prepended hallucinatory dialogues generated by our novel Adversarial Question Generator, which automatically produces image-related yet adversarial dialogues by applying adversarial attacks to LMMs.

Few-Shot Image Classification and Segmentation as Visual Question Answering Using Vision-Language Models

no code yet • 15 Mar 2024

By enabling a VLM to interact with off-the-shelf vision models as tools, the proposed method is capable of classifying and segmenting target objects using only image-level labels.