Visual Question Answering (VQA)

758 papers with code • 62 benchmarks • 112 datasets

Visual Question Answering (VQA) is a computer vision task in which a system answers natural-language questions about an image. The goal is to build models that understand the visual content of an image well enough to answer open-ended questions about it in natural language.

Image Source: visualqa.org
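
To make the task concrete, here is a minimal inference sketch using the Hugging Face transformers visual-question-answering pipeline; the ViLT checkpoint named below is one publicly available model fine-tuned on VQAv2, chosen purely for illustration.

```python
# Minimal VQA inference sketch with the Hugging Face transformers pipeline.
# Assumes transformers and Pillow are installed; the checkpoint is a publicly
# available ViLT model fine-tuned on VQAv2, used here only as an example.
from transformers import pipeline
from PIL import Image

vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")

image = Image.open("example.jpg")          # any RGB image
question = "What color is the umbrella?"   # free-form natural-language question

# The pipeline returns candidate answers with confidence scores.
for prediction in vqa(image=image, question=question, top_k=3):
    print(prediction["answer"], round(prediction["score"], 3))
```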

A Gaze-grounded Visual Question Answering Dataset for Clarifying Ambiguous Japanese Questions

riken-grp/gazevqa 26 Mar 2024

Such ambiguities in questions are often resolved by conversational context, such as joint attention with a user or the user's gaze information.

Intrinsic Subgraph Generation for Interpretable Graph based Visual Question Answering

digitalphonetics/intrinsic-subgraph-generation-for-vqa 26 Mar 2024

In this work, we introduce an interpretable approach for graph-based VQA and demonstrate competitive performance on the GQA dataset.

IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models

csebuetnlp/illusionvqa 23 Mar 2024

GPT-4V, the best-performing VLM, achieves 62.99% accuracy (4-shot) on the comprehension task and 49.7% on the localization task (4-shot and Chain-of-Thought).

MedPromptX: Grounded Multimodal Prompting for Chest X-ray Diagnosis

biomedia-mbzuai/medpromptx 22 Mar 2024

Chest X-ray images are commonly used for predicting acute and chronic cardiopulmonary conditions, but efforts to integrate them with structured clinical data face challenges due to incomplete electronic health records (EHR).

Multi-Agent VQA: Exploring Multi-Agent Foundation Models in Zero-Shot Visual Question Answering

bowen-upenn/Multi-Agent-VQA 21 Mar 2024

This work explores the zero-shot capabilities of foundation models in Visual Question Answering (VQA) tasks.

vid-TLDR: Training Free Token merging for Light-weight Video Transformer

mlvlab/vid-tldr 20 Mar 2024

To tackle these issues, we propose training-free token merging for lightweight video Transformers (vid-TLDR), which aims to enhance the efficiency of video Transformers by merging the background tokens without additional training.
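
As a rough illustration of the general idea (not the vid-TLDR algorithm itself), the sketch below collapses low-saliency tokens into a single averaged background token with no additional training; the saliency scores and keep ratio are assumptions made for the example.

```python
# Illustrative sketch of background-token merging for a video Transformer,
# NOT the vid-TLDR algorithm: tokens with low saliency scores are collapsed
# into one averaged "background" token, requiring no extra training.
import torch

def merge_background_tokens(tokens: torch.Tensor,
                            saliency: torch.Tensor,
                            keep_ratio: float = 0.25) -> torch.Tensor:
    """tokens: (B, N, D) patch/frame tokens; saliency: (B, N) per-token scores."""
    B, N, D = tokens.shape
    k = max(1, int(N * keep_ratio))

    # Keep the k most salient (foreground) tokens per sample.
    keep_idx = saliency.topk(k, dim=1).indices                      # (B, k)
    foreground = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))

    # Average the remaining (background) tokens into one summary token.
    mask = torch.ones(B, N, dtype=torch.bool, device=tokens.device)
    mask.scatter_(1, keep_idx, False)
    background = (tokens * mask.unsqueeze(-1)).sum(1, keepdim=True) / \
                 mask.sum(1, keepdim=True).clamp(min=1).unsqueeze(-1)

    return torch.cat([foreground, background], dim=1)               # (B, k + 1, D)

# Example: 2 clips, 196 tokens each, 768-dim; saliency could come from attention maps.
x = torch.randn(2, 196, 768)
s = torch.rand(2, 196)
print(merge_background_tokens(x, s).shape)  # torch.Size([2, 50, 768])
```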

VL-ICL Bench: The Devil in the Details of Benchmarking Multimodal In-Context Learning

ys-zong/vl-icl 19 Mar 2024

Built on top of LLMs, vision large language models (VLLMs) have advanced significantly in areas such as recognition, reasoning, and grounding.

Adversarial Training with OCR Modality Perturbation for Scene-Text Visual Question Answering

FrankZxShen/ATS 14 Mar 2024

Scene-Text Visual Question Answering (ST-VQA) aims to understand scene text in images and answer questions related to the text content.
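
To give a rough sense of the task (this is not the ATS method proposed in the paper), the sketch below chains an off-the-shelf OCR engine with an extractive QA model; pytesseract and the DistilBERT SQuAD checkpoint are assumptions for illustration, and the sketch ignores the visual layout cues that real ST-VQA models exploit.

```python
# Illustrative ST-VQA baseline sketch (not the paper's ATS method): read the
# scene text with an off-the-shelf OCR engine, then answer the question from
# the recognized text with an extractive QA model.
import pytesseract
from PIL import Image
from transformers import pipeline

image = Image.open("storefront.jpg")

# 1) OCR: recover the text appearing in the scene (real ST-VQA models also
#    exploit token positions and visual context, which this sketch ignores).
scene_text = pytesseract.image_to_string(image).strip()

# 2) Answer the question against the recognized text.
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
result = qa(question="What is the name of the shop?", context=scene_text or "unknown")
print(result["answer"], result["score"])
```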

Multi-modal Auto-regressive Modeling via Visual Words

pengts/vw-lmm 12 Mar 2024

Large Language Models (LLMs), benefiting from the auto-regressive modelling approach performed on massive unannotated text corpora, demonstrate powerful perceptual and reasoning capabilities.
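
For readers unfamiliar with the term, the toy sketch below shows the next-token prediction objective that "auto-regressive modelling" refers to; the tensor shapes are arbitrary and no real model is involved.

```python
# Minimal sketch of the auto-regressive (next-token prediction) objective:
# each position is trained to predict the token that follows it, so raw,
# unannotated text is enough for supervision. Toy tensors only.
import torch
import torch.nn.functional as F

vocab_size, seq_len, batch = 1000, 16, 4
token_ids = torch.randint(vocab_size, (batch, seq_len))   # unannotated text
logits = torch.randn(batch, seq_len, vocab_size)          # stand-in model outputs

# Shift by one: position t predicts token t+1; the last position has no target.
pred = logits[:, :-1, :].reshape(-1, vocab_size)
target = token_ids[:, 1:].reshape(-1)

loss = F.cross_entropy(pred, target)   # standard language-modelling loss
print(loss.item())
```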

TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document

yuliang-liu/monkey 7 Mar 2024

We present TextMonkey, a large multimodal model (LMM) tailored for text-centric tasks.
