Visual Question Answering (VQA)

726 papers with code • 61 benchmarks • 110 datasets

Visual Question Answering (VQA) is a task in computer vision that involves answering questions about an image. The goal of VQA is to teach machines to understand the content of an image and answer questions about it in natural language.

Most implemented papers

Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization

ramprs/grad-cam ICCV 2017

For captioning and VQA, we show that even non-attention based models can localize inputs.

Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

peteanderson80/bottom-up-attention CVPR 2018

Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning.

ParlAI: A Dialog Research Software Platform

facebookresearch/ParlAI EMNLP 2017

We introduce ParlAI (pronounced "par-lay"), an open-source software platform for dialog research implemented in Python, available at http://parl.ai.

VQA: Visual Question Answering

ramprs/grad-cam ICCV 2015

Given an image and a natural language question about the image, the task is to provide an accurate natural language answer.

A simple neural network module for relational reasoning

kimhc6028/relational-networks NeurIPS 2017

Relational reasoning is a central component of generally intelligent behavior, but has proven difficult for neural networks to learn.

Stacked Attention Networks for Image Question Answering

zcyang/imageqa-san CVPR 2016

Thus, we develop a multiple-layer SAN in which we query an image multiple times to infer the answer progressively.

Show, Ask, Attend, and Answer: A Strong Baseline For Visual Question Answering

Cyanogenoid/pytorch-vqa 11 Apr 2017

This paper presents a new baseline for the visual question answering task.

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

salesforce/lavis 30 Jan 2023

The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models.

Dynamic Memory Networks for Visual and Textual Question Answering

therne/dmn-tensorflow 4 Mar 2016

Neural network architectures with memory and attention mechanisms exhibit certain reasoning capabilities required for question answering.

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

facebookresearch/vilbert-multi-task NeurIPS 2019

We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language.