Visual Question Answering

680 papers with code • 20 benchmarks • 21 datasets

Visual Question Answering (VQA) is the task of answering open-ended, natural-language questions about the content of an image. It requires jointly reasoning over visual and textual inputs, typically by fusing image features with a question representation to predict an answer.


Most implemented papers

Hierarchical Question-Image Co-Attention for Visual Question Answering

jiasenlu/HieCoAttenVQA NeurIPS 2016

In addition, our model reasons about the question (and consequently the image, via the co-attention mechanism) in a hierarchical fashion using novel 1-dimensional convolutional neural networks (CNNs).
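
The phrase-level stage of that hierarchy is simple to sketch. Below is a minimal PyTorch illustration (not the authors' code): 1-D convolutions with window sizes 1, 2, and 3 slide over the word embeddings, and an element-wise max pools across the three window sizes; all dimensions are illustrative.

```python
# Minimal sketch of the phrase-level question encoding from hierarchical
# co-attention: n-gram 1-D convolutions, max-pooled across window sizes.
import torch
import torch.nn as nn

class PhraseLevelEncoder(nn.Module):
    def __init__(self, embed_dim=512):
        super().__init__()
        self.unigram = nn.Conv1d(embed_dim, embed_dim, kernel_size=1)
        self.bigram  = nn.Conv1d(embed_dim, embed_dim, kernel_size=2, padding=1)
        self.trigram = nn.Conv1d(embed_dim, embed_dim, kernel_size=3, padding=1)

    def forward(self, words):            # words: (batch, seq_len, embed_dim)
        x = words.transpose(1, 2)        # Conv1d expects (batch, channels, seq_len)
        u = self.unigram(x)
        b = self.bigram(x)[:, :, :x.size(2)]  # trim the extra padded position
        t = self.trigram(x)
        # element-wise max over the three n-gram responses at each position
        phrase = torch.max(torch.max(u, b), t)
        return torch.tanh(phrase).transpose(1, 2)

enc = PhraseLevelEncoder()
q = torch.randn(4, 12, 512)              # a batch of 12-word questions
print(enc(q).shape)                      # torch.Size([4, 12, 512])
```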

LXMERT: Learning Cross-Modality Encoder Representations from Transformers

airsplay/lxmert IJCNLP 2019

In LXMERT, we build a large-scale Transformer model that consists of three encoders: an object relationship encoder, a language encoder, and a cross-modality encoder.
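
A toy sketch of that three-encoder layout, built from stock PyTorch transformer modules rather than the released airsplay/lxmert code (layer counts and dimensions here are illustrative):

```python
# Minimal sketch of LXMERT's design: a language encoder, an
# object-relationship encoder, and a cross-modality encoder in which
# each stream attends to the other.
import torch
import torch.nn as nn

class TinyLXMERT(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.lang_enc = nn.TransformerEncoder(make_layer(), num_layers=2)  # language encoder
        self.obj_enc  = nn.TransformerEncoder(make_layer(), num_layers=2)  # object-relationship encoder
        self.cross_l2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_v2l = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, words, regions):   # (B, Tw, dim), (B, Tr, dim)
        l = self.lang_enc(words)
        v = self.obj_enc(regions)
        # cross-modality: language queries attend to vision, and vice versa
        l2, _ = self.cross_l2v(l, v, v)
        v2, _ = self.cross_v2l(v, l, l)
        return l + l2, v + v2

model = TinyLXMERT()
out_l, out_v = model(torch.randn(2, 16, 768), torch.randn(2, 36, 768))
print(out_l.shape, out_v.shape)          # (2, 16, 768) (2, 36, 768)
```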

GPT-4 Technical Report

openai/evals Preprint 2023

We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs.

Visual Instruction Tuning

haotian-liu/LLaVA NeurIPS 2023

Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field.
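
For context, LLaVA-style training data pairs an image with a machine-generated multi-turn conversation. The record below is a hypothetical example in that style; the field names follow the common conversation format but are illustrative, not the exact released schema.

```python
# A hypothetical visual instruction-tuning record: one image plus a
# machine-generated instruction-following dialogue about it.
sample = {
    "image": "coco/train2017/000000123456.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is the man in the photo doing?"},
        {"from": "gpt",   "value": "He is riding a bicycle along a tree-lined street."},
        {"from": "human", "value": "Is it daytime?"},
        {"from": "gpt",   "value": "Yes, the scene is brightly lit by daylight."},
    ],
}
```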

Hadamard Product for Low-rank Bilinear Pooling

jnhwkim/MulLowBiVQA 14 Oct 2016

Bilinear models provide rich representations compared with linear models.
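
The paper's key trick is to approximate a full bilinear interaction with a Hadamard (element-wise) product in a shared low-rank space, avoiding the huge weight tensor of an explicit bilinear map. A minimal sketch, with illustrative dimensions:

```python
# Minimal sketch of low-rank bilinear pooling via the Hadamard product:
# project both modalities to a shared rank-d space, multiply element-wise,
# then project to the output space.
import torch
import torch.nn as nn

class MLBFusion(nn.Module):
    def __init__(self, q_dim=2400, v_dim=2048, rank=1200, out_dim=1000):
        super().__init__()
        self.U = nn.Linear(q_dim, rank)    # question projection
        self.V = nn.Linear(v_dim, rank)    # image projection
        self.P = nn.Linear(rank, out_dim)  # output projection

    def forward(self, q, v):
        # Hadamard product in the low-rank space replaces the full bilinear tensor
        joint = torch.tanh(self.U(q)) * torch.tanh(self.V(v))
        return self.P(joint)

fusion = MLBFusion()
logits = fusion(torch.randn(8, 2400), torch.randn(8, 2048))
print(logits.shape)                        # torch.Size([8, 1000])
```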

Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments

peteanderson80/Matterport3DSimulator CVPR 2018

This is significant because a robot interpreting a natural-language navigation instruction on the basis of what it sees is carrying out a vision-and-language process similar to Visual Question Answering.

Bilinear Attention Networks

jnhwkim/ban-vqa NeurIPS 2018

In this paper, we propose bilinear attention networks (BAN) that learn bilinear attention distributions to seamlessly exploit the given vision-language information.
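
A minimal sketch of a single bilinear attention map in that spirit (not the released jnhwkim/ban-vqa code): every (image region, question word) pair is scored by a low-rank bilinear form, the scores are normalized over all pairs, and the map is used to pool a joint feature.

```python
# Minimal sketch of one bilinear attention map: pairwise region-word
# scores from a low-rank bilinear form, softmax over all pairs, then
# attention-weighted bilinear pooling of a joint feature.
import torch
import torch.nn as nn

class BilinearAttention(nn.Module):
    def __init__(self, v_dim=2048, q_dim=1024, rank=512):
        super().__init__()
        self.Uv = nn.Linear(v_dim, rank)
        self.Uq = nn.Linear(q_dim, rank)
        self.p  = nn.Linear(rank, 1)

    def forward(self, V, Q):               # V: (B, Nv, v_dim), Q: (B, Nq, q_dim)
        v = self.Uv(V).unsqueeze(2)         # (B, Nv, 1, rank)
        q = self.Uq(Q).unsqueeze(1)         # (B, 1, Nq, rank)
        logits = self.p(v * q).squeeze(-1)  # pair scores, (B, Nv, Nq)
        A = torch.softmax(logits.flatten(1), dim=1).view_as(logits)
        # joint feature: weighted sum over every (region, word) pair
        joint = torch.einsum('bij,bir,bjr->br', A, self.Uv(V), self.Uq(Q))
        return joint, A

att = BilinearAttention()
joint, A = att(torch.randn(2, 36, 2048), torch.randn(2, 14, 1024))
print(joint.shape, A.shape)                # (2, 512) (2, 36, 14)
```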

Simple Baseline for Visual Question Answering

metalbubble/VQAbaseline 7 Dec 2015

We describe a very simple bag-of-words baseline for visual question answering.
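
The baseline (often called iBOWIMG) concatenates a bag-of-words question vector with precomputed CNN image features and applies a single linear softmax classifier. A minimal sketch with illustrative sizes:

```python
# Minimal sketch of the bag-of-words VQA baseline: summed word embeddings
# concatenated with CNN image features, classified by one linear layer.
import torch
import torch.nn as nn

class BoWBaseline(nn.Module):
    def __init__(self, vocab_size=10000, word_dim=300, img_dim=4096, n_answers=3000):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, word_dim, mode='sum')  # bag of words
        self.classify = nn.Linear(word_dim + img_dim, n_answers)

    def forward(self, question_ids, img_feat):
        q = self.embed(question_ids)        # summed word embeddings, (B, word_dim)
        return self.classify(torch.cat([q, img_feat], dim=1))

model = BoWBaseline()
ids = torch.randint(0, 10000, (8, 12))      # a batch of 12-token questions
logits = model(ids, torch.randn(8, 4096))   # precomputed 4096-d CNN features
print(logits.shape)                         # torch.Size([8, 3000])
```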

Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering

necla-ml/SNLI-VE CVPR 2017

We propose to counter these language priors for the task of Visual Question Answering (VQA) and make vision (the V in VQA) matter!

Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning

sea-snell/implicit-language-q-learning ICCV 2017

Specifically, we pose a cooperative 'image guessing' game between two agents -- Qbot and Abot -- who communicate in natural language dialog so that Qbot can select an unseen image from a lineup of images.