Visual Question Answering

324 papers with code • 42 benchmarks • 71 datasets

Visual Question Answering (VQA) is a multimodal task that aims to answer natural-language questions about the content of an image.

Image Source: visualqa.org

Greatest papers with code

Alignment Attention by Matching Key and Query Distributions

huggingface/transformers NeurIPS 2021

The neural attention mechanism has been incorporated into deep neural networks to achieve state-of-the-art performance in various domains.

Graph Attention Language Understanding +2
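For context on the key/query language in this paper's title, below is a minimal sketch of standard scaled dot-product attention. The paper's actual contribution, a regularizer that aligns the distributions of keys and queries, is not reproduced here.

```python
import torch

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (batch, seq, dim) query, key, and value projections.
    scores = Q @ K.transpose(-2, -1) / K.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ V
```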

LXMERT: Learning Cross-Modality Encoder Representations from Transformers

huggingface/transformers IJCNLP 2019

In LXMERT, we build a large-scale Transformer model that consists of three encoders: an object relationship encoder, a language encoder, and a cross-modality encoder.

Fine-tuning Language Modelling +3
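A short usage sketch with the LXMERT classes in huggingface/transformers. LXMERT expects precomputed region features from an external object detector (typically Faster R-CNN); the random tensors below are stand-ins for those, and the checkpoint name follows the released unc-nlp weights.

```python
import torch
from transformers import LxmertTokenizer, LxmertModel

tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-base-uncased")
model = LxmertModel.from_pretrained("unc-nlp/lxmert-base-uncased")

inputs = tokenizer("What color is the cat?", return_tensors="pt")
visual_feats = torch.randn(1, 36, 2048)  # stand-in for detector region features
visual_pos = torch.rand(1, 36, 4)        # stand-in for normalized box coordinates

out = model(**inputs, visual_feats=visual_feats, visual_pos=visual_pos)
print(out.pooled_output.shape)  # cross-modality [CLS] representation
```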

ParlAI: A Dialog Research Software Platform

facebookresearch/ParlAI EMNLP 2017

We introduce ParlAI (pronounced "par-lay"), an open-source software platform for dialog research implemented in Python, available at http://parl.ai.

Visual Question Answering
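A minimal sketch of ParlAI's agent/world loop, following the pattern in the project's documentation. The RepeatLabelAgent simply echoes labels and is used here only to page through data; the vqa_v2 task id is an assumption, and any registered task can be substituted.

```python
from parlai.core.params import ParlaiParser
from parlai.core.worlds import create_task
from parlai.agents.repeat_label.repeat_label import RepeatLabelAgent

parser = ParlaiParser()
opt = parser.parse_args(["--task", "vqa_v2"])  # task id is an assumption
agent = RepeatLabelAgent(opt)
world = create_task(opt, agent)

# Step through a few examples and print them.
for _ in range(5):
    world.parley()
    print(world.display())
```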

Hadamard Product for Low-rank Bilinear Pooling

facebookresearch/ParlAI 14 Oct 2016

Bilinear models provide rich representations compared with linear models.

Visual Question Answering
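A minimal PyTorch sketch of the low-rank bilinear pooling the paper proposes: two low-rank projections fused by a Hadamard (elementwise) product instead of a full bilinear tensor. The dimensions below are illustrative.

```python
import torch
import torch.nn as nn

class LowRankBilinearPooling(nn.Module):
    """f = P^T( tanh(U^T x) * tanh(V^T y) ): the Hadamard product replaces
    the full bilinear interaction tensor with two low-rank projections."""
    def __init__(self, x_dim, y_dim, rank, out_dim):
        super().__init__()
        self.U = nn.Linear(x_dim, rank)
        self.V = nn.Linear(y_dim, rank)
        self.P = nn.Linear(rank, out_dim)

    def forward(self, x, y):
        return self.P(torch.tanh(self.U(x)) * torch.tanh(self.V(y)))

fuse = LowRankBilinearPooling(2048, 1024, 512, 1000)
f = fuse(torch.randn(8, 2048), torch.randn(8, 1024))  # (8, 1000) fused features
```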

Ludwig: a type-based declarative deep learning toolbox

uber/ludwig 17 Sep 2019

In this work we present Ludwig, a flexible, extensible and easy-to-use toolbox that allows users to train deep learning models and use them to obtain predictions without writing code.

Image Captioning Image Classification +12
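A hedged sketch of driving Ludwig from Python with a declarative config for a VQA-style dataset. The column names and CSV file are hypothetical, and the config-dict API shown here follows later Ludwig releases (the 2019 release used a model_definition argument instead).

```python
from ludwig.api import LudwigModel

# Declarative model spec: columns are declared by name and type only.
config = {
    "input_features": [
        {"name": "image_path", "type": "image"},
        {"name": "question", "type": "text"},
    ],
    "output_features": [{"name": "answer", "type": "category"}],
}

model = LudwigModel(config)
results = model.train(dataset="vqa_train.csv")  # hypothetical CSV file
```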

Towards VQA Models That Can Read

facebookresearch/pythia CVPR 2019

We show that LoRRA outperforms existing state-of-the-art VQA models on our TextVQA dataset.

Visual Question Answering

Pythia v0.1: the Winning Entry to the VQA Challenge 2018

facebookresearch/pythia 26 Jul 2018

We demonstrate that by making subtle but important changes to the model architecture and the learning rate schedule, fine-tuning image features, and adding data augmentation, we can significantly improve the performance of the up-down model on the VQA v2.0 dataset -- from 65.67% to 70.22%.

Data Augmentation Fine-tuning +1
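As one illustration of "changes to the learning rate schedule", here is a warmup-then-step-decay schedule in PyTorch. This is a sketch in the spirit of the paper's tuning, not its exact recipe: the optimizer choice, warmup length, and decay breakpoints are assumptions.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(10, 10)  # stand-in for the VQA model
optimizer = torch.optim.Adamax(model.parameters(), lr=2e-3)

def warmup_then_decay(step, warmup_steps=1000, decay_every=10000):
    # Linear warmup, then multiplicative step decay; breakpoints illustrative.
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    return 0.1 ** (step // decay_every)

scheduler = LambdaLR(optimizer, warmup_then_decay)
```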

Bilinear Attention Networks

facebookresearch/pythia NeurIPS 2018

In this paper, we propose bilinear attention networks (BAN) that find bilinear attention distributions to utilize given vision-language information seamlessly.

Visual Question Answering
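A minimal sketch of the core idea: a bilinear attention map that scores every image-region/question-word pair through a low-rank bilinear form. The nonlinearity and normalization details are simplified relative to the paper.

```python
import torch
import torch.nn as nn

class BilinearAttentionMap(nn.Module):
    """Low-rank bilinear logits between all image regions and question words."""
    def __init__(self, x_dim, y_dim, rank):
        super().__init__()
        self.U = nn.Linear(x_dim, rank)
        self.V = nn.Linear(y_dim, rank)
        self.p = nn.Linear(rank, 1)

    def forward(self, X, Y):
        # X: (B, m, x_dim) region features; Y: (B, n, y_dim) word features
        hx = torch.relu(self.U(X)).unsqueeze(2)   # (B, m, 1, rank)
        hy = torch.relu(self.V(Y)).unsqueeze(1)   # (B, 1, n, rank)
        logits = self.p(hx * hy).squeeze(-1)      # (B, m, n)
        # Normalize over all region-word pairs into one attention distribution.
        return torch.softmax(logits.flatten(1), dim=1).view_as(logits)
```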

Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

facebookresearch/pythia CVPR 2018

Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning.

Image Captioning Visual Question Answering
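A minimal sketch of the top-down half: question-conditioned softmax weights over bottom-up region features from an object detector. The paper uses gated tanh units; plain tanh is used here for brevity.

```python
import torch
import torch.nn as nn

class TopDownAttention(nn.Module):
    """Weight bottom-up region features by their relevance to the question."""
    def __init__(self, v_dim, q_dim, hidden):
        super().__init__()
        self.v_proj = nn.Linear(v_dim, hidden)
        self.q_proj = nn.Linear(q_dim, hidden)
        self.score = nn.Linear(hidden, 1)

    def forward(self, V, q):
        # V: (B, k, v_dim) detector region features; q: (B, q_dim) question encoding
        joint = torch.tanh(self.v_proj(V) + self.q_proj(q).unsqueeze(1))
        alpha = torch.softmax(self.score(joint).squeeze(-1), dim=1)  # (B, k)
        return (alpha.unsqueeze(-1) * V).sum(dim=1)  # attended image feature
```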

MoVie: Revisiting Modulated Convolutions for Visual Counting and Beyond

facebookresearch/mmf ICLR 2021

This paper focuses on visual counting, which aims to predict the number of occurrences given a natural image and a query (e.g., a question or a category).

Object Counting Question Answering +1
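A FiLM-style sketch of a modulated convolution, in the spirit of what MoVie revisits: the query embedding produces per-channel scales and shifts for convolutional feature maps. The exact module design in the paper differs.

```python
import torch
import torch.nn as nn

class ModulatedConvBlock(nn.Module):
    """FiLM-style modulation: the query scales and shifts conv feature maps."""
    def __init__(self, channels, q_dim):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.gamma = nn.Linear(q_dim, channels)
        self.beta = nn.Linear(q_dim, channels)

    def forward(self, x, q):
        # x: (B, C, H, W) image features; q: (B, q_dim) query embedding
        h = self.conv(x)
        g = self.gamma(q)[:, :, None, None]
        b = self.beta(q)[:, :, None, None]
        return torch.relu(h * (1 + g) + b)
```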