Visual Question Answering
922 papers with code • 27 benchmarks • 31 datasets
Most implemented papers
Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization
For captioning and VQA, we show that even non-attention-based models can localize inputs.
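As a rough illustration of the idea, the sketch below computes a Grad-CAM heatmap from a torchvision ResNet by weighting the last convolutional feature maps with the pooled gradients of a class score. The layer choice, input, and preprocessing are illustrative assumptions, not the paper's exact setup.

```python
# Minimal Grad-CAM sketch (layer choice and input are illustrative).
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet50(weights=None).eval()   # swap in pretrained weights for meaningful maps
activations, gradients = {}, {}

def save_activation(module, inp, out):
    activations["value"] = out.detach()

def save_gradient(module, grad_in, grad_out):
    gradients["value"] = grad_out[0].detach()

# Hook the last convolutional block; earlier layers give finer but less semantic maps.
model.layer4.register_forward_hook(save_activation)
model.layer4.register_full_backward_hook(save_gradient)

image = torch.randn(1, 3, 224, 224)            # stand-in for a preprocessed image
scores = model(image)
scores[0, scores.argmax()].backward()          # gradient of the predicted class score

weights = gradients["value"].mean(dim=(2, 3), keepdim=True)   # global-average-pool the gradients
cam = F.relu((weights * activations["value"]).sum(dim=1))     # weighted sum of feature maps
cam = F.interpolate(cam.unsqueeze(1), size=(224, 224), mode="bilinear")
```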
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning.
VQA: Visual Question Answering
Given an image and a natural language question about the image, the task is to provide an accurate natural language answer.
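For a concrete sense of the task's input/output contract, the sketch below runs an off-the-shelf VQA model through Hugging Face Transformers. The checkpoint name and the treatment of answers as a fixed label vocabulary are properties of this particular model, not part of the task definition.

```python
# Minimal VQA inference sketch (assumes transformers, PIL, and this public ViLT checkpoint).
from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

image = Image.open("example.jpg")              # any RGB image
question = "What color is the umbrella?"

inputs = processor(image, question, return_tensors="pt")
logits = model(**inputs).logits                # scores over a fixed answer vocabulary
print(model.config.id2label[logits.argmax(-1).item()])
```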
A simple neural network module for relational reasoning
Relational reasoning is a central component of generally intelligent behavior, but has proven difficult for neural networks to learn.
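The proposed Relation Network is compact: a shared MLP g scores every pair of "objects" (e.g., CNN feature-map cells) together with the question embedding, the pairwise outputs are summed, and a second MLP f maps the sum to answer logits. The sketch below is a minimal rendering of that idea; all sizes are arbitrary choices.

```python
# Minimal Relation Network sketch: RN(O) = f( sum_{i,j} g(o_i, o_j, q) ).
# Object count, feature sizes, and layer widths are illustrative.
import torch
import torch.nn as nn

class RelationNetwork(nn.Module):
    def __init__(self, obj_dim=256, q_dim=128, hidden=256, num_answers=1000):
        super().__init__()
        self.g = nn.Sequential(
            nn.Linear(2 * obj_dim + q_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.f = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_answers),
        )

    def forward(self, objects, question):
        # objects: (batch, n_obj, obj_dim); question: (batch, q_dim)
        b, n, d = objects.shape
        o_i = objects.unsqueeze(2).expand(b, n, n, d)
        o_j = objects.unsqueeze(1).expand(b, n, n, d)
        q = question[:, None, None, :].expand(b, n, n, question.shape[-1])
        pairs = torch.cat([o_i, o_j, q], dim=-1)     # every ordered object pair + question
        relations = self.g(pairs).sum(dim=(1, 2))    # sum over all pairs
        return self.f(relations)

logits = RelationNetwork()(torch.randn(2, 16, 256), torch.randn(2, 128))
```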
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models.
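BLIP-2 keeps the image encoder and the language model frozen and trains only a lightweight Q-Former to bridge them; at inference the whole stack is used like one conditional generator. The sketch below shows that usage via Hugging Face Transformers, assuming the named public checkpoint is available.

```python
# Minimal BLIP-2 VQA sketch (assumes transformers with BLIP-2 support and this public checkpoint).
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

image = Image.open("example.jpg")
prompt = "Question: how many people are in the picture? Answer:"

inputs = processor(images=image, text=prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20)
print(processor.decode(out[0], skip_special_tokens=True))
```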
Show, Ask, Attend, and Answer: A Strong Baseline For Visual Question Answering
This paper presents a new baseline for the visual question answering task.
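The baseline is roughly: pretrained CNN grid features, an LSTM question encoding, soft attention over image regions conditioned on the question, and a classifier over answers. The sketch below shows only the attention step; the dimensions and single-glimpse setup are simplifications rather than the paper's exact configuration.

```python
# Minimal question-conditioned soft attention over CNN grid features (sizes are illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedAttention(nn.Module):
    def __init__(self, img_dim=2048, q_dim=1024, hidden=512):
        super().__init__()
        self.proj = nn.Linear(img_dim + q_dim, hidden)
        self.score = nn.Linear(hidden, 1)

    def forward(self, img_feats, q_feat):
        # img_feats: (batch, regions, img_dim); q_feat: (batch, q_dim)
        q = q_feat.unsqueeze(1).expand(-1, img_feats.size(1), -1)
        scores = self.score(torch.tanh(self.proj(torch.cat([img_feats, q], dim=-1))))
        alpha = F.softmax(scores, dim=1)              # one weight per image region
        return (alpha * img_feats).sum(dim=1)         # attended image representation

attended = QuestionGuidedAttention()(torch.randn(2, 196, 2048), torch.randn(2, 1024))
```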
Dynamic Memory Networks for Visual and Textual Question Answering
Neural network architectures with memory and attention mechanisms exhibit certain reasoning capabilities required for question answering.
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language.
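The core architectural idea is a two-stream transformer in which the image stream and the text stream exchange keys and values through co-attentional layers. The sketch below reproduces only that exchange with nn.MultiheadAttention; it omits the feed-forward, residual, and normalization sublayers of the actual ViLBERT block.

```python
# Minimal co-attention sketch: each stream queries the other stream's keys/values.
# Layer sizes and head counts are illustrative; sublayers are omitted for brevity.
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.text_to_image = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.image_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text, image):
        # text: (batch, n_tokens, dim); image: (batch, n_regions, dim)
        text_out, _ = self.text_to_image(query=text, key=image, value=image)
        image_out, _ = self.image_to_text(query=image, key=text, value=text)
        return text_out, image_out

t, v = CoAttentionBlock()(torch.randn(2, 20, 768), torch.randn(2, 36, 768))
```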
GPT-4 Technical Report
We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs.
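In practice this means VQA can be posed directly as a chat request that mixes text and an image. The sketch below uses the OpenAI Python client's chat-completions interface; the model name and image URL are placeholders, and the exact API surface may differ across client versions.

```python
# Minimal image+text question via the OpenAI Python client (model name and URL are placeholders).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o",  # any GPT-4-class model that accepts vision input
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is unusual about this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```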
Visual Instruction Tuning
Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field.
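It helps to see what "machine-generated visual instruction-following data" looks like: each training record pairs an image with a multi-turn conversation, generated by a text-only model from captions and box annotations. The record below is a made-up example in a LLaVA-style conversation format, not an actual sample from the released dataset.

```python
# Illustrative (made-up) visual instruction-tuning record in a LLaVA-style conversation format.
example = {
    "image": "coco/train2017/000000123456.jpg",   # placeholder path
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is the person in the foreground doing?"},
        {"from": "gpt", "value": "They are riding a bicycle along a wet street while holding an umbrella."},
        {"from": "human", "value": "Why might riding like this be risky?"},
        {"from": "gpt", "value": "Holding an umbrella leaves only one hand on the handlebars, "
                                 "which makes braking and steering on a wet road harder."},
    ],
}
```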