Visual Reasoning
309 papers with code • 12 benchmarks • 44 datasets
Visual reasoning is the ability to understand the actions depicted in an image and to reason about them.
Most implemented papers
Learning Transferable Visual Models From Natural Language Supervision
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories.
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models.
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language.
Visual Instruction Tuning
Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field.
VSE++: Improving Visual-Semantic Embeddings with Hard Negatives
We present a new technique for learning visual-semantic embeddings for cross-modal retrieval.
Compositional Attention Networks for Machine Reasoning
We present the MAC network, a novel fully differentiable neural network architecture, designed to facilitate explicit and expressive reasoning.
VisualBERT: A Simple and Performant Baseline for Vision and Language
We propose VisualBERT, a simple and flexible framework for modeling a broad range of vision-and-language tasks.
LXMERT: Learning Cross-Modality Encoder Representations from Transformers
In LXMERT, we build a large-scale Transformer model that consists of three encoders: an object relationship encoder, a language encoder, and a cross-modality encoder.
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision.
FiLM: Visual Reasoning with a General Conditioning Layer
We introduce a general-purpose conditioning method for neural networks called FiLM: Feature-wise Linear Modulation.
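As a rough illustration of what "feature-wise linear modulation" means, the sketch below applies a per-channel scale and shift, gamma * x + beta, to a small feature map. In FiLM the gamma and beta values are predicted from a conditioning input such as the question; here they are hard-coded, and the function name `film` and the list-of-channels layout are assumptions made for this example only, not the authors' implementation.

```python
from typing import List, Sequence


def film(
    features: Sequence[Sequence[float]],
    gamma: Sequence[float],
    beta: Sequence[float],
) -> List[List[float]]:
    """Scale and shift each feature channel: gamma[c] * x + beta[c].

    `features` holds one inner sequence per channel (hypothetical layout).
    """
    if not (len(features) == len(gamma) == len(beta)):
        raise ValueError("features, gamma and beta need one entry per channel")
    return [
        [g * x + b for x in channel]
        for channel, g, b in zip(features, gamma, beta)
    ]


# Two channels with two activations each.
features = [[1.0, 2.0], [3.0, 4.0]]
# In a real model these would come from a network that reads the
# conditioning input; fixed values are used here for illustration.
gamma = [2.0, 0.0]
beta = [0.5, -1.0]

modulated = film(features, gamma, beta)
print(modulated)  # [[2.5, 4.5], [-1.0, -1.0]]
```

Setting gamma to 0 for a channel, as in the second channel above, replaces its activations with the constant beta, which is one way the conditioning input can switch features off.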