Visual Reasoning

207 papers with code • 12 benchmarks • 41 datasets

The ability to understand and reason about the actions, objects, and relationships depicted in visual images.

Libraries

Use these libraries to find Visual Reasoning models and implementations
See all 7 libraries.

Most implemented papers

Learning Transferable Visual Models From Natural Language Supervision

openai/CLIP 26 Feb 2021

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories.
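
CLIP instead learns from natural-language supervision, so an image can be scored against arbitrary text descriptions at test time. Below is a minimal zero-shot classification sketch using the openai/CLIP package; the image path and candidate captions are placeholders.

```python
# Minimal zero-shot classification sketch with the openai/CLIP package
# (pip install git+https://github.com/openai/CLIP.git); the image file and
# candidate captions are illustrative placeholders.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
texts = clip.tokenize(["a photo of a dog", "a photo of a cat"]).to(device)

with torch.no_grad():
    # Cosine-similarity logits between the image and each candidate caption
    logits_per_image, _ = model(image, texts)
    probs = logits_per_image.softmax(dim=-1)

print(probs)  # probability assigned to each caption
```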

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

salesforce/lavis 30 Jan 2023

The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models.
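
BLIP-2 keeps both the image encoder and the language model frozen and trains only a lightweight Q-Former between them. A minimal visual question answering sketch with the salesforce/lavis toolkit (pip install salesforce-lavis) is below; the model name and type strings follow the LAVIS README and may change between releases, and the image path and prompt are placeholders.

```python
# BLIP-2 VQA sketch via salesforce/lavis; model name/type strings follow the
# LAVIS README and may differ across releases.
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_opt", model_type="pretrain_opt2.7b", is_eval=True, device=device
)

raw_image = Image.open("example.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# The frozen image encoder feeds the Q-Former, which conditions the frozen LLM.
answer = model.generate({"image": image, "prompt": "Question: what is shown? Answer:"})
print(answer)
```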

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

facebookresearch/vilbert-multi-task NeurIPS 2019

We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language.
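
ViLBERT processes image regions and text in two separate Transformer streams that exchange information through co-attentional layers, where each modality queries the other. The following is a simplified sketch of one such co-attention block for illustration only; it is not taken from facebookresearch/vilbert-multi-task, and the dimensions are arbitrary.

```python
# Simplified co-attention block in the spirit of ViLBERT's two-stream design;
# an illustrative sketch, not the classes from facebookresearch/vilbert-multi-task.
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        # Each stream queries the other modality (keys/values are swapped).
        self.txt_attends_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_attends_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_txt = nn.LayerNorm(dim)
        self.norm_img = nn.LayerNorm(dim)

    def forward(self, txt, img):
        # txt: (B, T, dim) word states; img: (B, R, dim) detector region features
        t, _ = self.txt_attends_img(query=txt, key=img, value=img)
        v, _ = self.img_attends_txt(query=img, key=txt, value=txt)
        return self.norm_txt(txt + t), self.norm_img(img + v)

txt = torch.randn(2, 20, 768)   # tokenized caption/question features
img = torch.randn(2, 36, 768)   # projected region features
txt_out, img_out = CoAttentionBlock()(txt, img)
```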

VSE++: Improving Visual-Semantic Embeddings with Hard Negatives

fartashf/vsepp 18 Jul 2017

We present a new technique for learning visual-semantic embeddings for cross-modal retrieval.
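
The key change in VSE++ is replacing the usual sum over negatives in the ranking loss with the hardest negative in the mini-batch. A short PyTorch sketch of that max-hinge loss is below, assuming L2-normalized embeddings; the margin value is a typical setting, not prescribed here.

```python
# Sketch of the VSE++ max-hinge (hardest-negative) ranking loss; shapes and the
# margin value are illustrative.
import torch

def vse_pp_loss(im, txt, margin=0.2):
    # im, txt: (B, D) L2-normalized image / caption embeddings; row i of each is a positive pair
    scores = im @ txt.t()                                   # (B, B) similarity matrix
    pos = scores.diag().view(-1, 1)
    cost_txt = (margin + scores - pos).clamp(min=0)         # image as anchor, captions as negatives
    cost_im = (margin + scores - pos.t()).clamp(min=0)      # caption as anchor, images as negatives
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_txt = cost_txt.masked_fill(mask, 0)
    cost_im = cost_im.masked_fill(mask, 0)
    # VSE++: take the hardest negative (max) instead of summing over all negatives
    return cost_txt.max(dim=1)[0].mean() + cost_im.max(dim=0)[0].mean()
```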

Compositional Attention Networks for Machine Reasoning

stanfordnlp/mac-network ICLR 2018

We present the MAC network, a novel fully differentiable neural network architecture, designed to facilitate explicit and expressive reasoning.
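
Each MAC reasoning step decomposes into a control unit (attending over question words), a read unit (attending over a knowledge base of image features), and a write unit (updating a recurrent memory). The sketch below is a heavily simplified single cell for illustration; it is not the stanfordnlp/mac-network implementation, and all layer choices are schematic.

```python
# Highly simplified single MAC cell (control / read / write); an illustrative
# sketch only, not the code from stanfordnlp/mac-network.
import torch
import torch.nn as nn

class MACCell(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.control_proj = nn.Linear(2 * dim, dim)
        self.control_attn = nn.Linear(dim, 1)
        self.read_proj = nn.Linear(2 * dim, dim)
        self.read_attn = nn.Linear(dim, 1)
        self.write = nn.Linear(2 * dim, dim)

    def forward(self, words, kb, control, memory, question):
        # words: (B, T, D) contextual word states; kb: (B, N, D) image knowledge base
        # Control: decide which question words matter at this reasoning step
        cq = self.control_proj(torch.cat([control, question], dim=-1))
        ca = torch.softmax(self.control_attn(cq.unsqueeze(1) * words), dim=1)
        control = (ca * words).sum(dim=1)
        # Read: retrieve KB information relevant to the control and current memory
        interactions = self.read_proj(torch.cat([kb * memory.unsqueeze(1), kb], dim=-1))
        ra = torch.softmax(self.read_attn(interactions * control.unsqueeze(1)), dim=1)
        retrieved = (ra * kb).sum(dim=1)
        # Write: integrate the retrieved information into the recurrent memory
        memory = self.write(torch.cat([retrieved, memory], dim=-1))
        return control, memory
```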

LXMERT: Learning Cross-Modality Encoder Representations from Transformers

airsplay/lxmert IJCNLP 2019

In LXMERT, we build a large-scale Transformer model that consists of three encoders: an object relationship encoder, a language encoder, and a cross-modality encoder.
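
The language encoder consumes token ids, the object-relationship encoder consumes detector region features with box coordinates, and the cross-modality encoder fuses the two. Below is a sketch using the Hugging Face port of LXMERT; the checkpoint name follows the transformers documentation, and the random visual features stand in for real Faster R-CNN outputs.

```python
# LXMERT sketch via its Hugging Face port; the random visual features are
# placeholders for detector (e.g. Faster R-CNN) region features.
import torch
from transformers import LxmertTokenizer, LxmertModel

tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-base-uncased")
model = LxmertModel.from_pretrained("unc-nlp/lxmert-base-uncased")

inputs = tokenizer("What color is the car?", return_tensors="pt")
visual_feats = torch.randn(1, 36, 2048)   # 36 region features from an object detector
visual_pos = torch.rand(1, 36, 4)         # normalized bounding-box coordinates

outputs = model(**inputs, visual_feats=visual_feats, visual_pos=visual_pos)
print(outputs.language_output.shape, outputs.vision_output.shape, outputs.pooled_output.shape)
```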

VisualBERT: A Simple and Performant Baseline for Vision and Language

uclanlp/visualbert 9 Aug 2019

We propose VisualBERT, a simple and flexible framework for modeling a broad range of vision-and-language tasks.
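
VisualBERT appends pre-extracted region features to the BERT input sequence and lets self-attention implicitly align the two modalities. A sketch via the transformers port is below; the checkpoint name follows the library's documentation, the random embeddings stand in for detector features, and the expected feature dimension depends on the checkpoint.

```python
# VisualBERT sketch via its Hugging Face port; visual embeddings here are random
# placeholders for pre-extracted detector region features.
import torch
from transformers import BertTokenizer, VisualBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = VisualBertModel.from_pretrained("uclanlp/visualbert-vqa-coco-pre")

inputs = tokenizer("Who is wearing a red shirt?", return_tensors="pt")
visual_embeds = torch.randn(1, 36, 2048)                 # region features
visual_token_type_ids = torch.ones(1, 36, dtype=torch.long)
visual_attention_mask = torch.ones(1, 36)

outputs = model(
    **inputs,
    visual_embeds=visual_embeds,
    visual_token_type_ids=visual_token_type_ids,
    visual_attention_mask=visual_attention_mask,
)
print(outputs.last_hidden_state.shape)  # joint text+vision sequence representation
```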

UNITER: UNiversal Image-TExt Representation Learning

ChenRocks/UNITER ECCV 2020

Different from previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of image/text).
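
In other words, each pre-training step masks only one modality while the other stays fully observed. The snippet below is a small illustrative sketch of that conditional-masking logic; the function and variable names are hypothetical and not taken from ChenRocks/UNITER.

```python
# Illustrative sketch of UNITER-style conditional masking: in any one pre-training
# step only one modality is masked; names here are hypothetical, not from ChenRocks/UNITER.
import torch

def conditional_mask(text_ids, region_feats, task, mask_prob=0.15, mask_token_id=103):
    text_ids = text_ids.clone()
    region_feats = region_feats.clone()
    if task == "mlm":
        # Masked language modeling conditioned on the *full* set of image regions
        text_mask = torch.rand(text_ids.shape) < mask_prob
        text_ids[text_mask] = mask_token_id      # BERT-style [MASK] id
    elif task == "mrm":
        # Masked region modeling conditioned on the *full* text sequence
        region_mask = torch.rand(region_feats.shape[:2]) < mask_prob
        region_feats[region_mask] = 0.0
    return text_ids, region_feats
```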

VinVL: Revisiting Visual Representations in Vision-Language Models

pzzhang/VinVL CVPR 2021

In our experiments, we feed the visual features generated by the new object detection model into a Transformer-based VL fusion model, OSCAR (Li et al., 2020), and utilize an improved approach to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks.

Visual Instruction Tuning

haotian-liu/LLaVA NeurIPS 2023

Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field.
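
LLaVA connects a CLIP vision encoder to an LLM and instruction-tunes it on machine-generated multimodal instruction-following data. Below is a sketch of querying a LLaVA checkpoint through the community Hugging Face port (a recent transformers release is required); the original haotian-liu/LLaVA repository ships its own CLI and training scripts, and the checkpoint name and prompt template here follow the llava-hf model cards.

```python
# LLaVA inference sketch via the community Hugging Face port; checkpoint name and
# prompt template follow the llava-hf model cards, and the image path is a placeholder.
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("example.jpg")
prompt = "USER: <image>\nWhat objects are on the table, and how many? ASSISTANT:"

inputs = processor(text=prompt, images=image, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```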