Visual Commonsense Reasoning
33 papers with code • 7 benchmarks • 8 datasets
Most implemented papers
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language.
UNITER: UNiversal Image-TExt Representation Learning
Different from previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of image/text).
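A minimal sketch of the conditional-masking idea, using generic PyTorch tensors and hypothetical names (not UNITER's actual code): each pre-training step masks only one modality, leaving the other fully observed.

```python
import torch

def conditional_mask(text_ids, region_feats, mask_token_id, p=0.15):
    """Mask only ONE modality per step; the other stays fully observed.

    text_ids:     (B, L)    token ids
    region_feats: (B, R, D) image region features
    """
    if torch.rand(1).item() < 0.5:
        # Masked language modeling, conditioned on all image regions.
        mlm_mask = torch.rand_like(text_ids, dtype=torch.float) < p
        masked_text = text_ids.masked_fill(mlm_mask, mask_token_id)
        return masked_text, region_feats, mlm_mask, None
    else:
        # Masked region modeling, conditioned on the full text.
        mrm_mask = torch.rand(region_feats.shape[:2]) < p
        masked_regions = region_feats * (~mrm_mask).unsqueeze(-1).float()
        return text_ids, masked_regions, None, mrm_mask
```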
From Recognition to Cognition: Visual Commonsense Reasoning
While this task is easy for humans, it is tremendously difficult for today's vision systems, requiring higher-order cognition and commonsense reasoning about the world.
ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
Furthermore, we present ViP-Bench, a comprehensive benchmark to assess the capability of models in understanding visual prompts across multiple dimensions, enabling future research in this domain.
VL-BERT: Pre-training of Generic Visual-Linguistic Representations
We introduce a new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT for short).
GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest
Before being sent to the LLM, each region reference is replaced by its RoI features and interleaved with the language embeddings as a single sequence.
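A rough sketch of that interleaving step, with hypothetical names (a region placeholder id, a linear projection) rather than the repository's actual API: projected RoI features are spliced into the language-embedding sequence in place of placeholder tokens before the sequence is fed to the LLM.

```python
import torch
import torch.nn as nn

class RegionInterleaver(nn.Module):
    """Splice projected RoI features into the token-embedding sequence
    wherever a region placeholder token appears."""

    def __init__(self, roi_dim, hidden_dim, region_token_id):
        super().__init__()
        self.proj = nn.Linear(roi_dim, hidden_dim)  # RoI feature -> LLM hidden size
        self.region_token_id = region_token_id

    def forward(self, input_ids, token_embeds, roi_feats):
        # input_ids:    (L,)      token ids, containing region placeholders
        # token_embeds: (L, H)    language embeddings for those tokens
        # roi_feats:    (N, Droi) one feature per referenced region, in order
        roi_embeds = self.proj(roi_feats)
        out, r = [], 0
        for i, tok in enumerate(input_ids.tolist()):
            if tok == self.region_token_id:
                out.append(roi_embeds[r])  # replace placeholder with RoI feature
                r += 1
            else:
                out.append(token_embeds[i])
        return torch.stack(out)  # interleaved sequence fed to the LLM
```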
Large-Scale Adversarial Training for Vision-and-Language Representation Learning
We present VILLA, the first known effort on large-scale adversarial training for vision-and-language (V+L) representation learning.
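As a simplified, single-step illustration of adversarial training in embedding space (hypothetical model interface and hyperparameters, not VILLA's actual procedure): a perturbation is optimized to increase the loss, clipped to a small ball, and the model is then trained on the perturbed embeddings.

```python
import torch

def adversarial_embedding_step(model, txt_emb, img_emb, labels, eps=1e-2, alpha=1e-3):
    """One illustrative adversarial step on text/image embeddings.
    `model(txt_emb, img_emb, labels)` is assumed to return a scalar loss."""
    delta_t = torch.zeros_like(txt_emb, requires_grad=True)
    delta_i = torch.zeros_like(img_emb, requires_grad=True)

    # Forward pass with perturbed embeddings; gradients flow into the deltas.
    loss = model(txt_emb + delta_t, img_emb + delta_i, labels)
    loss.backward()

    # Ascend the loss surface, then clamp the perturbation to an eps-ball.
    with torch.no_grad():
        delta_t = (delta_t + alpha * delta_t.grad.sign()).clamp(-eps, eps)
        delta_i = (delta_i + alpha * delta_i.grad.sign()).clamp(-eps, eps)

    # Train on the adversarially perturbed embeddings.
    adv_loss = model(txt_emb + delta_t.detach(), img_emb + delta_i.detach(), labels)
    return adv_loss
```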
Unifying Vision-and-Language Tasks via Text Generation
On 7 popular vision-and-language benchmarks, including visual question answering, referring expression comprehension, and visual commonsense reasoning (most of which have previously been modeled as discriminative tasks), our generative approach with a single unified architecture reaches performance comparable to recent task-specific state-of-the-art vision-and-language models.
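A small illustration of what casting these tasks as text generation can look like, with made-up task prefixes and templates (not the paper's exact prompts): each task is mapped to a source/target text pair handled by one sequence-to-sequence model.

```python
# Illustrative only: different V+L tasks framed as text generation via
# task-prefix prompts (hypothetical templates, not the paper's own).
def to_text_generation_example(task, question=None, answer=None, choices=None):
    if task == "vqa":
        source = f"vqa: question: {question}"
        target = answer
    elif task == "vcr_qa":
        opts = " ".join(f"({i}) {c}" for i, c in enumerate(choices))
        source = f"vcr qa: question: {question} choices: {opts}"
        target = answer  # the model generates the answer text / choice index
    elif task == "refer":
        source = f"grounding: {question}"
        target = answer  # e.g. a region tag produced as text
    else:
        raise ValueError(task)
    return source, target
```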
X-modaler: A Versatile and High-performance Codebase for Cross-modal Analytics
Nevertheless, there has not been an open-source codebase that supports training and deploying numerous neural network models for cross-modal analytics in a unified and modular fashion.
Think Visually: Question Answering through Virtual Imagery
In this paper, we study the problem of geometric reasoning in the context of question-answering.