Visual Commonsense Reasoning
29 papers with code • 7 benchmarks • 7 datasets
Latest papers with no code
Making Large Multimodal Models Understand Arbitrary Visual Prompts
We present ViP-Bench, a comprehensive benchmark for assessing how well models understand visual prompts across multiple dimensions, enabling future research in this domain.
Improving Vision-and-Language Reasoning via Spatial Relations Modeling
We design two pre-training tasks, object position regression (OPR) and spatial relation classification (SRC), which learn to reconstruct the spatial relation graph.
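To make the two pre-training tasks concrete, here is a minimal sketch of what such heads could look like; the module names, feature dimension, and number of relation classes are assumptions for illustration, not details from the paper.

```python
import torch
import torch.nn as nn

FEAT_DIM = 768                 # assumed object feature dimension
NUM_RELATION_CLASSES = 11      # assumed spatial-relation label set size

class ObjectPositionRegressionHead(nn.Module):
    """OPR: regress each object's box (x, y, w, h) from its feature."""
    def __init__(self, feat_dim: int = FEAT_DIM):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.GELU(), nn.Linear(feat_dim, 4)
        )

    def forward(self, obj_feats: torch.Tensor) -> torch.Tensor:
        # obj_feats: (batch, num_objects, feat_dim) -> (batch, num_objects, 4)
        return self.mlp(obj_feats)

class SpatialRelationClassificationHead(nn.Module):
    """SRC: classify the spatial relation between every pair of objects."""
    def __init__(self, feat_dim: int = FEAT_DIM,
                 num_classes: int = NUM_RELATION_CLASSES):
        super().__init__()
        self.classifier = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, obj_feats: torch.Tensor) -> torch.Tensor:
        # Concatenate features of every (i, j) object pair.
        b, n, d = obj_feats.shape
        fi = obj_feats.unsqueeze(2).expand(b, n, n, d)
        fj = obj_feats.unsqueeze(1).expand(b, n, n, d)
        pair = torch.cat([fi, fj], dim=-1)     # (b, n, n, 2d)
        return self.classifier(pair)           # (b, n, n, num_classes)

# Typical loss choices for such heads: MSE for OPR, cross-entropy for SRC.
opr_loss_fn = nn.MSELoss()
src_loss_fn = nn.CrossEntropyLoss()
```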
ViCor: Bridging Visual Understanding and Commonsense Reasoning with Large Language Models
We categorize the problem of VCR into visual commonsense understanding (VCU) and visual commonsense inference (VCI).
Discovering Novel Actions in an Open World with Object-Grounded Visual Commonsense Reasoning
Learning to infer labels in an open world, i.e., in an environment where the target "labels" are unknown, is an important characteristic for achieving autonomy.
GRILL: Grounded Vision-language Pre-training via Aligning Text and Image Regions
Generalization to unseen tasks is an important ability for few-shot learners to achieve better zero-/few-shot performance on diverse tasks.
CAVL: Learning Contrastive and Adaptive Representations of Vision and Language
Visual and linguistic pre-training aims to learn vision and language representations together, which can be transferred to visual-linguistic downstream tasks.
Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images
We introduce WHOOPS!, a new dataset and benchmark for visual commonsense.
Learning to Agree on Vision Attention for Visual Commonsense Reasoning
Visual Commonsense Reasoning (VCR) remains a significant yet challenging research problem in the realm of visual reasoning.
Multi-modal Large Language Model Enhanced Pseudo 3D Perception Framework for Visual Commonsense Reasoning
BLIP-2, as an MLLM, is employed to process images and texts, and referring expressions in the text that involve specific visual objects are replaced with linguistic object labels so they serve as comprehensible MLLM inputs.
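As a rough illustration of this kind of text preprocessing, the sketch below replaces dataset-style referring tags with plain object labels; the "[name+index]" tag format, the label mapping, and the fallback rule are assumptions, not the paper's actual implementation.

```python
import re

def replace_referring_expressions(text: str, object_labels: dict[str, str]) -> str:
    """Replace referring tags such as '[person1]' with linguistic object labels
    so an off-the-shelf MLLM can read the text. Tag format is an assumption."""
    def _sub(match: re.Match) -> str:
        tag = match.group(1)                         # e.g. 'person1'
        # Fall back to the bare category name if no explicit label is given.
        return object_labels.get(tag, re.sub(r"\d+$", "", tag))
    return re.sub(r"\[([A-Za-z]+\d*)\]", _sub, text)

# Usage: '[person1]' becomes 'the man on the left'; '[car2]' falls back to 'car'.
print(replace_referring_expressions(
    "Why is [person1] pointing at [car2]?",
    {"person1": "the man on the left"},
))
```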
A survey on knowledge-enhanced multimodal learning
Multimodal learning has been a field of increasing interest, aiming to combine various modalities in a single joint representation.