Visual Commonsense Reasoning

29 papers with code • 7 benchmarks • 7 datasets

Latest papers with no code

Making Large Multimodal Models Understand Arbitrary Visual Prompts

no code yet • 1 Dec 2023

Furthermore, we present ViP-Bench, a comprehensive benchmark to assess the capability of models in understanding visual prompts across multiple dimensions, enabling future research in this domain.

Improving Vision-and-Language Reasoning via Spatial Relations Modeling

no code yet • 9 Nov 2023

Further, we design two pre-training tasks, object position regression (OPR) and spatial relation classification (SRC), which together learn to reconstruct the spatial relation graph.
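The paper releases no code, but a minimal sketch of what such pre-training heads could look like on top of pooled region features is shown below; all module names, dimensions, the relation label set, and the bounding-box target format are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SpatialPretrainingHeads(nn.Module):
    """Hypothetical heads for object position regression (OPR) and
    spatial relation classification (SRC) over region features."""

    def __init__(self, hidden_dim: int = 768, num_relations: int = 9):
        super().__init__()
        # OPR: regress each region's normalized box (x, y, w, h).
        self.opr_head = nn.Linear(hidden_dim, 4)
        # SRC: classify the spatial relation between a pair of regions
        # (e.g. left-of, above, overlapping) from their joint features.
        self.src_head = nn.Linear(2 * hidden_dim, num_relations)
        self.opr_loss = nn.SmoothL1Loss()
        self.src_loss = nn.CrossEntropyLoss()

    def forward(self, region_feats, boxes, pair_idx, rel_labels):
        # region_feats: (num_regions, hidden_dim) encoder outputs
        # boxes:        (num_regions, 4) normalized ground-truth boxes
        # pair_idx:     (num_pairs, 2) indices of region pairs
        # rel_labels:   (num_pairs,) spatial relation class per pair
        opr = self.opr_loss(self.opr_head(region_feats), boxes)
        pair_feats = torch.cat(
            [region_feats[pair_idx[:, 0]], region_feats[pair_idx[:, 1]]],
            dim=-1,
        )
        src = self.src_loss(self.src_head(pair_feats), rel_labels)
        # Combined auxiliary loss; a weighting between the two terms
        # would be a tunable choice in practice.
        return opr + src
```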

ViCor: Bridging Visual Understanding and Commonsense Reasoning with Large Language Models

no code yet • 9 Oct 2023

We categorize the problem of VCR into visual commonsense understanding (VCU) and visual commonsense inference (VCI).

Discovering Novel Actions in an Open World with Object-Grounded Visual Commonsense Reasoning

no code yet • 26 May 2023

Learning to infer labels in an open world, i.e., in an environment where the target "labels" are unknown, is an important characteristic for achieving autonomy.

GRILL: Grounded Vision-language Pre-training via Aligning Text and Image Regions

no code yet • 24 May 2023

Generalization to unseen tasks is an important ability for few-shot learners to achieve better zero-/few-shot performance on diverse tasks.

CAVL: Learning Contrastive and Adaptive Representations of Vision and Language

no code yet • 10 Apr 2023

Visual and linguistic pre-training aims to learn vision and language representations together, which can be transferred to visual-linguistic downstream tasks.

Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images

no code yet • ICCV 2023

We introduce WHOOPS!, a new dataset and benchmark for visual commonsense.

Learning to Agree on Vision Attention for Visual Commonsense Reasoning

no code yet • 4 Feb 2023

Visual Commonsense Reasoning (VCR) remains a significant yet challenging research problem in the realm of visual reasoning.

Multi-modal Large Language Model Enhanced Pseudo 3D Perception Framework for Visual Commonsense Reasoning

no code yet • 30 Jan 2023

BLIP-2 is employed as an MLLM to process images and texts, and the referring expressions in texts that involve specific visual objects are rewritten with linguistic object labels so that they serve as comprehensible MLLM inputs.
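No code accompanies the paper, but the rewriting step can be pictured as plain text preprocessing; the "[idx]" tag format and the helper below are purely illustrative assumptions, not the authors' pipeline.

```python
import re

def ground_referring_expressions(text: str, object_labels: dict[int, str]) -> str:
    """Replace region references like "[2]" with linguistic object labels
    (e.g. "the person") so a text-only language model such as the one
    inside BLIP-2 can read them. The "[idx]" tag format is an assumption."""
    def substitute(match: re.Match) -> str:
        idx = int(match.group(1))
        # Leave the tag untouched if no label is known for this region.
        return object_labels.get(idx, match.group(0))
    return re.sub(r"\[(\d+)\]", substitute, text)

# Example: "[0]" and "[1]" point at detected regions.
labels = {0: "the woman", 1: "the red car"}
print(ground_referring_expressions("Why is [0] pointing at [1]?", labels))
# -> "Why is the woman pointing at the red car?"
```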

A survey on knowledge-enhanced multimodal learning

no code yet • 19 Nov 2022

Multimodal learning has been a field of increasing interest, aiming to combine various modalities in a single joint representation.