Visual Relationship Detection
36 papers with code • 5 benchmarks • 5 datasets
Visual relationship detection (VRD) is a recently developed computer vision task that aims to recognize relations or interactions between objects in an image. It builds on object recognition and is a further step toward fully understanding images, and ultimately the visual world.
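VRD outputs are commonly represented as ⟨subject, predicate, object⟩ triplets, with each entity grounded by a bounding box. A minimal sketch of that data structure (class and field names are illustrative, not from any specific paper):

```python
from dataclasses import dataclass

@dataclass
class DetectedObject:
    label: str
    box: tuple  # (x1, y1, x2, y2) in pixel coordinates

@dataclass
class Relationship:
    subject: DetectedObject
    predicate: str
    object: DetectedObject

person = DetectedObject("person", (10, 20, 110, 220))
bike = DetectedObject("bicycle", (40, 120, 200, 260))
rel = Relationship(person, "riding", bike)
print(f"<{rel.subject.label}, {rel.predicate}, {rel.object.label}>")
```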
Latest papers
Groupwise Query Specialization and Quality-Aware Multi-Assignment for Transformer-based Visual Relationship Detection
Groupwise Query Specialization trains a specialized query by dividing queries and relations into disjoint groups and directing a query in a specific query group solely toward relations in the corresponding relation group.
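The grouping idea can be sketched as a fixed query-to-group assignment, where each query only predicts relations from its own group. This is a simplified illustration of the partitioning, not the paper's actual training scheme:

```python
def assign_groups(num_queries: int, num_groups: int) -> dict:
    # Queries and relation classes are split into the same number of
    # disjoint groups; a query is directed solely toward relations
    # in its corresponding relation group.
    return {q: q % num_groups for q in range(num_queries)}

groups = assign_groups(6, 3)
# Queries 0 and 3 specialize on relation group 0, 1 and 4 on group 1, etc.
print(groups)  # {0: 0, 1: 1, 2: 2, 3: 0, 4: 1, 5: 2}
```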
Video Relationship Detection Using Mixture of Experts
Classifiers trained as a single, monolithic neural network often lack stability and generalization.
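A generic mixture-of-experts classifier addresses this by combining several expert predictors through a learned gating function. A minimal NumPy sketch of the inference step (the architecture and dimensions are illustrative, not the paper's model):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, n_experts, n_classes = 8, 3, 5
x = rng.normal(size=d)                                 # input feature vector
gate_w = rng.normal(size=(n_experts, d))               # gating network weights
expert_w = rng.normal(size=(n_experts, n_classes, d))  # one linear expert each

gate = softmax(gate_w @ x)                             # expert weights, sum to 1
expert_logits = np.einsum('ecd,d->ec', expert_w, x)    # per-expert class scores
mixed = gate @ expert_logits                           # gate-weighted combination
probs = softmax(mixed)                                 # final class distribution
```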
Self-Supervised Learning for Visual Relationship Detection through Masked Bounding Box Reconstruction
We present a novel self-supervised approach for representation learning, particularly for the task of Visual Relationship Detection (VRD).
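The core pretext task is to hide the image region inside an object's bounding box and train the model to reconstruct it from surrounding context. A minimal sketch of the masking step, assuming an H×W×C array layout (the function is illustrative, not the paper's implementation):

```python
import numpy as np

def mask_box(image: np.ndarray, box: tuple, fill: float = 0.0) -> np.ndarray:
    """Blank out the pixels inside (x1, y1, x2, y2); a reconstruction
    model would be trained to fill this region back in."""
    x1, y1, x2, y2 = box
    masked = image.copy()
    masked[y1:y2, x1:x2, :] = fill
    return masked

img = np.ones((8, 8, 3))
out = mask_box(img, (2, 2, 5, 5))
```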
STUPD: A Synthetic Dataset for Spatial and Temporal Relation Reasoning
In addition to spatial relations, we also propose 50K visual depictions across 10 temporal relations, consisting of videos depicting event/time-point interactions.
Unified Visual Relationship Detection with Vision and Language Models
To address this challenge, we propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models (VLMs).
Distance-Aware Occlusion Detection with Focused Attention
In this work, (1) we propose a novel three-decoder architecture as the infrastructure for focused attention; (2) we use the generalized intersection box prediction task to effectively guide our model to focus on occlusion-specific regions; (3) our model achieves a new state-of-the-art performance on distance-aware relationship detection.
Neural Message Passing for Visual Relationship Detection
Visual relationship detection aims to detect the interactions between objects in an image; however, this task suffers from combinatorial explosion due to the variety of objects and interactions.
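The combinatorial explosion is easy to quantify: every ordered pair of distinct objects can take any predicate, so the candidate-triplet count grows quadratically in the number of objects. A quick illustration (the counts are examples, not from the paper):

```python
def candidate_triplets(num_objects: int, num_predicates: int) -> int:
    # Every ordered (subject, object) pair of distinct objects
    # can be labeled with any of the predicate classes.
    return num_objects * (num_objects - 1) * num_predicates

# e.g. 20 detected objects and 50 predicate classes
print(candidate_triplets(20, 50))  # 19000
```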
PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models
We show that PEVL enables state-of-the-art performance of detector-free VLP models on position-sensitive tasks such as referring expression comprehension and phrase grounding, and also improves the performance on position-insensitive tasks with grounded inputs.
Representing Prior Knowledge Using Randomly, Weighted Feature Networks for Visual Relationship Detection
Furthermore, background knowledge represented by RWFNs can be used to alleviate the incompleteness of training sets, even though the space complexity of RWFNs is much smaller than that of LTNs (a 1:27 ratio).
Image Scene Graph Generation (SGG) Benchmark
There is a surge of interest in image scene graph generation (object, attribute, and relationship detection), driven by the need to build fine-grained image understanding models that go beyond object detection.