Human-Object Interaction Detection
132 papers with code • 6 benchmarks • 22 datasets
Human-Object Interaction (HOI) detection is the task of identifying a set of interactions in an image. It involves (i) localizing the subject (i.e., the human) and the target (i.e., the object) of each interaction, and (ii) classifying the interaction label.
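To make the output concrete, here is a minimal Python sketch, not tied to any particular benchmark or paper, of the ⟨human, object, interaction⟩ triplet structure that HOI detectors typically produce; the box format and label strings are illustrative.

```python
# A minimal sketch of an HOI detection output: each detection is a
# <human box, object box, interaction> triplet with a confidence score.
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels

@dataclass
class HOITriplet:
    human_box: Box       # localized subject (a person)
    object_box: Box      # localized target of the interaction
    object_label: str    # e.g. "bicycle" -- illustrative label
    interaction: str     # e.g. "riding" -- illustrative label
    score: float         # detector confidence

# An image is then described by a set of such triplets:
detections: List[HOITriplet] = [
    HOITriplet((40, 30, 180, 320), (20, 150, 260, 380), "bicycle", "riding", 0.91),
]
```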
Benchmarks
These leaderboards are used to track progress in Human-Object Interaction Detection.
Libraries
Use these libraries to find Human-Object Interaction Detection models and implementations.
Latest papers
Exploring the Potential of Large Foundation Models for Open-Vocabulary HOI Detection
In addition, these detectors primarily rely on category names and overlook the rich contextual information that language can provide, which is essential for capturing open-vocabulary concepts that are typically rare and not well represented by category names alone.
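As a rough illustration of the open-vocabulary idea, the sketch below scores a region feature against text embeddings of free-form descriptions rather than a fixed label set. The random placeholder embeddings stand in for a CLIP-style text/image encoder; the descriptions and dimensions are assumptions.

```python
# A minimal sketch of open-vocabulary interaction scoring: instead of a fixed
# classifier head, a visual feature is matched against text embeddings of
# free-form descriptions by cosine similarity.
import numpy as np

def cosine_scores(visual_feat: np.ndarray, text_feats: np.ndarray) -> np.ndarray:
    v = visual_feat / np.linalg.norm(visual_feat)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    return t @ v  # one similarity score per description

# Richer descriptions than bare category names can be scored the same way:
descriptions = [
    "a person riding a bicycle",
    "a person repairing a bicycle",   # rare concept, no extra training needed
    "a person carrying a surfboard",
]
rng = np.random.default_rng(0)
text_feats = rng.normal(size=(len(descriptions), 512))  # placeholder embeddings
visual_feat = rng.normal(size=512)                      # placeholder region feature
print(descriptions[int(np.argmax(cosine_scores(visual_feat, text_feats)))])
```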
Disentangled Pre-training for Human-Object Interaction Detection
Therefore, we propose an efficient disentangled pre-training method for HOI detection (DP-HOI) to address this problem.
Glance and Focus: Memory Prompting for Multi-Event Video Question Answering
Instead, we train an encoder-decoder to generate a set of dynamic event memories at the glancing stage.
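The snippet below is a generic DETR-style sketch of this pattern, not the paper's exact architecture: a small set of learned queries cross-attends to encoded frame tokens to produce memory vectors. All sizes and names (`memory_queries`, `frame_tokens`) are illustrative.

```python
# A generic sketch of decoding "a set of dynamic memories" from video tokens:
# learned queries cross-attend to encoder output via a transformer decoder.
import torch
import torch.nn as nn

d_model, n_memories, n_frames = 256, 8, 64
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
    num_layers=2,
)
memory_queries = nn.Parameter(torch.randn(1, n_memories, d_model))  # learned

frame_tokens = torch.randn(1, n_frames, d_model)        # encoder output (placeholder)
event_memories = decoder(memory_queries, frame_tokens)  # (1, n_memories, d_model)
print(event_memories.shape)
```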
Ins-HOI: Instance Aware Human-Object Interactions Recovery
To address this, we further propose a complementary training strategy that leverages synthetic data to introduce instance-level shape priors, enabling the disentanglement of occupancy fields for different instances.
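As a hedged illustration of what per-instance occupancy disentanglement might look like, the sketch below uses one small MLP that maps a 3D query point to a separate occupancy value per instance; the `InstanceOccupancy` name, network shape, and sizes are assumptions, not the paper's model.

```python
# A minimal sketch of instance-aware occupancy: one MLP maps a 3D point to
# separate occupancy logits per instance (e.g. human vs. object), so the two
# shapes can be separated even where they are in contact.
import torch
import torch.nn as nn

class InstanceOccupancy(nn.Module):
    def __init__(self, n_instances: int = 2, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_instances),  # one occupancy field per instance
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.net(points))  # (N, n_instances) in [0, 1]

points = torch.rand(1024, 3)       # query points in a unit cube
occ = InstanceOccupancy()(points)  # separate human/object occupancy per point
print(occ.shape)
```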
EgoPlan-Bench: Benchmarking Egocentric Embodied Planning with Multimodal Large Language Models
Given diverse environmental inputs, including real-time task progress, visual observations, and open-form language instructions, a proficient task planner is expected to predict feasible actions, which is a feat inherently achievable by Multimodal Large Language Models (MLLMs).
Instance Tracking in 3D Scenes from Egocentric Videos
We explore this problem by first introducing a new benchmark dataset, consisting of RGB and depth videos, per-frame camera pose, and instance-level annotations in both 2D camera and 3D world coordinates.
Detecting Any Human-Object Interaction Relationship: Universal HOI Detector with Spatial Prompt Learning on Foundation Models
We conduct a deep analysis of the three hierarchical features inherent in visual HOI detectors and propose a method for high-level relation extraction aimed at VL foundation models, which we call HO prompt-based learning.
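For readers unfamiliar with prompt-based learning, here is a generic CoOp-style sketch of the basic mechanism: learnable context vectors are prepended to a class name's token embeddings before a frozen text encoder. It is not the paper's specific HO prompt design, and all names and sizes are illustrative.

```python
# A generic prompt-learning sketch: a few learnable context vectors are
# prepended to the (frozen) token embeddings of a category name; only the
# context vectors are trained.
import torch
import torch.nn as nn

d_model, n_ctx, n_name_tokens = 512, 4, 3
prompt_ctx = nn.Parameter(torch.randn(n_ctx, d_model) * 0.02)  # learned context

name_tokens = torch.randn(n_name_tokens, d_model)        # frozen name embeddings
prompted = torch.cat([prompt_ctx, name_tokens], dim=0)   # (n_ctx + 3, d_model)
# `prompted` would now be fed to the frozen text encoder; only `prompt_ctx`
# receives gradients during training.
print(prompted.shape)
```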
Object-centric Video Representation for Long-term Action Anticipation
To recognize and predict human-object interactions, we use a Transformer-based neural architecture which allows the "retrieval" of relevant objects for action anticipation at various time scales.
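A minimal sketch of this attention-based "retrieval" pattern follows: an anticipation query cross-attends over per-object features, and the attention weights indicate which objects are relevant. This is the generic mechanism, not the paper's exact model; `action_query` and `object_feats` are placeholder names.

```python
# Cross-attention as object "retrieval": the attention weights over object
# features serve as per-object relevance scores for anticipating the next action.
import torch
import torch.nn as nn

d_model, n_objects = 256, 12
attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

action_query = torch.randn(1, 1, d_model)          # "what happens next?" query
object_feats = torch.randn(1, n_objects, d_model)  # tracked object features
context, weights = attn(action_query, object_feats, object_feats)
print(weights.shape)  # (1, 1, n_objects): per-object relevance
```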
Open-Set Image Tagging with Multi-Grained Text Supervision
Specifically, for predefined, commonly used tag categories, RAM++ shows enhancements of 10.2 mAP and 15.4 mAP over CLIP on OpenImages and ImageNet, respectively.
ENIGMA-51: Towards a Fine-Grained Understanding of Human-Object Interactions in Industrial Scenarios
ENIGMA-51 is a new egocentric dataset acquired in an industrial scenario by 19 subjects who followed instructions to complete the repair of electrical boards using industrial tools (e.g., an electric screwdriver) and equipment (e.g., an oscilloscope).