Therefore, TAVS is distinguished from previous temporal segmentation datasets by its multi-modal information, holistic view of categories, and hierarchical granularities.
Empirical studies show that some heuristics improve performance while others can make models more brittle or have other side effects.
Specifically, the Object Query is initialized with category priors obtained from an external object detection model, which yields better performance.
Moreover, we use an actor branch to predict the actor's interactions and propose a novel composition strategy based on center-point indexing to generate the final HOI prediction.
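The composition step is not specified in detail here, so the following is only a minimal sketch of one plausible reading of center-point indexing. It assumes that the actor branch outputs, for each actor, an interaction label and a target point, and that the final triplet is formed by indexing the detected object whose center lies nearest to that point. The names `Detection`, `ActorPrediction`, `target_center`, and `compose_hoi` are hypothetical and introduced purely for illustration.

```python
from dataclasses import dataclass
from math import hypot
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class Detection:
    label: str
    center: Point  # (x, y) center of the detected object box


@dataclass
class ActorPrediction:
    actor_center: Point
    interaction: str
    # Assumption: the actor branch also predicts where the interacted object is.
    target_center: Point


def compose_hoi(actors: List[ActorPrediction],
                objects: List[Detection]) -> List[Tuple[str, str, str]]:
    """Pair each actor prediction with the object whose center is nearest to its target point."""
    triplets = []
    if not objects:
        return triplets  # nothing to index into
    for actor in actors:
        nearest = min(
            objects,
            key=lambda obj: hypot(obj.center[0] - actor.target_center[0],
                                  obj.center[1] - actor.target_center[1]),
        )
        triplets.append(("human", actor.interaction, nearest.label))
    return triplets
```

Under these assumptions, an actor predicted to "hold" something near (11, 9) would be paired with a cup detected at (10, 10) rather than a bicycle detected at (50, 40). A real system would likely also apply a distance threshold and combine confidence scores, both of which this sketch omits.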
Human-Object Interaction (HOI) detection is a fundamental task in high-level human-centric scene understanding.
Our challenge includes two tasks: video structuring in the temporal dimension and multi-modal video classification.
Neural network models have gained unprecedented popularity in natural language processing due to their state-of-the-art performance and flexible end-to-end training scheme.