TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Long Term Action Anticipation	Ego4D	ObjectPrompt	ED@20 Action	92.90	# 2
Long Term Action Anticipation	Ego4D	ObjectPrompt	ED@20 Noun	73.96	# 2
Long Term Action Anticipation	Ego4D	ObjectPrompt	ED@20 Verb	72.65	# 3
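
For readers unfamiliar with the metric, here is a minimal sketch of how an edit-distance score such as ED@20 could be computed. It assumes that ED@20 compares predicted and ground-truth label sequences over the next 20 actions with a Damerau-Levenshtein-style distance, and that the best of several candidate sequences is kept (lower is better). The function names `edit_distance` and `ed_at_z` are hypothetical, the restricted (optimal string alignment) variant is used for brevity, and the scaling may differ from the leaderboard's; this is not the official evaluation code.

```python
def edit_distance(pred, target):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance
    between two label sequences. Illustrative only."""
    n, m = len(pred), len(target)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i  # delete all i predicted labels
    for j in range(m + 1):
        d[0][j] = j  # insert all j target labels
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if pred[i - 1] == target[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Swap of two adjacent labels counts as a single edit.
            if (i > 1 and j > 1
                    and pred[i - 1] == target[j - 2]
                    and pred[i - 2] == target[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[n][m]


def ed_at_z(candidates, target, z=20):
    """Best (minimum) edit distance over candidate sequences, truncated to
    z steps and divided by the target length (assumed normalization)."""
    target = target[:z]
    if not target:
        raise ValueError("target sequence must not be empty")
    return min(edit_distance(c[:z], target) for c in candidates) / len(target)
```

For example, `ed_at_z([["take", "cut", "put"]], ["take", "put", "cut"])` treats the swapped last two actions as one edit over three steps.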

Object-centric Video Representation for Long-term Action Anticipation

31 Oct 2023 · Ce Zhang, Changcheng Fu, Shijie Wang, Nakul Agarwal, Kwonjoon Lee, Chiho Choi, Chen Sun

This paper focuses on building object-centric representations for long-term action anticipation in videos. Our key motivation is that objects provide important cues to recognize and predict human-object interactions, especially when the predictions are longer term, as an observed "background" object could be used by the human actor in the future. We observe that existing object-based video recognition frameworks either assume the existence of in-domain supervised object detectors or follow a fully weakly-supervised pipeline to infer object locations from action labels. We propose to build object-centric video representations by leveraging visual-language pretrained models. This is achieved by "object prompts", an approach to extract task-specific object-centric representations from general-purpose pretrained models without finetuning. To recognize and predict human-object interactions, we use a Transformer-based neural architecture which allows the "retrieval" of relevant objects for action anticipation at various time scales. We conduct extensive evaluations on the Ego4D, 50Salads, and EGTEA Gaze+ benchmarks. Both quantitative and qualitative results confirm the effectiveness of our proposed method.
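
The "retrieval" of relevant objects mentioned in the abstract can be pictured as attention: a query vector for a video segment scores each object embedding, and the softmax-weighted sum becomes the object-aware feature. The following is a minimal pure-Python sketch under that assumption. The name `retrieve_objects`, the single-head scaled dot-product form, and the toy vectors are all hypothetical; the paper's actual architecture is not specified here and may differ.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def retrieve_objects(query, object_embeddings):
    """Single-head scaled dot-product attention (illustrative only).

    query: feature vector for a video segment (hypothetical).
    object_embeddings: one vector per candidate object (hypothetical).
    Returns (attention weights, weighted sum of object embeddings).
    """
    scale = math.sqrt(len(query))
    scores = [
        sum(q * k for q, k in zip(query, obj)) / scale
        for obj in object_embeddings
    ]
    weights = softmax(scores)
    dim = len(object_embeddings[0])
    pooled = [
        sum(w * obj[i] for w, obj in zip(weights, object_embeddings))
        for i in range(dim)
    ]
    return weights, pooled


# Toy usage: the query aligns with the first object, so it gets more weight.
weights, pooled = retrieve_objects([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

In this toy setup an object that is only in the "background" of the observed clip still receives a nonzero weight, which is one way to read the abstract's point that such objects can inform longer-term predictions.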
