TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Action Anticipation	EGTEA	InAViT	Top-1 Accuracy	67.8	# 1
Action Anticipation	EPIC-KITCHENS-100	InAViT	Recall@5	25.89	# 1
Action Anticipation	EPIC-KITCHENS-100 (test)	InAViT	recall@5	23.75	# 1

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/interaction-visual-transformer-for-egocentric/action-anticipation-on-egtea)](https://paperswithcode.com/sota/action-anticipation-on-egtea?p=interaction-visual-transformer-for-egocentric)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/interaction-visual-transformer-for-egocentric/action-anticipation-on-epic-kitchens-100)](https://paperswithcode.com/sota/action-anticipation-on-epic-kitchens-100?p=interaction-visual-transformer-for-egocentric)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/interaction-visual-transformer-for-egocentric/action-anticipation-on-epic-kitchens-100-test)](https://paperswithcode.com/sota/action-anticipation-on-epic-kitchens-100-test?p=interaction-visual-transformer-for-egocentric)`

Interaction Region Visual Transformer for Egocentric Action Anticipation

25 Nov 2022 · Debaditya Roy, Ramanathan Rajendiran, Basura Fernando

Human-object interaction is one of the most important visual cues, and we propose a novel way to represent human-object interactions for egocentric action anticipation. We propose a novel transformer variant that models interactions by computing the change in the appearance of objects and human hands caused by the execution of actions, and uses those changes to refine the video representation. Specifically, we model interactions between hands and objects using Spatial Cross-Attention (SCA) and further infuse contextual information using Trajectory Cross-Attention to obtain environment-refined interaction tokens. Using these tokens, we construct an interaction-centric video representation for action anticipation. We term our model InAViT; it achieves state-of-the-art action anticipation performance on the large-scale egocentric datasets EPIC-KITCHENS-100 (EK100) and EGTEA Gaze+. InAViT outperforms other visual transformer-based methods, including those using object-centric video representations. On the EK100 evaluation server, InAViT is the top-performing method on the public leaderboard (at the time of submission), where it outperforms the second-best model by 3.3% on mean top-5 recall.
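To make the idea of cross-attention between hands and objects concrete, the following is a minimal sketch of generic scaled dot-product cross-attention in which hand tokens act as queries and object tokens act as keys and values. It is not the authors' SCA implementation: the function names, the absence of learned projections, and the residual-style refinement step are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention (single head, no learned weights).

    queries: list of query vectors (here: hypothetical hand tokens)
    keys, values: lists of key/value vectors (here: hypothetical object tokens)
    Returns one attended vector per query.
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        attended.append(
            [sum(w * v[c] for w, v in zip(weights, values)) for c in range(len(values[0]))]
        )
    return attended


def refine_hand_tokens(hand_tokens, object_tokens):
    """Assumed refinement: add the attended object context to each hand token."""
    context = cross_attention(hand_tokens, object_tokens, object_tokens)
    return [[h + c for h, c in zip(hand, ctx)] for hand, ctx in zip(hand_tokens, context)]


if __name__ == "__main__":
    hands = [[1.0, 0.0], [0.0, 1.0]]      # two toy hand tokens
    objects = [[0.5, 0.5], [1.0, -1.0]]   # two toy object tokens
    for token in refine_hand_tokens(hands, objects):
        print(token)
```

In a real model the queries, keys, and values would come from learned linear projections of patch or region features, and the attention would typically be multi-headed; the sketch only shows the direction of information flow from object tokens into hand tokens.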

Code

No code implementations yet.

Tasks

Action Anticipation

Human-Object Interaction Detection

Object

Datasets

EPIC-KITCHENS-100

EGTEA

Results from the Paper

Ranked #1 on Action Anticipation on EGTEA

Task	Dataset	Model	Metric Name	Metric Value	Global Rank
Action Anticipation	EGTEA	InAViT	Top-1 Accuracy	67.8	# 1
Action Anticipation	EPIC-KITCHENS-100	InAViT	Recall@5	25.89	# 1
Action Anticipation	EPIC-KITCHENS-100 (test)	InAViT	recall@5	23.75	# 1
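
The Recall@5 figures above can be read with the help of a small sketch. It assumes the metric is class-mean top-5 recall (the abstract refers to "mean-top5 recall"): recall is computed per ground-truth class and then averaged over classes, so rare actions count as much as frequent ones. The function name and the toy data are hypothetical, and this is not the official evaluation code.

```python
from collections import defaultdict


def mean_topk_recall(scores, labels, k=5):
    """Class-mean top-k recall.

    scores: list of per-sample score lists, one score per class
    labels: list of ground-truth class indices, one per sample
    """
    hits = defaultdict(int)    # per-class count of samples whose label is in the top k
    counts = defaultdict(int)  # per-class count of samples
    for row, label in zip(scores, labels):
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        counts[label] += 1
        hits[label] += int(label in top_k)
    per_class = [hits[c] / counts[c] for c in counts]
    return sum(per_class) / len(per_class)


if __name__ == "__main__":
    # Toy example with six classes and two samples.
    toy_scores = [
        [0.9, 0.1, 0.2, 0.3, 0.4, 0.5],  # true class 0 is ranked first
        [0.6, 0.5, 0.4, 0.3, 0.2, 0.1],  # true class 5 is ranked last
    ]
    toy_labels = [0, 5]
    print(mean_topk_recall(toy_scores, toy_labels, k=5))
```

Only classes that appear in the ground truth contribute to the average in this sketch; a leaderboard implementation may restrict or weight the class set differently.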

Methods

Absolute Position Encodings • Adam • BPE • Dense Connections • Dropout • Label Smoothing • Layer Normalization • Linear Layer • Multi-Head Attention • Position-Wise Feed-Forward Layer • Residual Connection • Scaled Dot-Product Attention • Softmax • Transformer
