Uncertainty-aware Action Decoupling Transformer for Action Anticipation

Human action anticipation aims at predicting what people will do in the future based on past observations. In this paper we introduce Uncertainty-aware Action Decoupling Transformer (UADT) for action anticipation. Unlike existing methods that directly predict action in a verb-noun pair format we decouple the action anticipation task into verb and noun anticipations separately. The objective is to make the two decoupled tasks assist each other and eventually improve the action anticipation task. Specifically we propose a two-stream Transformer-based architecture which is composed of a verb-to-noun model and a noun-to-verb model. The verb-to-noun model leverages the verb information to improve the noun prediction and the other way around. We extend the model in a probabilistic manner and quantify the predictive uncertainty of each decoupled task to select features. In this way the noun prediction leverages the most informative and redundancy-free verb features and verb prediction works similarly. Finally the two streams are combined dynamically based on their uncertainties to make the joint action anticipation. We demonstrate the efficacy of our method by achieving state-of-the-art performance on action anticipation benchmarks including EPIC-KITCHENS EGTEA Gaze+ and 50-Salads.

PDF Abstract
No code implementations yet. Submit your code now

Results from the Paper


Task Dataset Model Metric Name Metric Value Global Rank Benchmark
Action Anticipation 50-Salads UADT Top-1 Accuracy 62.7 # 1
Action Anticipation EGTEA UADT Top-1 Accuracy 68.4 # 1
Action Anticipation EPIC-KITCHENS-100 UADT Recall@5 23.0 # 3
Top-5 Verb 43.5 # 2
Top-5 Noun 46.6 # 1

Methods