Uncertainty-aware Action Decoupling Transformer for Action Anticipation
Human action anticipation aims to predict what people will do in the future based on past observations. In this paper, we introduce the Uncertainty-aware Action Decoupling Transformer (UADT) for action anticipation. Unlike existing methods that directly predict actions in a verb-noun pair format, we decouple the action anticipation task into separate verb and noun anticipation tasks. The objective is to make the two decoupled tasks assist each other and ultimately improve action anticipation. Specifically, we propose a two-stream Transformer-based architecture composed of a verb-to-noun model and a noun-to-verb model. The verb-to-noun model leverages verb information to improve noun prediction, and the noun-to-verb model does the reverse. We extend the model in a probabilistic manner and quantify the predictive uncertainty of each decoupled task to select features. In this way, noun prediction leverages the most informative and redundancy-free verb features, and verb prediction works similarly. Finally, the two streams are combined dynamically based on their uncertainties to make the joint action anticipation. We demonstrate the efficacy of our method by achieving state-of-the-art performance on action anticipation benchmarks including EPIC-KITCHENS, EGTEA Gaze+, and 50-Salads.
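To make the idea of combining two streams "dynamically based on their uncertainties" concrete, here is a minimal sketch. It is not the UADT implementation: the abstract does not specify the uncertainty measure or the fusion rule, so this example assumes that each stream outputs a probability distribution over the same action classes, uses predictive entropy as the uncertainty, and weights each stream by `exp(-entropy)`. The function names (`softmax`, `entropy`, `fuse_streams`) and the example logits are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Predictive entropy in nats; used here as an assumed uncertainty proxy."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def fuse_streams(probs_a, probs_b):
    """Fuse two distributions, giving more weight to the less uncertain one.

    Assumption: weights are exp(-entropy), normalized to sum to 1.
    This is an illustrative rule, not the one used in the paper.
    """
    w_a = math.exp(-entropy(probs_a))
    w_b = math.exp(-entropy(probs_b))
    z = w_a + w_b
    w_a, w_b = w_a / z, w_b / z
    fused = [w_a * pa + w_b * pb for pa, pb in zip(probs_a, probs_b)]
    return fused, (w_a, w_b)


# Hypothetical outputs of the two streams over three action classes.
verb_to_noun = softmax([4.0, 0.0, 0.0])   # confident stream
noun_to_verb = softmax([0.2, 0.1, 0.0])   # uncertain stream

fused, weights = fuse_streams(verb_to_noun, noun_to_verb)
print("stream weights:", weights)
print("fused distribution:", fused)
```

Because the fused output is a convex combination of two distributions, it remains a valid probability distribution, and the more confident stream dominates the result.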
Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
---|---|---|---|---|---|
Action Anticipation | 50-Salads | UADT | Top-1 Accuracy | 62.7 | #1 |
Action Anticipation | EGTEA | UADT | Top-1 Accuracy | 68.4 | #1 |
Action Anticipation | EPIC-KITCHENS-100 | UADT | Recall@5 | 23.0 | #3 |
Action Anticipation | EPIC-KITCHENS-100 | UADT | Top-5 Verb | 43.5 | #2 |
Action Anticipation | EPIC-KITCHENS-100 | UADT | Top-5 Noun | 46.6 | #1 |