Temporal Aggregate Representations for Long-Range Video Understanding

ECCV 2020 · Fadime Sener, Dipika Singhania, Angela Yao

Future prediction, especially in long-range videos, requires reasoning from current and past observations. In this work, we address questions of temporal extent, scaling, and level of semantic abstraction with a flexible multi-granular temporal aggregation framework. We show that it is possible to achieve state of the art in both next action and dense anticipation with simple techniques such as max-pooling and attention. To demonstrate the anticipation capabilities of our model, we conduct experiments on Breakfast, 50Salads, and EPIC-Kitchens datasets, where we achieve state-of-the-art results. With minimal modifications, our model can also be extended for video segmentation and action recognition.
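As a rough illustration of what max-pooling at multiple temporal granularities could look like, the sketch below splits a sequence of per-frame feature vectors into K contiguous segments for several values of K and max-pools each segment over time. This is not the authors' implementation: the function name `multi_granular_max_pool`, the segment counts `(2, 3)`, and the toy features are assumptions chosen for illustration, and the attention stage mentioned in the abstract is omitted.

```python
def multi_granular_max_pool(frames, segment_counts=(2, 3)):
    """Max-pool per-frame features at several temporal granularities.

    frames: list of T feature vectors (each a list of D numbers).
    segment_counts: hypothetical choice of how many segments to use per scale.
    Returns a dict mapping each K to a list of K pooled D-dimensional vectors.
    """
    n = len(frames)
    if n < max(segment_counts):
        raise ValueError("need at least as many frames as the largest segment count")

    pooled = {}
    for k in segment_counts:
        # Integer boundaries give K non-empty, contiguous, near-equal segments.
        bounds = [(i * n) // k for i in range(k + 1)]
        pooled[k] = []
        for start, end in zip(bounds[:-1], bounds[1:]):
            segment = frames[start:end]
            # Element-wise maximum over the time axis of this segment.
            pooled[k].append([max(column) for column in zip(*segment)])
    return pooled


# Toy input: 6 frames with 2-dimensional features (illustrative values only).
toy_frames = [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9], [10, 11]]
toy_pooled = multi_granular_max_pool(toy_frames)
```

Coarser scales (small K) summarize long stretches of the video, while finer scales (large K) preserve more of the temporal ordering; a downstream model can then combine the pooled vectors from all scales.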


Results from the Paper

Task                 Dataset      Model    Metric Name       Metric Value  Global Rank
Action Anticipation  Assembly101  TempAgg  Actions Recall@5  8.53          # 2
Action Anticipation  Assembly101  TempAgg  Verbs Recall@5    59.11         # 2
Action Anticipation  Assembly101  TempAgg  Objects Recall@5  26.27         # 2
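
For readers unfamiliar with the metric, a Recall@5 score counts a prediction as correct when the ground-truth class appears among the five highest-scoring classes. The sketch below computes a plain sample-averaged version as a percentage. The function name `top_k_recall` and the toy scores are assumptions for illustration; the benchmark's official evaluation may differ (for example, by averaging recall per class), so this should not be read as the exact protocol behind the numbers above.

```python
def top_k_recall(scores, labels, k=5):
    """Percentage of samples whose true label is among the k top-scoring classes.

    scores: list of per-sample score lists, one score per class.
    labels: list of ground-truth class indices, one per sample.
    """
    if len(scores) != len(labels) or not labels:
        raise ValueError("scores and labels must be non-empty and the same length")

    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score, truncated to the top k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        if label in top_k:
            hits += 1
    return 100.0 * hits / len(labels)


# Toy example with 3 classes and 2 samples (illustrative values only).
toy_scores = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]]
toy_labels = [2, 1]
recall_at_1 = top_k_recall(toy_scores, toy_labels, k=1)
recall_at_2 = top_k_recall(toy_scores, toy_labels, k=2)
```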