Learning Latent Super-Events to Detect Multiple Activities in Videos

CVPR 2018  ·  AJ Piergiovanni, Michael S. Ryoo

In this paper, we introduce the concept of learning latent super-events from activity videos, and present how it benefits activity detection in continuous videos. We define a super-event as a set of multiple events occurring together in a video with a particular temporal organization; it is the complementary concept to a sub-event. Real-world videos contain multiple activities and are rarely segmented (e.g., surveillance videos), and learning latent super-events allows the model to capture how events are temporally related in such videos. We design temporal structure filters that enable the model to focus on particular sub-intervals of a video, and use them together with a soft attention mechanism to learn representations of latent super-events. Super-event representations are combined with per-frame or per-segment CNN features to provide frame-level annotations. Our approach is fully differentiable, enabling end-to-end learning of latent super-event representations jointly with the activity detector that uses them. Our experiments with multiple public video datasets confirm that the proposed latent super-event learning significantly benefits activity detection, advancing the state of the art.
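The sketch below illustrates the general idea in PyTorch: learnable temporal structure filters (modeled here as Cauchy-shaped weightings over the timeline) softly pool per-frame CNN features into super-event representations, several such filter banks are combined with soft attention, and the result is concatenated with the per-frame features for frame-level classification. The shapes, hyperparameters, and exact filter parameterization are illustrative assumptions for exposition, not the authors' released implementation.

```python
# Minimal sketch, assuming per-frame features of shape (B, T, D), e.g. from I3D.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalStructureFilter(nn.Module):
    """N soft temporal filters: each is a normalized Cauchy-shaped weighting over
    the T frames, with a learnable center and width (an assumed parameterization)."""

    def __init__(self, num_filters: int = 3):
        super().__init__()
        self.center = nn.Parameter(torch.randn(num_filters))     # mapped to (0, 1)
        self.log_width = nn.Parameter(torch.zeros(num_filters))  # mapped to > 0

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, D) per-frame or per-segment CNN features.
        B, T, D = feats.shape
        t = torch.linspace(0, 1, T, device=feats.device)              # (T,)
        center = torch.sigmoid(self.center).unsqueeze(1)              # (N, 1)
        gamma = F.softplus(self.log_width).unsqueeze(1) + 1e-3        # (N, 1)
        # Cauchy-shaped response over the timeline, normalized per filter.
        w = 1.0 / (1.0 + ((t.unsqueeze(0) - center) / gamma) ** 2)    # (N, T)
        w = w / w.sum(dim=1, keepdim=True)
        # Soft temporal pooling: each filter yields one D-dim summary.
        pooled = torch.einsum('nt,btd->bnd', w, feats)                # (B, N, D)
        return pooled.reshape(B, -1)                                  # (B, N*D)


class SuperEventDetector(nn.Module):
    """Per-frame classifier whose input is augmented with a super-event
    representation built from M filter banks combined by soft attention."""

    def __init__(self, feat_dim: int, num_classes: int,
                 num_banks: int = 3, filters_per_bank: int = 3):
        super().__init__()
        self.banks = nn.ModuleList(
            [TemporalStructureFilter(filters_per_bank) for _ in range(num_banks)])
        self.attn = nn.Parameter(torch.zeros(num_banks))              # attention logits
        super_dim = filters_per_bank * feat_dim
        self.classifier = nn.Linear(feat_dim + super_dim, num_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, D) -> per-frame logits (B, T, num_classes).
        reps = torch.stack([bank(feats) for bank in self.banks], dim=1)  # (B, M, N*D)
        alpha = F.softmax(self.attn, dim=0)                              # (M,)
        super_event = torch.einsum('m,bmd->bd', alpha, reps)             # (B, N*D)
        super_event = super_event.unsqueeze(1).expand(-1, feats.size(1), -1)
        return self.classifier(torch.cat([feats, super_event], dim=-1))


if __name__ == "__main__":
    model = SuperEventDetector(feat_dim=1024, num_classes=65)
    frame_feats = torch.randn(2, 128, 1024)   # 2 videos, 128 segments, 1024-d features
    print(model(frame_feats).shape)           # torch.Size([2, 128, 65])
```

Everything is built from differentiable operations (soft pooling, softmax attention, a linear classifier), so the super-event representations can be learned end-to-end with the detector, as described in the abstract. In the paper the super-event representation is class-specific; this sketch uses a single shared representation for brevity.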


Results from the Paper


Task              Dataset        Model                      Metric   Value   Global Rank
Action Detection  Charades       Super-events (RGB+Flow)    mAP      19.41   #14
Action Detection  Multi-THUMOS   I3D + our super-event      mAP      36.4    #6

Methods


No methods listed for this paper.