Timeception for Complex Action Recognition

This paper focuses on the temporal aspect of recognizing human activities in videos, an important visual cue that has long been undervalued. We revisit the conventional definition of activity and restrict it to Complex Action: a set of one-actions with a weak temporal pattern that serves a specific purpose. Related works use spatiotemporal 3D convolutions with a fixed kernel size, which is too rigid to capture the variety in temporal extents of complex actions and too short for long-range temporal modeling. In contrast, we use multi-scale temporal convolutions, and we reduce the complexity of 3D convolutions. The outcome is Timeception convolution layers, which reason about minute-long temporal patterns, a factor of 8 longer than the best related works. As a result, Timeception achieves impressive accuracy in recognizing the human activities of Charades, Breakfast Actions, and MultiTHUMOS. Further, we demonstrate that Timeception learns long-range temporal dependencies and tolerates variation in the temporal extents of complex actions.
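To make the idea of multi-scale temporal convolutions concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: the function names, the kernel sizes (3, 5, 7), and the fixed box-filter weights are all assumptions chosen for illustration, and a real layer would learn its weights. Each channel is filtered along time only, which is one plausible way to avoid the cost of a full 3D kernel.

```python
def temporal_conv(seq, kernel):
    """Zero-padded 1D filtering along time; output has the same length as seq.

    Assumes an odd kernel length so the padding is symmetric.
    """
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(seq) + [0.0] * pad
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(seq))
    ]


def multi_scale_temporal(features, kernel_sizes=(3, 5, 7)):
    """Filter every channel at several temporal scales and stack the results.

    features: list of channels, each a list of values over timesteps.
    kernel_sizes: hypothetical scales, not taken from the paper.
    Returns len(kernel_sizes) * len(features) output channels.
    """
    out = []
    for k in kernel_sizes:
        # A box filter stands in for learned weights (illustrative only).
        kernel = [1.0 / k] * k
        for channel in features:
            out.append(temporal_conv(channel, kernel))
    return out


if __name__ == "__main__":
    clip = [[1.0] * 8, [float(t) for t in range(8)]]  # 2 channels, 8 timesteps
    scales = multi_scale_temporal(clip)
    print(len(scales), len(scales[0]))
```

Because each kernel touches only the time axis of a single channel, the weight count grows with the kernel length rather than with the product of kernel volume and channel count, which is the kind of saving the abstract alludes to.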

Published at CVPR 2019.

Results from the Paper

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Video Classification | Breakfast | Timeception | Accuracy (%) | 71.3 | # 7 |
| Long-video Activity Recognition | Breakfast | Timeception (I3D-K400-Pretrain-feature) | mAP | 61.82 | # 7 |
| Action Classification | Charades | Timeception (R3D) | mAP | 41.1 | # 30 |
| Action Classification | Charades | Timeception (I3D) | mAP | 37.2 | # 38 |
| Action Classification | Charades | Timeception (R2D) | mAP | 31.6 | # 41 |
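Several of the results above are reported as mean average precision. As a rough guide to how such a number is usually obtained for multi-label video classification, the sketch below computes average precision per class and averages over classes. The function names and the toy data are assumptions, and the official evaluation scripts of each benchmark may differ in details such as tie handling or the treatment of classes with no positives.

```python
def average_precision(scores, labels):
    """Average precision for one class: mean of precision at each positive."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(score_matrix, label_matrix):
    """Rows are videos, columns are classes; classes with no positives are skipped."""
    n_classes = len(score_matrix[0])
    aps = []
    for c in range(n_classes):
        scores = [row[c] for row in score_matrix]
        labels = [row[c] for row in label_matrix]
        if any(labels):
            aps.append(average_precision(scores, labels))
    return sum(aps) / len(aps) if aps else 0.0


if __name__ == "__main__":
    # Toy example: 3 videos, 2 classes (made-up scores and labels).
    scores = [[0.9, 0.2], [0.8, 0.7], [0.1, 0.4]]
    labels = [[1, 0], [0, 1], [1, 0]]
    print(100 * mean_average_precision(scores, labels))
```

Multiplying by 100 gives a percentage, the scale on which the table's values appear to be reported.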