Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification

1 Dec 2020  ·  Youngwan Lee, Hyung-Il Kim, Kimin Yun, Jinyoung Moon

Recent video classification research has focused on two directions: temporal modeling and efficient 3D architectures. However, temporal modeling methods are not efficient, and efficient 3D architectures pay little attention to temporal modeling. To bridge this gap, we propose an efficient temporal modeling 3D architecture, called VoV3D, that consists of a temporal one-shot aggregation (T-OSA) module and a depthwise factorized component, D(2+1)D. T-OSA builds a feature hierarchy by aggregating temporal features with different temporal receptive fields. Stacking T-OSA modules enables the network itself to model short-range as well as long-range temporal relationships across frames without any external modules. Inspired by kernel factorization and channel factorization, we also design a depthwise spatiotemporal factorization module, named D(2+1)D, that decomposes a 3D depthwise convolution into a spatial depthwise convolution and a temporal depthwise convolution, making our network more lightweight and efficient. Using the proposed temporal modeling method (T-OSA) and the efficient factorized component (D(2+1)D), we construct two types of VoV3D networks, VoV3D-M and VoV3D-L. Thanks to its efficient and effective temporal modeling, VoV3D-L has 6x fewer model parameters and 16x less computation than a state-of-the-art temporal modeling method while surpassing it on both Something-Something and Kinetics-400. Furthermore, VoV3D shows better temporal modeling ability than X3D, a state-of-the-art efficient 3D architecture with comparable model capacity. We hope that VoV3D can serve as a baseline for efficient video classification.
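To make the two ideas in the abstract concrete, here is a minimal counting sketch in plain Python. It is not the authors' implementation. It assumes 3x3x3 kernels, no bias terms, and stride-1 temporal convolutions, none of which are stated on this page. The function names, the channel count of 64, and the stack depth of 3 are hypothetical values chosen for illustration. The first two functions compare the weight count of one 3D depthwise convolution with that of a spatial depthwise convolution followed by a temporal depthwise convolution. The third lists the temporal receptive field after each layer in a stack, which is the kind of multi-range feature set an aggregation module such as T-OSA could combine.

```python
from typing import List


def depthwise_3d_params(channels: int, k_t: int = 3, k_s: int = 3) -> int:
    """Weights in one depthwise k_t x k_s x k_s 3D convolution (no bias)."""
    return channels * k_t * k_s * k_s


def factorized_depthwise_params(channels: int, k_t: int = 3, k_s: int = 3) -> int:
    """Weights in a depthwise 1 x k_s x k_s spatial convolution plus
    a depthwise k_t x 1 x 1 temporal convolution (no bias)."""
    spatial = channels * k_s * k_s
    temporal = channels * k_t
    return spatial + temporal


def stacked_temporal_receptive_fields(num_layers: int, k_t: int = 3) -> List[int]:
    """Temporal receptive field, in frames, after each of num_layers
    stacked stride-1 temporal convolutions with kernel size k_t."""
    return [1 + i * (k_t - 1) for i in range(1, num_layers + 1)]


if __name__ == "__main__":
    channels = 64  # hypothetical channel count, not taken from the paper
    print(depthwise_3d_params(channels))          # 1728
    print(factorized_depthwise_params(channels))  # 768
    print(stacked_temporal_receptive_fields(3))   # [3, 5, 7]
```

Under these assumptions the factorized form needs 12 weights per channel instead of 27, and three stacked temporal layers see 3, 5, and 7 frames respectively.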


Results from the Paper


Ranked #28 on Action Recognition on Something-Something V1 (using extra training data)

Global rank on each benchmark is given in parentheses after the metric value.

| Task | Dataset | Model | Top-1 Accuracy | Top-5 Accuracy | Parameters | GFLOPs |
|---|---|---|---|---|---|---|
| Action Recognition | Something-Something V1 | VoV3D-L (32frames, Kinetics pretrained, single) | 54.59 (#28) | 82.30 (#18) | 5.8M (#70) | 20.9x6 (#1) |
| Action Recognition | Something-Something V1 | VoV3D-L (16frames, from scratch, single) | 49.5 (#54) | 78.0 (#31) | 5.8M (#70) | 9.3x6 (#1) |
| Action Recognition | Something-Something V1 | VoV3D-M (32frames, from scratch, single) | 49.8 (#51) | 78.0 (#31) | 3.3M (#67) | 11.5x6 (#1) |
| Action Recognition | Something-Something V1 | VoV3D-L (32frames, from scratch, single) | 50.6 (#49) | 78.7 (#27) | 5.8M (#70) | 20.9x6 (#1) |
| Action Recognition | Something-Something V1 | VoV3D-M (32frames, Kinetics pretrained, single) | 52.68 (#39) | 80.43 (#24) | 3.3M (#67) | 11.5x6 (#1) |
| Action Recognition | Something-Something V1 | VoV3D-M (16frames, from scratch, single) | 48.1 (#61) | 76.9 (#34) | 3.3M (#67) | 5.7x6 (#1) |
| Action Recognition | Something-Something V2 | VoV3D-L (16frames, from scratch, single) | 64.1 (#94) | 88.6 (#77) | 5.8M (#2) | 9.3x6 (#6) |
| Action Recognition | Something-Something V2 | VoV3D-M (16frames, from scratch, single) | 63.2 (#97) | 88.2 (#79) | 3.3M (#8) | 5.7x6 (#6) |
| Action Recognition | Something-Something V2 | VoV3D-M (32frames, from scratch, single) | 64.2 (#93) | 88.8 (#76) | 3.3M (#8) | 11.5x6 (#6) |
| Action Recognition | Something-Something V2 | VoV3D-L (32frames, from scratch, single) | 65.8 (#82) | 89.5 (#68) | 5.8M (#2) | 20.9x6 (#6) |
| Action Recognition | Something-Something V2 | VoV3D-M (32frames, Kinetics pretrained, single) | 65.24 (#87) | 89.48 (#69) | 3.3M (#8) | 11.5x6 (#6) |
| Action Recognition | Something-Something V2 | VoV3D-L (32frames, Kinetics pretrained, single) | 67.35 (#63) | 90.50 (#54) | 5.8M (#2) | 20.9x6 (#6) |
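
The GFLOPs entries above are written as a value followed by a multiplier, for example 20.9x6. A reasonable reading, which this page does not state explicitly, is per-view cost times the number of views (clips times crops) evaluated per video. Under that assumption, the total inference cost per video can be computed as in the short sketch below; the helper name `total_gflops` is hypothetical.

```python
def total_gflops(entry: str) -> float:
    """Parse an entry such as '20.9x6' and return per-view GFLOPs times views.

    Assumes the part after 'x' is an integer view count (an interpretation,
    not something confirmed by the results page).
    """
    per_view, views = entry.lower().split("x")
    return round(float(per_view) * int(views), 2)


if __name__ == "__main__":
    for entry in ["5.7x6", "9.3x6", "11.5x6", "20.9x6"]:
        print(entry, "->", total_gflops(entry))
```

Read this way, the largest configuration listed (20.9x6) costs about 125.4 GFLOPs per video, and the smallest (5.7x6) about 34.2 GFLOPs.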

Methods