VidTr: Video Transformer Without Convolutions

We introduce Video Transformer (VidTr) with separable-attention for video classification. Compared with commonly used 3D networks, VidTr is able to aggregate spatio-temporal information via stacked attentions and provide better performance with higher efficiency. We first introduce the vanilla video transformer and show that the transformer module is able to perform spatio-temporal modeling from raw pixels, but with heavy memory usage. We then present VidTr, which reduces the memory cost by 3.3$\times$ while keeping the same performance. To further optimize the model, we propose standard-deviation-based topK pooling for attention ($pool_{topK\_std}$), which reduces the computation by dropping non-informative features along the temporal dimension. VidTr achieves state-of-the-art performance on five commonly used datasets with lower computational requirements, showing both the efficiency and effectiveness of our design. Finally, error analysis and visualization show that VidTr is especially good at predicting actions that require long-term temporal reasoning.
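The abstract names $pool_{topK\_std}$ but does not spell out the selection rule. The sketch below illustrates one plausible reading, stated here as an assumption: each row of a temporal attention matrix is scored by its standard deviation, and only the k highest-scoring temporal positions are kept, since a near-uniform row attends to every frame equally and carries little information. The function name `topk_std_pool`, its signature, and the toy data are hypothetical and are not taken from the authors' code.

```python
from statistics import pstdev


def topk_std_pool(attn, features, k):
    """Keep the k temporal positions whose attention rows vary the most.

    attn:     T x T temporal attention weights, one row per query frame.
    features: T feature vectors, one per frame.
    Returns (kept_indices, pooled_features) with temporal order preserved.

    Assumption: row-wise standard deviation is used as the informativeness
    score. This is an illustrative reading, not the authors' implementation.
    """
    if len(features) != len(attn):
        raise ValueError("attn and features must cover the same frames")
    if not 0 < k <= len(attn):
        raise ValueError("k must be between 1 and the number of frames")
    # A near-uniform attention row has a low standard deviation.
    scores = [pstdev(row) for row in attn]
    # Pick the k highest-scoring rows, then restore the original frame order.
    top = sorted(range(len(attn)), key=lambda i: scores[i], reverse=True)[:k]
    kept = sorted(top)
    return kept, [features[i] for i in kept]


# Toy example: frames 0 and 2 attend uniformly, frames 1 and 3 are peaked.
attn = [
    [0.25, 0.25, 0.25, 0.25],
    [0.70, 0.10, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
    [0.05, 0.05, 0.10, 0.80],
]
features = [[0.0], [1.0], [2.0], [3.0]]
kept, pooled = topk_std_pool(attn, features, k=2)
print(kept)    # [1, 3]
print(pooled)  # [[1.0], [3.0]]
```

In this toy case the temporal length drops from 4 to 2, so any later layer that attends over time works on half as many positions. That is the sense in which dropping low-variance rows reduces computation.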

ICCV 2021

Results from the Paper


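| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Action Classification | Charades | VidTr-L | MAP | 43.5 | #22 |
| Action Classification | Charades | En-VidTr-L | MAP | 47.3 | #15 |
| Action Recognition | HMDB-51 | VidTr-L | Average accuracy of 3 splits | 74.4 | #41 |
| Action Classification | Kinetics-400 | En-VidTr-S | Acc@1 | 79.4 | #103 |
| Action Classification | Kinetics-400 | En-VidTr-S | Acc@5 | 94 | #80 |
| Action Classification | Kinetics-400 | En-VidTr-M | Acc@1 | 79.7 | #101 |
| Action Classification | Kinetics-400 | En-VidTr-M | Acc@5 | 94.2 | #78 |
| Action Classification | Kinetics-400 | En-VidTr-L | Acc@1 | 80.5 | #89 |
| Action Classification | Kinetics-400 | En-VidTr-L | Acc@5 | 94.6 | #61 |
| Action Classification | Kinetics-700 | VidTr-S | Top-1 Accuracy | 67.3 | #27 |
| Action Classification | Kinetics-700 | VidTr-S | Top-5 Accuracy | 87.7 | #14 |
| Action Classification | Kinetics-700 | VidTr-M | Top-1 Accuracy | 69.5 | #25 |
| Action Classification | Kinetics-700 | VidTr-M | Top-5 Accuracy | 88.3 | #13 |
| Action Classification | Kinetics-700 | VidTr-L | Top-1 Accuracy | 70.2 | #24 |
| Action Classification | Kinetics-700 | VidTr-L | Top-5 Accuracy | 89 | #12 |
| Action Classification | Kinetics-700 | En-VidTr-L | Top-1 Accuracy | 70.8 | #22 |
| Action Classification | Kinetics-700 | En-VidTr-L | Top-5 Accuracy | 89.4 | #11 |
| Action Recognition | Something-Something V2 | VidTr-L | Top-1 Accuracy | 60.2 | #111 |
| Action Recognition | UCF101 | VidTr-L | 3-fold Accuracy | 96.7 | #30 |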

Methods