Co-training Transformer with Videos and Images Improves Action Recognition

14 Dec 2021 · BoWen Zhang, Jiahui Yu, Christopher Fifty, Wei Han, Andrew M. Dai, Ruoming Pang, Fei Sha

In learning action recognition, models are typically pre-trained on object recognition with images, such as ImageNet, and later fine-tuned on the target action recognition task with videos. This approach has achieved good empirical performance, especially with recent transformer-based video architectures. While many recent works aim to design more advanced transformer architectures for action recognition, less effort has gone into how to train video transformers. In this work, we explore several training paradigms and present two findings. First, video transformers benefit from joint training on diverse video datasets and label spaces (e.g., Kinetics is appearance-focused while SomethingSomething is motion-focused). Second, by further co-training with images (as single-frame videos), the video transformers learn even better video representations. We term this approach Co-training Videos and Images for Action Recognition (CoVeR). In particular, when pre-trained on ImageNet-21K based on the TimeSFormer architecture, CoVeR improves Kinetics-400 Top-1 Accuracy by 2.4%, Kinetics-600 by 2.3%, and SomethingSomething-v2 by 2.3%. When pre-trained on larger-scale image datasets following previous state-of-the-art, CoVeR achieves best results on Kinetics-400 (87.2%), Kinetics-600 (87.9%), Kinetics-700 (79.8%), SomethingSomething-v2 (70.9%), and Moments-in-Time (46.1%), with a simple spatio-temporal video transformer.
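The two ideas in the abstract (joint training over datasets with different label spaces, and treating an image as a single-frame video) can be illustrated with a small data-routing sketch. This is not the authors' implementation: the `ToyDataset` class, the dataset names, the label-space sizes, the uniform sampling over datasets, and the one-head-per-dataset mapping are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class ToyDataset:
    name: str          # hypothetical dataset identifier
    num_classes: int   # size of this dataset's own label space
    is_image: bool     # True for image datasets, False for video datasets
    samples: list      # (input, label) pairs


def to_clip(x, is_image):
    """Represent every input as a list of frames.

    An image is wrapped as a video with exactly one frame.
    """
    return [x] if is_image else list(x)


def cotrain_stream(datasets, steps, seed=0):
    """Yield (dataset_name, clip, label) drawn from a mix of datasets.

    Assumption: each step samples one dataset uniformly at random.
    """
    rng = random.Random(seed)
    for _ in range(steps):
        ds = rng.choice(datasets)
        x, label = rng.choice(ds.samples)
        yield ds.name, to_clip(x, ds.is_image), label


frame = [[0.0, 0.0], [0.0, 0.0]]  # a 2x2 toy "frame"
datasets = [
    ToyDataset("appearance_videos", 4, False, [([frame] * 8, 1), ([frame] * 8, 3)]),
    ToyDataset("motion_videos", 3, False, [([frame] * 8, 0), ([frame] * 8, 2)]),
    ToyDataset("images", 5, True, [(frame, 4), (frame, 2)]),
]

# Assumption: a shared backbone feeds one classification head per dataset,
# so each label is only ever scored against its own label space.
heads = {ds.name: ds.num_classes for ds in datasets}

for name, clip, label in cotrain_stream(datasets, steps=5):
    print(name, "frames:", len(clip), "label:", label, "head size:", heads[name])
```

Because every sample reaches the model as a list of frames, the same network can consume both videos and images; only the number of frames differs.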


Results from the Paper


Ranked #8 on Action Classification on MiT (using extra training data)

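| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Action Classification | Kinetics-400 | CoVeR (JFT-300M) | Acc@1 | 86.3 | # 43 |
| Action Classification | Kinetics-400 | CoVeR (JFT-300M) | Acc@5 | 97.2 | # 29 |
| Action Classification | Kinetics-400 | CoVeR (JFT-3B) | Acc@1 | 87.2 | # 31 |
| Action Classification | Kinetics-400 | CoVeR (JFT-3B) | Acc@5 | 97.5 | # 23 |
| Action Classification | Kinetics-600 | CoVeR (JFT-3B) | Top-1 Accuracy | 87.9 | # 23 |
| Action Classification | Kinetics-600 | CoVeR (JFT-3B) | Top-5 Accuracy | 97.8 | # 12 |
| Action Classification | Kinetics-600 | CoVeR (JFT-300M) | Top-1 Accuracy | 86.8 | # 26 |
| Action Classification | Kinetics-600 | CoVeR (JFT-300M) | Top-5 Accuracy | 97.3 | # 14 |
| Action Classification | Kinetics-700 | CoVeR (JFT-300M) | Top-1 Accuracy | 78.5 | # 18 |
| Action Classification | Kinetics-700 | CoVeR (JFT-300M) | Top-5 Accuracy | 94.2 | # 9 |
| Action Classification | Kinetics-700 | CoVeR (JFT-3B) | Top-1 Accuracy | 79.8 | # 15 |
| Action Classification | Kinetics-700 | CoVeR (JFT-3B) | Top-5 Accuracy | 94.9 | # 6 |
| Action Classification | MiT | CoVeR (JFT-300M) | Top 1 Accuracy | 45.0 | # 9 |
| Action Classification | MiT | CoVeR (JFT-300M) | Top 5 Accuracy | 73.9 | # 5 |
| Action Classification | MiT | CoVeR (JFT-3B) | Top 1 Accuracy | 46.1 | # 8 |
| Action Classification | MiT | CoVeR (JFT-3B) | Top 5 Accuracy | 75.4 | # 4 |
| Action Recognition | Something-Something V2 | CoVeR (JFT-3B) | Top-1 Accuracy | 70.9 | # 33 |
| Action Recognition | Something-Something V2 | CoVeR (JFT-3B) | Top-5 Accuracy | 92.5 | # 26 |
| Action Recognition | Something-Something V2 | CoVeR (JFT-300M) | Top-1 Accuracy | 69.8 | # 39 |
| Action Recognition | Something-Something V2 | CoVeR (JFT-300M) | Top-5 Accuracy | 91.9 | # 31 |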
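
To make the effect of the pre-training set easier to read off the leaderboard entries, the short script below computes the top-1 gap between the JFT-3B and JFT-300M variants on each benchmark. The numbers are copied from the results above; the dictionary layout and variable names are illustrative assumptions.

```python
# Top-1 accuracy (%) per benchmark, copied from the results listed above.
top1 = {
    "Kinetics-400": {"JFT-300M": 86.3, "JFT-3B": 87.2},
    "Kinetics-600": {"JFT-300M": 86.8, "JFT-3B": 87.9},
    "Kinetics-700": {"JFT-300M": 78.5, "JFT-3B": 79.8},
    "MiT": {"JFT-300M": 45.0, "JFT-3B": 46.1},
    "Something-Something V2": {"JFT-300M": 69.8, "JFT-3B": 70.9},
}

# Gap in percentage points, rounded to one decimal to hide float noise.
gains = {
    dataset: round(scores["JFT-3B"] - scores["JFT-300M"], 1)
    for dataset, scores in top1.items()
}

for dataset, gain in gains.items():
    print(f"{dataset}: +{gain} points with JFT-3B pre-training")
```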
