ViViT: A Video Vision Transformer

We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification. Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers.
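The tokenization step described above can be sketched in code. The sketch below illustrates the "tubelet embedding" idea: the video is split into non-overlapping t × h × w spatio-temporal patches, each flattened and linearly projected to a token. This is a minimal NumPy illustration, not the paper's implementation; the function name and the random projection weights are assumptions (the real model learns the projection).

```python
import numpy as np

def tubelet_embed(video, t=2, h=16, w=16, d=768, rng=None):
    """Split a (T, H, W, C) video into non-overlapping t x h x w tubelets
    and project each flattened tubelet to a d-dimensional token.

    Illustrative sketch: the projection matrix is random here, whereas
    the actual model learns it end to end."""
    rng = rng or np.random.default_rng(0)
    T, H, W, C = video.shape
    nt, nh, nw = T // t, H // h, W // w
    # Rearrange into (num_tubelets, t*h*w*C) flattened patches.
    tubes = (video[: nt * t, : nh * h, : nw * w]
             .reshape(nt, t, nh, h, nw, w, C)
             .transpose(0, 2, 4, 1, 3, 5, 6)
             .reshape(nt * nh * nw, t * h * w * C))
    E = rng.standard_normal((t * h * w * C, d)).astype(np.float32)
    return tubes @ E  # (num_tokens, d), fed to the transformer layers

# A 32-frame 224x224 RGB clip with 16x16x2 tubelets (as in ViViT-L/16x2)
# yields 16 * 14 * 14 = 3136 tokens.
video = np.zeros((32, 224, 224, 3), dtype=np.float32)
tokens = tubelet_embed(video)
print(tokens.shape)  # (3136, 768)
```

The "16x2" in the model names below refers to exactly these patch sizes: 16×16 spatial patches spanning 2 frames each.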


Results from the Paper


 Ranked #1 on Action Classification on Kinetics-600 (using extra training data)

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Action Recognition | EPIC-KITCHENS-100 | ViViT-L/16x2 Fact. encoder | Action@1 | 44.0 | #4 |
| | | | Verb@1 | 66.4 | #5 |
| | | | Noun@1 | 56.8 | #2 |
| Action Classification | Kinetics-400 | ViViT-H/16x2 (JFT) | Vid acc@1 | 84.8 | #1 |
| | | | Vid acc@5 | 95.8 | #1 |
| Action Classification | Kinetics-400 | ViViT-L/16x2 320 | Vid acc@1 | 81.3 | #8 |
| | | | Vid acc@5 | 94.7 | #9 |
| Action Classification | Kinetics-600 | ViViT-L/16x2 (320x320) | Top-1 Accuracy | 83.0 | #8 |
| | | | Top-5 Accuracy | 95.7 | #7 |
| Action Classification | Kinetics-600 | ViViT-L/16x2 (JFT) | Top-1 Accuracy | 84.3 | #3 |
| | | | Top-5 Accuracy | 96.2 | #5 |
| Action Classification | Kinetics-600 | ViViT-H/16x2 (JFT) | Top-1 Accuracy | 85.8 | #1 |
| | | | Top-5 Accuracy | 96.5 | #2 |
| Action Classification | Kinetics-600 | ViViT-L/16x2 | Top-1 Accuracy | 82.5 | #10 |
| | | | Top-5 Accuracy | 95.6 | #10 |
| Action Classification | Moments in Time | ViViT-L/16x2 | Top-1 Accuracy | 38.0 | #4 |
| | | | Top-5 Accuracy | 64.9 | #2 |
| Action Recognition | Something-Something V2 | ViViT-L/16x2 Fact. encoder | Top-1 Accuracy | 65.4 | #13 |
| | | | Top-5 Accuracy | 89.8 | #14 |

Models marked (JFT) are pretrained on the JFT extra training data.
