VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

We present a framework for learning multimodal representations from unlabeled data using convolution-free Transformer architectures. Specifically, our Video-Audio-Text Transformer (VATT) takes raw signals as inputs and extracts multimodal representations that are rich enough to benefit a variety of downstream tasks...
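
A minimal sketch of the idea described above: each modality gets its own tokenizer that projects raw patches into a common embedding space, and a shared, convolution-free Transformer encoder produces the representation. The sketch uses PyTorch; the class names, patch sizes, and dimensions are illustrative assumptions, not the authors' implementation.

# Illustrative sketch only -- not the official VATT code.
import torch
import torch.nn as nn

class ModalityTokenizer(nn.Module):
    """Projects flattened raw patches of one modality to a shared width d_model."""
    def __init__(self, patch_dim: int, d_model: int):
        super().__init__()
        self.proj = nn.Linear(patch_dim, d_model)

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, num_patches, patch_dim)
        return self.proj(patches)

class SharedTransformerEncoder(nn.Module):
    """Convolution-free Transformer backbone shared across modalities (assumption)."""
    def __init__(self, d_model: int = 256, nhead: int = 8, num_layers: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # Mean-pool the token sequence into a single representation vector.
        return self.encoder(tokens).mean(dim=1)

# Hypothetical patch dimensions: 4x16x16x3 video cubes, 128-sample audio
# segments, and 300-dimensional text token embeddings.
video_tok = ModalityTokenizer(patch_dim=4 * 16 * 16 * 3, d_model=256)
audio_tok = ModalityTokenizer(patch_dim=128, d_model=256)
text_tok  = ModalityTokenizer(patch_dim=300, d_model=256)
backbone  = SharedTransformerEncoder(d_model=256)

video = torch.randn(2, 49, 4 * 16 * 16 * 3)   # (batch, patches, patch_dim)
audio = torch.randn(2, 100, 128)
text  = torch.randn(2, 20, 300)

z_video = backbone(video_tok(video))          # (2, 256) per-modality embeddings
z_audio = backbone(audio_tok(audio))
z_text  = backbone(text_tok(text))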

Results from the Paper


Ranked #1 on Action Classification on Moments in Time (using extra training data)

TASK                   DATASET          MODEL       METRIC NAME     METRIC VALUE  GLOBAL RANK
Audio Classification   AudioSet         VATT-Base   Test mAP        0.394         #2
Audio Classification   AudioSet         VATT-Base   AUC             0.971         #1
Audio Classification   AudioSet         VATT-Base   d-prime         2.895         #1
Action Classification  Kinetics-400     VATT-Large  Vid acc@1       82.1          #6
Action Classification  Kinetics-400     VATT-Large  Vid acc@5       95.5          #2
Action Classification  Kinetics-600     VATT-Large  Top-1 Accuracy  83.6          #5
Action Classification  Kinetics-600     VATT-Large  Top-5 Accuracy  96.6          #1
Action Classification  Moments in Time  VATT-Large  Top 1 Accuracy  41.1          #1
Action Classification  Moments in Time  VATT-Large  Top 5 Accuracy  67.7          #1

Methods used in the Paper


METHOD                        TYPE
Softmax                       Output Functions
Label Smoothing               Regularization
Layer Normalization           Normalization
Residual Connection           Skip Connections
BPE                           Subword Segmentation
Multi-Head Attention          Attention Modules
Adam                          Stochastic Optimization
Dropout                       Regularization
Dense Connections             Feedforward Networks
Scaled Dot-Product Attention  Attention Mechanisms
Transformer                   Transformers
Vision Transformer            Image Models
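
For context, several of the components listed above (scaled dot-product attention inside multi-head attention, layer normalization, residual connections, dropout, and dense feed-forward layers) combine into a standard Transformer encoder block. The following is a generic pre-norm sketch in PyTorch with illustrative hyperparameters; it is not the paper's code.

# Generic Transformer encoder block built from the components listed above.
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model: int = 256, nhead: int = 8, d_ff: int = 1024, p: float = 0.1):
        super().__init__()
        # Multi-head attention internally uses scaled dot-product attention.
        self.attn = nn.MultiheadAttention(d_model, nhead, dropout=p, batch_first=True)
        # Position-wise feed-forward (dense) layers.
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(p)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pre-norm multi-head self-attention with a residual connection.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + self.drop(attn_out)
        # Feed-forward sublayer with layer norm, dropout, and a residual connection.
        x = x + self.drop(self.ff(self.norm2(x)))
        return x

x = torch.randn(2, 50, 256)            # (batch, tokens, d_model)
print(EncoderBlock()(x).shape)         # torch.Size([2, 50, 256])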