VATT

Video-Audio-Text Transformer, or VATT, is a framework for learning multimodal representations from unlabeled data using convolution-free Transformer architectures. Specifically, it takes raw signals as inputs and extracts multidimensional representations that are rich enough to benefit a variety of downstream tasks. VATT borrows the exact architecture of BERT and ViT, except for the tokenization and linear-projection layer, which is kept separate for each modality. The design follows the same spirit as ViT: make minimal changes to the architecture so that the learned model can transfer its weights to various frameworks and tasks.
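
As a concrete illustration of this design, the sketch below (in PyTorch) shows modality-specific tokenization and linear-projection layers feeding a shared Transformer encoder. This is a minimal sketch, not the official implementation: the class name `VATTBackbone`, the patch dimensions, and the vocabulary size are illustrative assumptions, and positional embeddings and special tokens are omitted for brevity.

```python
# A minimal PyTorch sketch of per-modality projection into a shared
# Transformer encoder. Dimensions and names are illustrative assumptions;
# positional embeddings and special tokens are omitted for brevity.
import torch
import torch.nn as nn


class VATTBackbone(nn.Module):                       # hypothetical name
    def __init__(self, d_model=768, num_layers=12, num_heads=12,
                 video_patch_dim=4 * 16 * 16 * 3,    # e.g. a flattened 4x16x16 RGB voxel
                 audio_patch_dim=128,                 # e.g. a flattened waveform segment
                 vocab_size=30522):                   # e.g. a word-piece vocabulary
        super().__init__()
        # Tokenization / linear projection is the only modality-specific part.
        self.video_proj = nn.Linear(video_patch_dim, d_model)
        self.audio_proj = nn.Linear(audio_patch_dim, d_model)
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # A standard Transformer encoder, as in BERT/ViT.
        layer = nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, video_patches=None, audio_patches=None, text_tokens=None):
        out = {}
        if video_patches is not None:   # (B, N_v, video_patch_dim)
            out["video"] = self.encoder(self.video_proj(video_patches))
        if audio_patches is not None:   # (B, N_a, audio_patch_dim)
            out["audio"] = self.encoder(self.audio_proj(audio_patches))
        if text_tokens is not None:     # (B, N_t) integer ids
            out["text"] = self.encoder(self.text_embed(text_tokens))
        return out
```

A single encoder is reused across modalities here for brevity; the paper also evaluates backbones with separate, modality-specific Transformers.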

VATT linearly projects each modality into a feature vector and feeds it into a Transformer encoder. A semantically hierarchical common space is defined to account for the different granularities of the modalities, and noise contrastive estimation (NCE) is employed to train the model.
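
The common-space projections and the contrastive objective can be sketched as follows, again in PyTorch. The projection dimensions, the module and function names, and the temperature are assumptions; a single symmetric NCE term per pair of modalities is shown, whereas the paper uses a multiple-instance variant (MIL-NCE) for the noisier video-text pair.

```python
# A minimal sketch of the hierarchical common space and an NCE-style loss.
# Dimensions, names, and the temperature are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HierarchicalCommonSpace(nn.Module):             # hypothetical name
    def __init__(self, d_model=768, d_va=512, d_vt=256):
        super().__init__()
        # Video and audio are compared in a coarser video-audio space ...
        self.video_to_va = nn.Linear(d_model, d_va)
        self.audio_to_va = nn.Linear(d_model, d_va)
        # ... while video and text meet in a finer space reached from it.
        self.va_to_vt = nn.Linear(d_va, d_vt)
        self.text_to_vt = nn.Linear(d_model, d_vt)

    def forward(self, video_feat, audio_feat, text_feat):
        z_v_va = self.video_to_va(video_feat)         # (B, d_va)
        z_a_va = self.audio_to_va(audio_feat)         # (B, d_va)
        z_v_vt = self.va_to_vt(z_v_va)                # (B, d_vt)
        z_t_vt = self.text_to_vt(text_feat)           # (B, d_vt)
        return (z_v_va, z_a_va), (z_v_vt, z_t_vt)


def nce_loss(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE over a batch: matching pairs are positives,
    every other pairing in the batch acts as a negative."""
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature              # (B, B) similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

In use, the per-modality encoder outputs would first be pooled into one vector each (for example by averaging over tokens), projected as above, and combined as `nce_loss(z_v_va, z_a_va) + nce_loss(z_v_vt, z_t_vt)`.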

Source: VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

Components


Component: Vision Transformer
Type: Image Models

Categories

Vision Transformers