VATT

Video-Audio-Text Transformer, or VATT, is a framework for learning multimodal representations from unlabeled data using convolution-free Transformer architectures. Specifically, it takes raw signals as inputs and extracts multidimensional representations that are rich enough to benefit a variety of downstream tasks. VATT borrows the exact architecture of BERT and ViT, except for the tokenization and linear-projection layer, which is kept separate for each modality. The design follows the same spirit as ViT: make minimal changes to the architecture so that the learned model can transfer its weights to various frameworks and tasks.
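
As a concrete illustration of this design, the sketch below (in PyTorch) shows modality-specific tokenization and linear-projection layers feeding a shared Transformer encoder. This is a minimal sketch, not the official implementation: the class name `VATTBackbone`, the patch dimensions, and the vocabulary size are illustrative assumptions, and positional embeddings and special tokens are omitted for brevity.

```python
# A minimal PyTorch sketch of per-modality projection into a shared
# Transformer encoder. Dimensions and names are illustrative assumptions;
# positional embeddings and special tokens are omitted for brevity.
import torch
import torch.nn as nn


class VATTBackbone(nn.Module):                       # hypothetical name
    def __init__(self, d_model=768, num_layers=12, num_heads=12,
                 video_patch_dim=4 * 16 * 16 * 3,    # e.g. a flattened 4x16x16 RGB voxel
                 audio_patch_dim=128,                 # e.g. a flattened waveform segment
                 vocab_size=30522):                   # e.g. a word-piece vocabulary
        super().__init__()
        # Tokenization / linear projection is the only modality-specific part.
        self.video_proj = nn.Linear(video_patch_dim, d_model)
        self.audio_proj = nn.Linear(audio_patch_dim, d_model)
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # A standard Transformer encoder, as in BERT/ViT.
        layer = nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, video_patches=None, audio_patches=None, text_tokens=None):
        out = {}
        if video_patches is not None:   # (B, N_v, video_patch_dim)
            out["video"] = self.encoder(self.video_proj(video_patches))
        if audio_patches is not None:   # (B, N_a, audio_patch_dim)
            out["audio"] = self.encoder(self.audio_proj(audio_patches))
        if text_tokens is not None:     # (B, N_t) integer ids
            out["text"] = self.encoder(self.text_embed(text_tokens))
        return out
```

A single encoder is reused across modalities here for brevity; the paper also evaluates backbones with separate, modality-specific Transformers.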

VATT linearly projects each modality into a feature vector and feeds it into a Transformer encoder. A semantically hierarchical common space is defined to account for the different granularities of the modalities, and noise contrastive estimation (NCE) is employed to train the model.
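
The common-space projections and the contrastive objective can be sketched as follows, again in PyTorch. The projection dimensions, the module and function names, and the temperature are assumptions; a single symmetric NCE term per pair of modalities is shown, whereas the paper uses a multiple-instance variant (MIL-NCE) for the noisier video-text pair.

```python
# A minimal sketch of the hierarchical common space and an NCE-style loss.
# Dimensions, names, and the temperature are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HierarchicalCommonSpace(nn.Module):             # hypothetical name
    def __init__(self, d_model=768, d_va=512, d_vt=256):
        super().__init__()
        # Video and audio are compared in a coarser video-audio space ...
        self.video_to_va = nn.Linear(d_model, d_va)
        self.audio_to_va = nn.Linear(d_model, d_va)
        # ... while video and text meet in a finer space reached from it.
        self.va_to_vt = nn.Linear(d_va, d_vt)
        self.text_to_vt = nn.Linear(d_model, d_vt)

    def forward(self, video_feat, audio_feat, text_feat):
        z_v_va = self.video_to_va(video_feat)         # (B, d_va)
        z_a_va = self.audio_to_va(audio_feat)         # (B, d_va)
        z_v_vt = self.va_to_vt(z_v_va)                # (B, d_vt)
        z_t_vt = self.text_to_vt(text_feat)           # (B, d_vt)
        return (z_v_va, z_a_va), (z_v_vt, z_t_vt)


def nce_loss(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE over a batch: matching pairs are positives,
    every other pairing in the batch acts as a negative."""
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature              # (B, B) similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

In use, the per-modality encoder outputs would first be pooled into one vector each (for example by averaging over tokens), projected as above, and combined as `nce_loss(z_v_va, z_a_va) + nce_loss(z_v_vt, z_t_vt)`.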

Source: VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

Components


Component: Vision Transformer
Type: Image Models

Categories

Vision Transformers