Vision and Language Pre-Trained Models

FLAVA aims to build a single, holistic, universal model that targets all modalities at once. It is a language and vision alignment model that learns strong representations from both multimodal data (image-text pairs) and unimodal data (unpaired images and text). The model consists of an image encoder transformer that captures unimodal image representations, a text encoder transformer that processes unimodal text, and a multimodal encoder transformer that takes the encoded unimodal image and text as input and integrates their representations for multimodal reasoning. During pretraining, masked image modeling (MIM) and masked language modeling (MLM) losses are applied to the image and text encoders over a single image or a piece of text, respectively, while contrastive, masked multimodal modeling (MMM), and image-text matching (ITM) losses are used over paired image-text data. For downstream tasks, classification heads are applied to the outputs of the image, text, and multimodal encoders for visual recognition, language understanding, and multimodal reasoning tasks, respectively. FLAVA can therefore be applied to a broad scope of tasks across these three domains under a common transformer architecture.

Source: FLAVA: A Foundational Language And Vision Alignment Model
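
As a rough illustration of the three-encoder layout described above, the following PyTorch sketch wires together an image encoder, a text encoder, and a multimodal fusion encoder. The dimensions, patch handling, and module names are illustrative assumptions, not the reference FLAVA implementation.

```python
import torch
import torch.nn as nn


def make_encoder(dim: int, heads: int, layers: int) -> nn.TransformerEncoder:
    """Small batch-first transformer encoder stack."""
    layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=layers)


class FlavaStyleModel(nn.Module):
    """Illustrative three-encoder layout, loosely following the description
    above: unimodal image/text encoders plus a multimodal fusion encoder.
    Sizes and embeddings here are assumptions, not FLAVA's configuration."""

    def __init__(self, dim=768, heads=12, layers=4,
                 vocab_size=30522, patch_dim=16 * 16 * 3):
        super().__init__()
        # Unimodal encoders: ViT-like over image patches, BERT-like over tokens.
        self.patch_embed = nn.Linear(patch_dim, dim)
        self.image_encoder = make_encoder(dim, heads, layers)
        self.token_embed = nn.Embedding(vocab_size, dim)
        self.text_encoder = make_encoder(dim, heads, layers)
        # Multimodal encoder fuses the two unimodal output sequences.
        self.multimodal_encoder = make_encoder(dim, heads, layers)

    def forward(self, patches, token_ids):
        # patches: (B, num_patches, patch_dim); token_ids: (B, seq_len)
        img_states = self.image_encoder(self.patch_embed(patches))
        txt_states = self.text_encoder(self.token_embed(token_ids))
        # Concatenate the unimodal sequences and run the fusion encoder.
        fused = self.multimodal_encoder(torch.cat([img_states, txt_states], dim=1))
        return img_states, txt_states, fused


# Toy forward pass with random inputs.
model = FlavaStyleModel()
patches = torch.randn(2, 196, 16 * 16 * 3)   # 14x14 grid of 16x16 RGB patches
tokens = torch.randint(0, 30522, (2, 32))
img, txt, fused = model(patches, tokens)
print(img.shape, txt.shape, fused.shape)     # (2,196,768) (2,32,768) (2,228,768)
```

In pretraining, MIM and MLM heads would sit on top of the unimodal image and text outputs, while the contrastive, MMM, and ITM objectives would operate on the unimodal projections and the fused output; those heads are omitted here for brevity.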

Tasks


Task                      Papers  Share
Language Modelling        3       27.27%
Decision Making           1       9.09%
Continual Learning        1       9.09%
Video Alignment           1       9.09%
Retrieval                 1       9.09%
Zero-Shot Learning        1       9.09%
Image Retrieval           1       9.09%
Image-to-Text Retrieval   1       9.09%
Visual Reasoning          1       9.09%
