Vision and Language Pre-Trained Models

Edit

Computer Vision • 31 methods

Involves models that adapt pre-training to the field of Vision-and-Language (V-L) learning and improve the performance on downstream tasks like visual question answering and visual captioning.

According to Du et al. (2022), information coming from the different modalities can be encoded in three ways: fusion encoder, dual encoder, and a combination of both.

References:

Methods

Add a Method

Method	Year	Papers
ALIGN Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision	2021	2282
CLIP Learning Transferable Visual Models From Natural Language Supervision	2021	1650
BLIP BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation	2022	51
LXMERT LXMERT: Learning Cross-Modality Encoder Representations from Transformers	2019	39
ViLBERT ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks	2019	30
OSCAR Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks	2020	23
VisualBERT VisualBERT: A Simple and Performant Baseline for Vision and Language	2019	22
OFA OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework	2022	21
ViLT ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision	2021	18
ALBEF Align before Fuse: Vision and Language Representation Learning with Momentum Distillation	2021	14
FLAVA FLAVA: A Foundational Language And Vision Alignment Model	2021	9
Florence Florence: A New Foundation Model for Computer Vision	2021	6
Visual Parsing Probing Inter-modality: Visual Parsing with Self-Attention for Vision-and-Language Pre-training	2021	5
InternVideo	2000	5
VL-BERT VL-BERT: Pre-training of Generic Visual-Linguistic Representations	2019	4
UNIMO UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning	2020	4
VL-T5 Unifying Vision-and-Language Tasks via Text Generation	2021	4
WenLan WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training	2021	4
SOHO Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning	2021	3
SimVLM SimVLM: Simple Visual Language Model Pretraining with Weak Supervision	2021	3
AltCLIP AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities	2022	3
PLIP Leveraging medical Twitter to build a visual–language foundation model for pathology AI	2023	3
Pixel-BERT Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers	2020	2
Kaleido-BERT	2000	2
FashionCLIP Contrastive language and vision learning of general fashion concepts	2022	2
InterBERT InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining	2020	1
Unified VLP Unified Vision-Language Pre-Training for Image Captioning and VQA	2019	1
XGPT XGPT: Cross-modal Generative Pre-Training for Image Captioning	2020	1
VLMo VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts	2021	1
OneR Unifying Vision-Language Representation Space with Single-tower Transformer	2022	1

Vision and Language Pre-Trained Models Edit

Methods Add a Method

Vision and Language Pre-Trained Models

Edit

Methods

Add a Method