WenLan

Introduced by Huo et al. in WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training

WenLan proposes a two-tower pre-training model called BriVL within the cross-modal contrastive learning framework. The cross-modal pre-training model is defined in terms of the image-text retrieval task, so the main goal is to learn two encoders that embed image and text samples into the same space for effective image-text retrieval. To enforce such cross-modal embedding learning, BriVL is trained with contrastive learning using the InfoNCE loss. Given a text embedding, the learning objective is to find the best image embedding from a batch of image embeddings; similarly, given an image embedding, the objective is to find the best text embedding from a batch of text embeddings. The image and text encoders are trained jointly to maximize the cosine similarity between the image and text embeddings of each true pair in the batch while minimizing the cosine similarity of the embeddings of the incorrect pairs.
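
The objective described above can be illustrated with a symmetric in-batch InfoNCE loss. The following is a minimal NumPy sketch, not the authors' implementation: the function names, the temperature value of 0.07, and the equal averaging of the text-to-image and image-to-text directions are all assumptions made for illustration. It takes precomputed embeddings as input and does not model the encoders themselves.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def log_softmax(logits):
    """Row-wise log-softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def symmetric_info_nce(image_emb, text_emb, temperature=0.07):
    """In-batch InfoNCE averaged over both retrieval directions.

    image_emb, text_emb: arrays of shape [batch, dim]; row i of each is a true pair.
    temperature: hypothetical default, chosen only for this sketch.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)

    # logits[i, j] = cosine similarity of image i and text j, scaled by temperature.
    logits = img @ txt.T / temperature
    idx = np.arange(logits.shape[0])

    # Image -> text: for each image, the matching text is the positive.
    loss_i2t = -log_softmax(logits)[idx, idx].mean()
    # Text -> image: for each text, the matching image is the positive.
    loss_t2i = -log_softmax(logits.T)[idx, idx].mean()

    return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.normal(size=(8, 16))
    texts = images + 0.01 * rng.normal(size=(8, 16))  # nearly aligned pairs
    print("aligned loss:   ", symmetric_info_nce(images, texts))
    print("misaligned loss:", symmetric_info_nce(images, np.roll(texts, 1, axis=0)))
```

Minimizing this loss raises the diagonal entries of the similarity matrix (true pairs) relative to the off-diagonal entries (incorrect pairs), which matches the behaviour described in the paragraph above. In this sketch the negatives come only from the current batch.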

Source: WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training

Tasks

Task	Papers	Share
Image Retrieval	3	23.08%
Retrieval	3	23.08%
Benchmarking	1	7.69%
Image Classification	1	7.69%
Cross-Modal Retrieval	1	7.69%
Language Modelling	1	7.69%
Text Classification	1	7.69%
Image Captioning	1	7.69%
Image-to-Text Retrieval	1	7.69%

Categories

Vision and Language Pre-Trained Models