Vision and Language Pre-Trained Models

BriVL is a two-tower pre-training model built within a cross-modal contrastive learning framework. The pre-training objective is defined on the image-text retrieval task: the goal is to learn two encoders that embed image and text samples into the same space so that matched pairs can be retrieved effectively. To enforce this cross-modal embedding learning, BriVL adopts contrastive learning with the InfoNCE loss. Given a text embedding, the objective is to identify the matching image embedding among a batch of image embeddings; symmetrically, given an image embedding, the objective is to identify the matching text embedding among a batch of text embeddings. The image and text encoders are trained jointly to maximize the cosine similarity between the embeddings of each true image-text pair in the batch while minimizing the cosine similarity between the embeddings of all incorrect pairs.
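The symmetric InfoNCE objective described above can be sketched as follows. This is a minimal NumPy illustration, not the BriVL implementation: the batch size, embedding dimension, and temperature value are illustrative assumptions, and the real model would use learned image and text encoders rather than raw vectors.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def info_nce_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of image/text embedding pairs.

    Row i of img_emb and txt_emb form the true pair; every other row in
    the batch serves as a negative for row i.
    """
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    # Pairwise cosine similarities, scaled by the temperature.
    logits = img @ txt.T / temperature
    n = logits.shape[0]

    def cross_entropy_diag(l):
        # Cross-entropy with the true pair (the diagonal) as the target class.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
txt = img + 0.01 * rng.normal(size=(4, 8))  # nearly aligned true pairs
loss = info_nce_loss(img, txt)
```

With well-aligned true pairs the loss is driven toward zero, while mismatched pairings in the batch raise it, which is the behavior the contrastive objective exploits.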

Source: WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training

Task                      Papers  Share
Image Retrieval           3       23.08%
Retrieval                 3       23.08%
Benchmarking              1       7.69%
Image Classification      1       7.69%
Cross-Modal Retrieval     1       7.69%
Language Modelling        1       7.69%
Text Classification       1       7.69%
Image Captioning          1       7.69%
Image-to-Text Retrieval   1       7.69%
