Vision and Language Pre-Trained Models

Learning Cross-Modality Encoder Representations from Transformers

Introduced by Tan et al. in LXMERT: Learning Cross-Modality Encoder Representations from Transformers

LXMERT is a model for learning vision-and-language cross-modality representations. It is a Transformer model composed of three encoders: an object relationship encoder, a language encoder, and a cross-modality encoder. The model takes two inputs: an image and its related sentence. Each image is represented as a sequence of objects, and each sentence is represented as a sequence of words. By combining self-attention and cross-attention layers, the model generates language representations, image representations, and cross-modality representations from the input. The model is pre-trained on image-sentence pairs via five pre-training tasks: masked language modeling, masked object prediction (feature regression and label classification), cross-modality matching, and image question answering. These tasks help the model learn both intra-modality and cross-modality relationships.

Source: LXMERT: Learning Cross-Modality Encoder Representations from Transformers
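
The way self-attention and cross-attention combine in the cross-modality encoder can be illustrated with a minimal sketch. It is not the authors' implementation: the names `attend` and `cross_modality_layer` are hypothetical, the attention is single-head with no learned projections, and residual connections, layer normalization, and feed-forward sub-layers are left out. The sub-layer ordering (cross-attention, then self-attention) is an assumption of this sketch.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, context):
    """Single-head scaled dot-product attention (hypothetical helper).

    Assumption: no learned query/key/value projections, so the context
    vectors serve as both keys and values.
    """
    dim = len(queries[0])
    output = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context
        ]
        weights = softmax(scores)
        output.append(
            [sum(w * v[i] for w, v in zip(weights, context)) for i in range(dim)]
        )
    return output


def cross_modality_layer(lang, vis):
    """One simplified cross-modality layer (hypothetical helper)."""
    # Cross-attention: each modality queries the other one.
    lang_x = attend(lang, vis)
    vis_x = attend(vis, lang)
    # Self-attention: each modality then attends to itself.
    return attend(lang_x, lang_x), attend(vis_x, vis_x)


random.seed(0)
DIM = 4  # toy feature size, chosen only for illustration
words = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(5)]    # 5 word vectors
objects = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(3)]  # 3 object vectors

lang_out, vis_out = cross_modality_layer(words, objects)
print(len(lang_out), len(lang_out[0]))  # 5 4
print(len(vis_out), len(vis_out[0]))    # 3 4
```

Each stream keeps its own sequence length (five words, three objects), while every output vector now mixes in information from the other modality, which is the intuition behind the cross-modality representations described above.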

Tasks


Task                               Papers  Share
Visual Question Answering          17      16.83%
Visual Question Answering (VQA)    17      16.83%
Question Answering                 17      16.83%
Language Modelling                 5       4.95%
Sentence                           5       4.95%
Retrieval                          5       4.95%
Image-text matching                4       3.96%
Text Matching                      4       3.96%
Visual Reasoning                   4       3.96%