LXMERT is a model for learning vision-and-language cross-modality representations. It is a Transformer model with three encoders: an object relationship encoder, a language encoder, and a cross-modality encoder. The model takes two inputs: an image and its related sentence. Each image is represented as a sequence of objects, and each sentence is represented as a sequence of words. By combining self-attention and cross-attention layers, the model is able to generate language representations, image representations, and cross-modality representations from the input. The model is pre-trained on image-sentence pairs via five pre-training tasks: masked language modeling, masked object prediction (feature regression and label classification), cross-modality matching, and image question answering. These tasks help the model learn both intra-modality and cross-modality relationships.
Source: LXMERT: Learning Cross-Modality Encoder Representations from Transformers
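The snippet below is a minimal sketch of this flow using the Hugging Face `transformers` implementation of LXMERT (an assumption, not part of this page): the sentence is tokenized into a word sequence, the image is supplied as a sequence of detected-object features and box positions (random placeholders here standing in for Faster R-CNN RoI features), and the model returns per-token language representations, per-object image representations, and a pooled cross-modality representation.

```python
import torch
from transformers import LxmertTokenizer, LxmertModel

tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-base-uncased")
model = LxmertModel.from_pretrained("unc-nlp/lxmert-base-uncased")

# Language input: the sentence paired with the image, as a sequence of word tokens.
inputs = tokenizer("A cat is sitting on the couch.", return_tensors="pt")

# Vision input: a sequence of detected objects. Placeholder tensors here stand in
# for Faster R-CNN RoI features (36 objects, 2048-d) and normalized box coordinates (4-d).
visual_feats = torch.randn(1, 36, 2048)
visual_pos = torch.rand(1, 36, 4)

with torch.no_grad():
    outputs = model(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        visual_feats=visual_feats,
        visual_pos=visual_pos,
    )

print(outputs.language_output.shape)  # per-token language representations
print(outputs.vision_output.shape)    # per-object image representations
print(outputs.pooled_output.shape)    # pooled cross-modality representation
```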
Papers that use LXMERT, grouped by task:

| Task | Papers | Share |
|---|---|---|
| Question Answering | 18 | 14.75% |
| Visual Question Answering | 18 | 14.75% |
| Visual Question Answering (VQA) | 18 | 14.75% |
| Language Modeling | 5 | 4.10% |
| Language Modelling | 5 | 4.10% |
| Sentence | 5 | 4.10% |
| Retrieval | 5 | 4.10% |
| Image-text matching | 4 | 3.28% |
| Text Matching | 4 | 3.28% |