LXMERT is a model for learning vision-and-language cross-modality representations. It is a Transformer model with three encoders: an object relationship encoder, a language encoder, and a cross-modality encoder. The model takes two inputs: an image and its related sentence. Each image is represented as a sequence of objects, and each sentence is represented as a sequence of words. By combining self-attention and cross-attention layers, the model is able to generate language representations, image representations, and cross-modality representations from the input. The model is pre-trained on image-sentence pairs via five pre-training tasks: masked language modeling, masked object prediction (feature regression and label classification), cross-modality matching, and image question answering. These tasks help the model learn both intra-modality and cross-modality relationships.
Source: LXMERT: Learning Cross-Modality Encoder Representations from Transformers
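The snippet below is a minimal sketch of this flow using the Hugging Face `transformers` implementation of LXMERT (an assumption, not part of this page): the sentence is tokenized into a word sequence, the image is supplied as a sequence of detected-object features and box positions (random placeholders here standing in for Faster R-CNN RoI features), and the model returns per-token language representations, per-object image representations, and a pooled cross-modality representation.

```python
import torch
from transformers import LxmertTokenizer, LxmertModel

tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-base-uncased")
model = LxmertModel.from_pretrained("unc-nlp/lxmert-base-uncased")

# Language input: the sentence paired with the image, as a sequence of word tokens.
inputs = tokenizer("A cat is sitting on the couch.", return_tensors="pt")

# Vision input: a sequence of detected objects. Placeholder tensors here stand in
# for Faster R-CNN RoI features (36 objects, 2048-d) and normalized box coordinates (4-d).
visual_feats = torch.randn(1, 36, 2048)
visual_pos = torch.rand(1, 36, 4)

with torch.no_grad():
    outputs = model(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        visual_feats=visual_feats,
        visual_pos=visual_pos,
    )

print(outputs.language_output.shape)  # per-token language representations
print(outputs.vision_output.shape)    # per-object image representations
print(outputs.pooled_output.shape)    # pooled cross-modality representation
```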
Papers that use LXMERT, grouped by task:

| Task | Papers | Share |
|---|---|---|
| Question Answering | 18 | 14.75% |
| Visual Question Answering | 18 | 14.75% |
| Visual Question Answering (VQA) | 18 | 14.75% |
| Language Modeling | 5 | 4.10% |
| Language Modelling | 5 | 4.10% |
| Sentence | 5 | 4.10% |
| Retrieval | 5 | 4.10% |
| Image-text matching | 4 | 3.28% |
| Text Matching | 4 | 3.28% |