Vision and Language Pre-Trained Models

XGPT is a cross-modal generative pre-training method for image captioning. It pre-trains text-to-image caption generators through three novel generation tasks: image-conditioned masked language modeling (IMLM), image-conditioned denoising autoencoding (IDA), and text-conditioned image feature generation (TIFG). The pre-trained XGPT can be fine-tuned without any task-specific architecture modifications and builds strong image captioning models.

Source: XGPT: Cross-modal Generative Pre-Training for Image Captioning
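To make the IMLM objective concrete, here is a minimal sketch of the input-corruption step it relies on: randomly replacing caption tokens with a mask symbol, so that the model must reconstruct them conditioned on the paired image features. The function name, mask rate default, and token representation are illustrative assumptions, not code from the paper.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_for_imlm(caption_tokens, mask_prob=0.15, rng=None):
    """Corrupt a caption for image-conditioned masked language modeling.

    Hypothetical helper (not the paper's code): each token is replaced
    with MASK_TOKEN with probability `mask_prob`; the original token is
    kept as the prediction target at masked positions, and positions
    left intact get no target (no loss is computed there).
    """
    rng = rng or random.Random(0)
    masked, targets = [], []
    for tok in caption_tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)   # model must recover `tok` here,
            targets.append(tok)         # using the image as conditioning
        else:
            masked.append(tok)
            targets.append(None)        # unmasked: excluded from the loss
    return masked, targets

caption = "a dog runs across the grassy field".split()
masked_caption, targets = mask_for_imlm(caption, mask_prob=0.5)
print(masked_caption)
```

IDA works analogously but feeds the model a noised version of the whole caption and asks it to regenerate the clean sequence, while TIFG reverses the direction: the text conditions the generation of image features.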

Task Papers Share
Denoising 1 16.67%
Image Captioning 1 16.67%
Image Retrieval 1 16.67%
Language Modelling 1 16.67%
Retrieval 1 16.67%
Visual Question Answering (VQA) 1 16.67%