XGPT is a method of cross-modal generative pre-training for image captioning, designed to pre-train text-to-image caption generators through three novel generation tasks: image-conditioned masked language modeling (IMLM), image-conditioned denoising autoencoding (IDA), and text-conditioned image feature generation (TIFG). The pre-trained XGPT can be fine-tuned, without any task-specific architecture modifications, to build strong image captioning models.
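As a rough illustration of the IMLM-style objective described above, the sketch below shows how a caption might be corrupted by masking tokens that the model must then reconstruct conditioned on image features. This is a minimal, hypothetical sketch: the function name, masking rate, and `[MASK]` token are assumptions for illustration, not XGPT's actual implementation, and the image-conditioning step is not modeled here.

```python
import random

MASK = "[MASK]"

def mask_caption(tokens, mask_prob=0.15, rng=None):
    """IMLM-style input corruption (illustrative only): randomly replace
    caption tokens with [MASK]; during pre-training the model would
    reconstruct them conditioned on image region features."""
    rng = rng or random.Random(0)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets.append(tok)   # token the decoder must predict
        else:
            masked.append(tok)
            targets.append(None)  # position is not a prediction target
    return masked, targets
```

For example, `mask_caption("a dog runs on grass".split())` returns a corrupted token list plus a parallel list marking which positions carry reconstruction targets.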
Source: XGPT: Cross-modal Generative Pre-Training for Image Captioning
| Task | Papers | Share |
|---|---|---|
| Denoising | 1 | 16.67% |
| Image Captioning | 1 | 16.67% |
| Image Retrieval | 1 | 16.67% |
| Language Modelling | 1 | 16.67% |
| Retrieval | 1 | 16.67% |
| Visual Question Answering (VQA) | 1 | 16.67% |