1 code implementation • 19 Oct 2023 • Yuduo Wang, Pedram Ghamisi
With the rapid advancement of transformer models in recent years, transformer-based multimodal architectures have been widely applied to downstream tasks such as image captioning, visual question answering (VQA), and image-text generation.