Multi-modal Factorized Bilinear Pooling with Co-Attention Learning for Visual Question Answering

Visual question answering (VQA) is challenging because it requires a simultaneous understanding of both the visual content of images and the textual content of questions. The approaches used to represent the images and questions in a fine-grained manner and questions and to fuse these multi-modal features play key roles in performance... (read more)

Results in Papers With Code
(↓ scroll down to see all results)