1 code implementation • 19 Apr 2023 • Lalithkumar Seenivasan, Mobarakol Islam, Gokul Kannan, Hongliang Ren
Because attention in GPT models is unidirectional, each token can attend only to the tokens that precede it, yet this same property underlies their ability to generate coherent long paragraphs. We therefore carefully sequence the word tokens before the vision tokens, so that the vision tokens are processed with the full question in context, mimicking the human thought process of first understanding the question and then inferring an answer from the image.
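A minimal sketch of this ordering idea, using a lower-triangular (causal) attention mask. The function names, sequence lengths, and embedding dimension are illustrative assumptions, not details from the paper; the point is only that, under unidirectional attention, placing word tokens first lets every vision token attend to the whole question while the reverse is impossible.

```python
import numpy as np

def build_multimodal_sequence(word_emb, vision_emb):
    """Hypothetical helper: concatenate word embeddings BEFORE vision
    embeddings, so vision positions come later in the sequence."""
    return np.concatenate([word_emb, vision_emb], axis=0)

def causal_mask(n):
    """Unidirectional attention: position i may attend only to j <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

word_emb = np.random.randn(6, 32)    # e.g. 6 question word tokens
vision_emb = np.random.randn(4, 32)  # e.g. 4 visual patch tokens
seq = build_multimodal_sequence(word_emb, vision_emb)
mask = causal_mask(seq.shape[0])

# Every vision position (rows 6..9) can attend to all word positions (cols 0..5),
assert mask[6:, :6].all()
# but no word position can attend forward to a vision position.
assert not mask[:6, 6:].any()
```

With the opposite ordering (vision first), the question tokens would instead be encoded without access to the visual features, which is what the word-first sequencing avoids.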