2 code implementations • 23 Oct 2023 • Maryam Hashemi, Ghazaleh Mahmoudi, Sara Kodeiri, Hadi Sheikhi, Sauleh Eetemadi
Large-scale pretrained models such as LXMERT are becoming popular for learning cross-modal representations on text-image pairs for vision-language tasks.
Ranked #1 on
Visual Question Answering
on VQA v2 test-std