no code implementations • 25 Mar 2024 • Zhuowan Li, Bhavan Jasani, Peng Tang, Shabnam Ghadar
In particular, our approach improves the accuracy of the previous state-of-the-art approach from 38% to 54% on the human-written questions in the ChartQA dataset, which require strong reasoning.
1 code implementation • 15 Nov 2022 • Chih-Hui Ho, Srikar Appalaraju, Bhavan Jasani, R. Manmatha, Nuno Vasconcelos
We present YORO - a multi-modal transformer encoder-only architecture for the Visual Grounding (VG) task.
1 code implementation • ICCV 2021 • Srikar Appalaraju, Bhavan Jasani, Bhargava Urala Kota, Yusheng Xie, R. Manmatha
DocFormer uses text, vision and spatial features and combines them using a novel multi-modal self-attention layer.
Ranked #3 on Document Image Classification on RVL-CDIP
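The entry above describes combining text, vision, and spatial features via multi-modal self-attention. A minimal sketch of that idea, using NumPy: per-token features from the three modalities are fused by summation and fed through one scaled dot-product attention head. The fusion-by-addition and the random weight matrices are illustrative assumptions, not DocFormer's exact design.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multimodal_self_attention(text, vision, spatial, w_q, w_k, w_v):
    """Fuse per-token text, visual, and spatial features by summation,
    then apply a single scaled dot-product self-attention head."""
    x = text + vision + spatial           # (seq_len, d) fused token features
    q, k, v = x @ w_q, x @ w_k, x @ w_v   # project to queries/keys/values
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d))  # (seq_len, seq_len) attention weights
    return attn @ v                       # attended token representations

rng = np.random.default_rng(0)
seq, d = 4, 8
out = multimodal_self_attention(
    rng.normal(size=(seq, d)),  # text embeddings (toy values)
    rng.normal(size=(seq, d)),  # visual features (toy values)
    rng.normal(size=(seq, d)),  # spatial/layout embeddings (toy values)
    *(rng.normal(size=(d, d)) for _ in range(3)),
)
print(out.shape)  # (4, 8)
```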
no code implementations • 26 Nov 2019 • Bhavan Jasani, Afshaan Mazagonwalla
In this work, we present a body-pose-based zero-shot action recognition network and demonstrate its performance on the NTU RGB-D dataset.
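Zero-shot action recognition of the kind described above is commonly done by projecting a pose feature into a semantic space and matching it against embeddings of unseen class names. The sketch below illustrates only that generic matching step with random stand-in vectors; the projection and embeddings are hypothetical, not the paper's trained components.

```python
import numpy as np

def zero_shot_classify(pose_feat, class_embs, proj):
    """Project a pooled pose-sequence feature into the semantic space and
    return the index of the most cosine-similar class embedding."""
    z = pose_feat @ proj                 # map pose feature to semantic space
    z = z / np.linalg.norm(z)
    c = class_embs / np.linalg.norm(class_embs, axis=1, keepdims=True)
    sims = c @ z                         # cosine similarity to each class
    return int(np.argmax(sims))

rng = np.random.default_rng(1)
pose_feat = rng.normal(size=64)         # pooled skeleton-sequence feature (toy)
class_embs = rng.normal(size=(10, 32))  # embeddings of 10 unseen action names (toy)
proj = rng.normal(size=(64, 32))        # projection matrix (random stand-in)
pred = zero_shot_classify(pose_feat, class_embs, proj)
print(pred)
```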
no code implementations • 8 Nov 2019 • Bhavan Jasani, Rohit Girdhar, Deva Ramanan
Joint vision and language tasks like visual question answering are fascinating because they explore high-level understanding, but at the same time, can be more prone to language biases.
no code implementations • 19 May 2018 • Yash Patel, Kashyap Chitta, Bhavan Jasani
We address the problem of semi-supervised domain adaptation of classification algorithms through deep Q-learning.
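The entry above frames semi-supervised domain adaptation as a deep Q-learning problem. The toy tabular version below only shows the Q-update mechanic on a label-or-skip decision; the confidence buckets, reward values, and action set are all illustrative assumptions, not the paper's actual deep agent.

```python
import numpy as np

def q_learning_sample_selection(rewards, episodes=500, alpha=0.1, eps=0.2, seed=0):
    """Toy Q-learning sketch: learn, per confidence bucket of unlabeled
    target samples, whether pseudo-labeling them pays off.
    States are buckets; action 1 = pseudo-label, action 0 = skip."""
    rng = np.random.default_rng(seed)
    q = np.zeros((len(rewards), 2))            # Q[state, action] table
    for _ in range(episodes):
        s = rng.integers(len(rewards))         # random bucket this episode
        a = rng.integers(2) if rng.random() < eps else int(np.argmax(q[s]))
        r = rewards[s] if a == 1 else 0.0      # skipping yields no reward
        q[s, a] += alpha * (r - q[s, a])       # one-step (bandit-style) update
    return q

# Assumed expected rewards for low / medium / high confidence buckets.
q = q_learning_sample_selection(rewards=[-0.8, 0.2, 0.9])
print(np.argmax(q, axis=1))  # learned label/skip policy per bucket
```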