no code implementations • 5 Mar 2024 • Yuan Gao, Kunyu Shi, Pengkai Zhu, Edouard Belval, Oren Nuriel, Srikar Appalaraju, Shabnam Ghadar, Vijay Mahadevan, Zhuowen Tu, Stefano Soatto
We propose Strongly Supervised pre-training with ScreenShots (S4) - a novel pre-training paradigm for Vision-Language Models using data from large-scale web screenshot rendering.
no code implementations • 15 Nov 2023 • Peng Tang, Pengkai Zhu, Tian Li, Srikar Appalaraju, Vijay Mahadevan, R. Manmatha
Based on the multi-exit model, we perform step-level dynamic early exit during inference: at each individual decoding step, the model may decide to use fewer decoder layers based on its confidence at the current layer.
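The idea above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the "decoder layers" are stand-in matrix transforms, the confidence measure (max softmax probability) and the threshold are assumptions, and the shared output head applied at every exit is illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def early_exit_decode_step(hidden, layers, head, threshold=0.9):
    """Run decoder layers one at a time for the current decoding step;
    stop as soon as the prediction is confident enough, skipping the
    remaining layers. Returns the predicted token and layers used."""
    used = 0
    probs = None
    for W in layers:
        hidden = np.tanh(hidden @ W)    # stand-in for one decoder layer
        used += 1
        probs = softmax(hidden @ head)  # shared output head at every exit
        if probs.max() > threshold:     # confident at this layer: exit early
            break
    return probs.argmax(), used
```

Because the exit decision is made per step, easy tokens can be emitted after a few layers while harder tokens still use the full stack.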
no code implementations • 15 Nov 2023 • Peng Tang, Srikar Appalaraju, R. Manmatha, Yusheng Xie, Vijay Mahadevan
We present Multiple-Question Multiple-Answer (MQMA), a novel approach to text-VQA in encoder-decoder transformer models.
no code implementations • 25 Oct 2023 • Yoshinari Fujinuma, Siddharth Varia, Nishant Sankaran, Srikar Appalaraju, Bonan Min, Yogarshi Vyas
Document image classification differs from plain-text document classification: the model must understand both the content and the visual structure of documents such as forms and emails.
1 code implementation • 2 Jun 2023 • Srikar Appalaraju, Peng Tang, Qi Dong, Nishant Sankaran, Yichu Zhou, R. Manmatha
We propose DocFormerv2, a multi-modal transformer for Visual Document Understanding (VDU).
Ranked #9 on Visual Question Answering (VQA) on DocVQA test (using extra training data)
no code implementations • 7 Feb 2023 • Yash Patel, Yusheng Xie, Yi Zhu, Srikar Appalaraju, R. Manmatha
Instead of purely relying on the alignment from the noisy data, this paper proposes a novel loss function termed SimCon, which accounts for intra-modal similarities to determine the appropriate set of positive samples to align.
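A loss in this spirit can be sketched as follows. This is an illustrative sketch, not the paper's exact formulation: the positive set for each image is expanded using intra-modal (image-image) similarity, so captions of visually similar images are not pushed away as hard negatives. The temperature `tau` and the similarity cutoff `sim_thresh` are assumed hyperparameters.

```python
import numpy as np

def simcon_style_loss(img, txt, tau=0.1, sim_thresh=0.8):
    """Contrastive image-to-text loss whose positives are chosen via
    intra-modal similarity, not just the paired caption."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / tau   # cross-modal similarities
    intra = img @ img.T          # intra-modal similarities
    pos = intra > sim_thresh     # positive set: visually similar pairs
                                 # (always includes the diagonal pair)
    log_p = logits - np.log(np.exp(logits).sum(1, keepdims=True))
    # average log-likelihood over the (possibly multiple) positives per row
    return -(log_p * pos).sum(1) / pos.sum(1)
```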
1 code implementation • 15 Nov 2022 • Chih-Hui Ho, Srikar Appalaraju, Bhavan Jasani, R. Manmatha, Nuno Vasconcelos
We present YORO - a multi-modal transformer encoder-only architecture for the Visual Grounding (VG) task.
1 code implementation • 16 Jun 2022 • Xiaoshuai Hao, Yi Zhu, Srikar Appalaraju, Aston Zhang, Wanqian Zhang, Bo Li, Mu Li
Data augmentation is necessary to enhance data efficiency in deep learning.
no code implementations • 30 Mar 2022 • Simone Bombari, Alessandro Achille, Zijian Wang, Yu-Xiang Wang, Yusheng Xie, Kunwar Yashraj Singh, Srikar Appalaraju, Vijay Mahadevan, Stefano Soatto
While bounding general memorization can have detrimental effects on the performance of a trained model, bounding RM does not prevent effective learning.
1 code implementation • CVPR 2022 • Ali Furkan Biten, Ron Litman, Yusheng Xie, Srikar Appalaraju, R. Manmatha
Accounting for this, we propose a single objective pre-training scheme that requires only text and spatial cues.
1 code implementation • ICCV 2021 • Srikar Appalaraju, Bhavan Jasani, Bhargava Urala Kota, Yusheng Xie, R. Manmatha
DocFormer uses text, vision and spatial features and combines them using a novel multi-modal self-attention layer.
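A hedged sketch of such a fusion step, under assumed shapes and with projections omitted: text and vision features attend separately, but both share the same spatial (bounding-box) embeddings, which are injected into the queries and keys. This is a single-head simplification, not DocFormer's actual layer.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def multimodal_attention(text, vision, spatial):
    """One fused attention step: each modality attends over itself with
    shared spatial embeddings added to queries and keys, and the
    modality-specific outputs are summed."""
    d = text.shape[-1]
    def attend(x):
        q, k = x + spatial, x + spatial       # spatial cues shared by both modalities
        scores = softmax(q @ k.T / np.sqrt(d))
        return scores @ x
    return attend(text) + attend(vision)      # fuse modality-specific outputs
```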
Ranked #3 on Document Image Classification on RVL-CDIP
no code implementations • 1 Dec 2020 • Srikar Appalaraju, Yi Zhu, Yusheng Xie, István Fehérvári
Self-supervised representation learning has seen remarkable progress in the last few years.
no code implementations • 12 Feb 2020 • Yash Patel, Srikar Appalaraju, R. Manmatha
The proposed compression model incorporates the salient regions and optimizes on the proposed perceptual similarity metric.
1 code implementation • 28 Nov 2019 • Istvan Fehervari, Avinash Ravichandran, Srikar Appalaraju
Deep metric learning (DML) is a popular approach for image retrieval, verification (same or not), and open-set classification.
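The canonical DML objective can be sketched with a triplet-margin loss: pull an anchor toward a positive of the same class and push it away from a negative, so that nearest-neighbour lookup works in embedding space. The margin value here is an assumed hyperparameter, not one from the paper.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Batched triplet-margin loss on embedding rows: penalize triplets
    where the negative is not at least `margin` farther than the positive."""
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()
```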
no code implementations • 9 Aug 2019 • Yash Patel, Srikar Appalaraju, R. Manmatha
Recently, there has been much interest in using deep learning for image compression, and there have been claims that several of these methods produce better results than engineered compression schemes (such as JPEG, JPEG2000, or BPG).
no code implementations • 18 Jul 2019 • Yash Patel, Srikar Appalaraju, R. Manmatha
In several cases, the MS-SSIM for deep-learned techniques is higher than that of a conventional, non-deep-learned codec such as JPEG-2000 or BPG.
no code implementations • 19 Nov 2018 • Istvan Fehervari, Srikar Appalaraju
Logo recognition is challenging: there is no clear definition of a logo, logos and brands vary enormously, and re-training to cover every variation is impractical.
1 code implementation • 26 Sep 2017 • Srikar Appalaraju, Vineet Chaoji
Image similarity involves fetching similar-looking images given a reference image.