no code implementations • 23 Jan 2024 • Apoorva Beedu, Karan Samel, Irfan Essa
Compared to existing methods, MAT has the advantage of learning additional environmental context from two kinds of text inputs: action descriptions during the pre-training stage, and the text inputs for detected objects and actions during modality feature fusion.
no code implementations • 3 Sep 2023 • Hyeongju Choi, Apoorva Beedu, Irfan Essa
However, a major component of successful contrastive learning is the selection of good positive and negative samples.
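The importance of sample selection can be illustrated with a standard InfoNCE-style contrastive objective, in which a well-chosen positive (an aligned view of the anchor) yields a much lower loss than a poorly chosen one. This is a minimal NumPy sketch of the general technique, not the paper's actual implementation; all names and values here are illustrative.

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor: pull the positive close, push negatives away."""
    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    # logit 0 is the positive pair; the rest are negatives
    logits = np.array([cos(anchor, positive)] + [cos(anchor, n) for n in negatives])
    logits /= temperature
    exp = np.exp(logits - logits.max())          # numerically stable softmax
    return -np.log(exp[0] / exp.sum())           # cross-entropy against index 0

anchor = np.array([1.0, 0.0])
good_positive = np.array([0.9, 0.1])             # aligned view of the anchor
negatives = [np.array([-1.0, 0.0])]

loss_good = info_nce(anchor, good_positive, negatives)
loss_bad = info_nce(anchor, negatives[0], [good_positive])  # positive/negative swapped
```

Swapping the positive and negative raises the loss sharply, which is why deliberate positive/negative selection matters.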
no code implementations • 8 Nov 2022 • Hyeongju Choi, Apoorva Beedu, Harish Haresamudram, Irfan Essa
In this work, we propose a multi-modal framework that learns to effectively combine features from RGB video and IMU sensors, and demonstrate its robustness on the MMAct and UTD-MHAD datasets.
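A common way to combine per-modality features is late fusion: encode each modality separately, concatenate the embeddings, and project them into a joint space. The sketch below shows that pattern under assumed dimensions (512-d video, 128-d IMU, 256-d joint space); the paper's actual fusion may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse(rgb_feat, imu_feat, w):
    """Late fusion: concatenate per-modality embeddings, then project jointly."""
    joint = np.concatenate([rgb_feat, imu_feat])
    return np.tanh(w @ joint)

rgb = rng.normal(size=512)                 # stand-in for a video-backbone embedding
imu = rng.normal(size=128)                 # stand-in for an IMU-encoder embedding
w = rng.normal(size=(256, 512 + 128)) * 0.01  # joint projection (learned in practice)
z = fuse(rgb, imu, w)                      # fused 256-d representation
```

The fused vector `z` would then feed an activity classifier or, in a contrastive setup, be aligned across modalities.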
no code implementations • 26 Oct 2022 • Huda Alamri, Anthony Bilic, Michael Hu, Apoorva Beedu, Irfan Essa
Video-based dialog is a challenging multimodal learning task that has received increasing attention over the past few years, with state-of-the-art systems setting new performance records.
1 code implementation • 24 Oct 2022 • Apoorva Beedu, Huda Alamri, Irfan Essa
We introduce VideoPose, a Transformer-based 6D object pose estimation framework comprising an end-to-end attention-based architecture that attends to previous frames to estimate accurate 6D object poses in videos.
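The core mechanism of attending to previous frames can be sketched as scaled dot-product attention in which the current frame's feature acts as the query and past-frame features serve as keys and values. This is a generic illustration of that attention pattern, not VideoPose's actual architecture; shapes and names are assumptions.

```python
import numpy as np

def attend_to_past(current, past):
    """Scaled dot-product attention: the current-frame feature (query)
    aggregates past-frame features (keys = values) into one context vector."""
    d = current.shape[-1]
    scores = past @ current / np.sqrt(d)     # (T,) similarity to each past frame
    weights = np.exp(scores - scores.max())  # stable softmax
    weights /= weights.sum()
    return weights @ past                    # weighted sum over past frames

rng = np.random.default_rng(1)
current = rng.normal(size=64)                # feature of the frame being estimated
past = rng.normal(size=(5, 64))              # features of 5 previous frames
context = attend_to_past(current, past)
```

The resulting context vector would be combined with the current frame's feature before regressing the pose, letting temporal evidence smooth per-frame estimates.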
no code implementations • 20 Nov 2021 • Apoorva Beedu, Zhile Ren, Varun Agrawal, Irfan Essa
We introduce a simple yet effective algorithm that uses convolutional neural networks to directly estimate object poses from videos.
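A network that directly regresses 6D poses typically emits a raw vector that is split into a 3-D translation and a rotation, with the rotation re-normalized to be valid. This is a minimal sketch of one common parameterization (translation plus unit quaternion); the paper's actual output head may use a different rotation representation.

```python
import numpy as np

def to_pose(raw):
    """Split a raw 7-D network output into a 3-D translation and a unit quaternion."""
    t, q = raw[:3], raw[3:7]
    q = q / np.linalg.norm(q)   # re-normalize so the quaternion encodes a valid rotation
    return t, q

# illustrative raw output of a regression head
t, q = to_pose(np.array([0.1, -0.2, 0.5, 1.0, 2.0, 0.0, 0.0]))
```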