8 code implementations • ICLR 2020 • Erik Wijmans, Abhishek Kadian, Ari Morcos, Stefan Lee, Irfan Essa, Devi Parikh, Manolis Savva, Dhruv Batra
We leverage this scaling to train an agent for 2.5 billion steps of experience (the equivalent of 80 years of human experience) -- over 6 months of GPU-time training in under 3 days of wall-clock time with 64 GPUs.
Ranked #1 on PointGoal Navigation on Gibson PointGoal Navigation
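The scaling claim above can be sanity-checked with simple arithmetic; the 30-day month is an assumed conversion, everything else comes from the stated numbers:

```python
# Sanity-check the DD-PPO throughput claim: 2.5 billion steps of
# experience on 64 GPUs in under 3 days of wall-clock time.
SECONDS_PER_DAY = 86_400

num_gpus = 64
wall_clock_days = 3.0
total_steps = 2_500_000_000

gpu_days = num_gpus * wall_clock_days        # 192 GPU-days
gpu_months = gpu_days / 30.0                 # ~6.4 months of GPU-time

# Aggregate and per-GPU environment-step throughput implied by the claim.
steps_per_second = total_steps / (wall_clock_days * SECONDS_PER_DAY)
steps_per_gpu_per_second = steps_per_second / num_gpus

print(f"{gpu_months:.1f} GPU-months, ~{steps_per_gpu_per_second:.0f} steps/s per GPU")
```

So "under 3 days on 64 GPUs" indeed works out to a bit over 6 months of GPU-time, at roughly 150 environment steps per second per GPU.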
2 code implementations • CVPR 2023 • Dina Bashkirova, Jose Lezama, Kihyuk Sohn, Kate Saenko, Irfan Essa
We show that intermediate self-attention maps of a masked generative transformer encode important structural information of the input image, such as scene layout and object shape, and we propose a novel sampling method based on this observation to enable structure-guided generation.
1 code implementation • 11 Oct 2022 • Erik Wijmans, Irfan Essa, Dhruv Batra
Specifically, the Pick skill involves a robot picking an object from a table.
1 code implementation • CVPR 2023 • Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G. Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, Lu Jiang
We introduce the MAsked Generative VIdeo Transformer, MAGVIT, to tackle various video synthesis tasks with a single model.
Ranked #1 on Video Prediction on Something-Something V2
1 code implementation • 9 Sep 2022 • José Lezama, Huiwen Chang, Lu Jiang, Irfan Essa
Given a masked-and-reconstructed real image, the Token-Critic model is trained to distinguish which visual tokens belong to the original image and which were sampled by the generative transformer.
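The sampling side of that idea can be sketched in a few lines of numpy; the critic scores below are random stand-ins for the trained Token-Critic's outputs, and the selection rule (keep the tokens rated most plausible, resample the rest) is the core mechanism:

```python
import numpy as np

def critic_guided_keep(tokens, critic_scores, keep_fraction):
    """Keep the tokens the critic rates most 'real'; flag the rest for
    resampling. Higher critic score = more plausible token."""
    n_keep = int(len(tokens) * keep_fraction)
    order = np.argsort(-critic_scores)           # most plausible first
    keep_mask = np.zeros(len(tokens), dtype=bool)
    keep_mask[order[:n_keep]] = True
    return keep_mask

rng = np.random.default_rng(0)
tokens = rng.integers(0, 1024, size=16)          # toy visual-token ids
scores = rng.random(16)                          # stand-in critic outputs
mask = critic_guided_keep(tokens, scores, keep_fraction=0.5)
```

In the actual method, the masked-out positions are then re-drawn by the generative transformer and the cycle repeats, so the critic steers sampling toward token combinations it judges consistent.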
1 code implementation • 25 May 2023 • Xingqian Xu, Jiayi Guo, Zhangyang Wang, Gao Huang, Irfan Essa, Humphrey Shi
Text-to-image (T2I) research has grown explosively in the past year, owing to the large-scale pre-trained diffusion models and many emerging personalization and editing approaches.
3 code implementations • 1 Jun 2023 • Kihyuk Sohn, Nataniel Ruiz, Kimin Lee, Daniel Castro Chin, Irina Blok, Huiwen Chang, Jarred Barber, Lu Jiang, Glenn Entis, Yuanzhen Li, Yuan Hao, Irfan Essa, Michael Rubinstein, Dilip Krishnan
Pre-trained large text-to-image models synthesize impressive images with an appropriate use of text prompts.
1 code implementation • ICLR 2022 • Chengzhi Mao, Lu Jiang, Mostafa Dehghani, Carl Vondrick, Rahul Sukthankar, Irfan Essa
Vision Transformer (ViT) is emerging as the state-of-the-art architecture for image recognition.
Ranked #3 on Domain Generalization on Stylized-ImageNet
7 code implementations • 11 Sep 2017 • Amirreza Shaban, Shray Bansal, Zhen Liu, Irfan Essa, Byron Boots
Low-shot learning methods for image classification support learning from sparse data.
1 code implementation • 2 Oct 2020 • Vincent Cartillier, Zhile Ren, Neha Jain, Stefan Lee, Irfan Essa, Dhruv Batra
We study the task of semantic mapping - specifically, an embodied agent (a robot or an egocentric AI assistant) is given a tour of a new environment and asked to build an allocentric top-down semantic map ("what is where?")
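A toy illustration of the map representation involved, assuming observations have already been localized in the allocentric frame; the fusion rule here (each cell keeps the label of the highest point seen in it) is a simplification of what a learned mapper would do:

```python
import numpy as np

def topdown_semantic_map(points, labels, cell_size, grid_shape):
    """Project labeled 3D points (x, y, z) into an allocentric top-down
    semantic grid answering "what is where?"."""
    grid = np.zeros(grid_shape, dtype=int)       # 0 means "unobserved"
    height = np.full(grid_shape, -np.inf)        # tallest z seen per cell
    for (x, y, z), label in zip(points, labels):
        i, j = int(x // cell_size), int(y // cell_size)
        if 0 <= i < grid_shape[0] and 0 <= j < grid_shape[1] and z > height[i, j]:
            grid[i, j] = label
            height[i, j] = z
    return grid

# Two points land in cell (0, 0); the higher one (label 2) wins.
points = [(0.5, 0.5, 0.1), (0.6, 0.4, 0.9), (2.5, 2.5, 0.2)]
labels = [1, 2, 3]
semmap = topdown_semantic_map(points, labels, cell_size=1.0, grid_shape=(4, 4))
```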
4 code implementations • 1 Jun 2018 • Huda Alamri, Vincent Cartillier, Raphael Gontijo Lopes, Abhishek Das, Jue Wang, Irfan Essa, Dhruv Batra, Devi Parikh, Anoop Cherian, Tim K. Marks, Chiori Hori
Scene-aware dialog systems will be able to have conversations with users about the objects and events around them.
2 code implementations • 21 Jun 2018 • Chiori Hori, Huda Alamri, Jue Wang, Gordon Wichern, Takaaki Hori, Anoop Cherian, Tim K. Marks, Vincent Cartillier, Raphael Gontijo Lopes, Abhishek Das, Irfan Essa, Dhruv Batra, Devi Parikh
We introduce a new dataset of dialogs about videos of human behaviors.
1 code implementation • 23 Jan 2018 • Daniel Castro, Steven Hickson, Patsorn Sangkloy, Bhavishya Mittal, Sean Dai, James Hays, Irfan Essa
We present a comparison of numerous state-of-the-art techniques on our dataset using three different representations (video, optical flow and multi-person pose data) in order to analyze these approaches.
2 code implementations • 25 Jan 2019 • Huda Alamri, Vincent Cartillier, Abhishek Das, Jue Wang, Anoop Cherian, Irfan Essa, Dhruv Batra, Tim K. Marks, Chiori Hori, Peter Anderson, Stefan Lee, Devi Parikh
We introduce the task of scene-aware dialog.
1 code implementation • CVPR 2023 • Kihyuk Sohn, Yuan Hao, José Lezama, Luisa Polania, Huiwen Chang, Han Zhang, Irfan Essa, Lu Jiang
We base our framework on state-of-the-art generative vision transformers that represent an image as a sequence of visual tokens fed to autoregressive or non-autoregressive transformers.
1 code implementation • CVPR 2014 • Steven Hickson, Stan Birchfield, Irfan Essa, Henrik Christensen
We present an efficient and scalable algorithm for segmenting 3D RGBD point clouds by combining depth, color, and temporal information using a multistage, hierarchical graph-based approach.
1 code implementation • 24 Oct 2022 • Apoorva Beedu, Huda Alamri, Irfan Essa
We introduce VideoPose, a Transformer-based 6D object pose estimation framework comprising an end-to-end attention-based architecture that attends to previous frames to estimate accurate 6D object poses in videos.
1 code implementation • 16 Jun 2019 • Steven Hickson, Karthik Raveendran, Alireza Fathi, Kevin Murphy, Irfan Essa
We propose 4 insights that help to significantly improve the performance of deep learning models that predict surface normals and semantic labels from a single RGB image.
Ranked #1 on Semantic Segmentation on ScanNetV2 (Pixel Accuracy metric)
1 code implementation • 11 Aug 2020 • Tianhao Zhang, Hung-Yu Tseng, Lu Jiang, Weilong Yang, Honglak Lee, Irfan Essa
In recent years, text-guided image manipulation has gained increasing attention in the multimedia and computer vision community.
1 code implementation • 9 Dec 2021 • Xiang Kong, Lu Jiang, Huiwen Chang, Han Zhang, Yuan Hao, Haifeng Gong, Irfan Essa
During inference, BLT first generates a draft layout from the input and then iteratively refines it into a high-quality layout by masking out low-confident attributes.
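The draft-then-refine loop reads roughly as follows; `predict` and `confidence` are hypothetical stand-ins for the trained layout transformer (here replaced by an oracle so the toy example provably converges):

```python
import numpy as np

def refine_layout(draft, predict, confidence, n_iters=3, mask_fraction=0.25):
    """Iteratively re-predict the least confident attributes of a layout.
    `predict(layout, mask)` returns new values for the masked slots and
    `confidence(layout)` scores each attribute; both are stand-ins for
    the trained layout transformer."""
    layout = draft.copy()
    for _ in range(n_iters):
        conf = confidence(layout)
        n_mask = max(1, int(len(layout) * mask_fraction))
        mask = np.argsort(conf)[:n_mask]         # least confident attributes
        layout[mask] = predict(layout, mask)
    return layout

# Oracle stand-ins: the "model" knows the target layout, so each pass
# fixes the worst attribute and the loop converges.
target = np.array([1.0, 2.0, 3.0, 4.0])
predict = lambda layout, mask: target[mask]
confidence = lambda layout: -np.abs(layout - target)

refined = refine_layout(np.zeros(4), predict, confidence, n_iters=4)
```

The key design choice is that only low-confidence attributes are re-predicted each pass, so high-quality parts of the draft are preserved across iterations.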
1 code implementation • 26 Jan 2018 • Steven Hickson, Anelia Angelova, Irfan Essa, Rahul Sukthankar
We consider the problem of retrieving objects from image data and learning to classify them into meaningful semantic categories with minimal supervision.
1 code implementation • 2 Aug 2017 • Steven Hickson, Irfan Essa, Henrik Christensen
Most of the approaches for indoor RGBD semantic labeling focus on using pixels or superpixels to train a classifier.
1 code implementation • 11 Nov 2022 • Harish Haresamudram, Irfan Essa, Thomas Ploetz
The dichotomy between the challenging nature of obtaining annotations for activities, and the more straightforward nature of data collection from wearables, has resulted in significant interest in the development of techniques that utilize large quantities of unlabeled data for learning representations.
1 code implementation • 24 Nov 2023 • Nikolai Warner, Meera Hahn, Jonathan Huang, Irfan Essa, Vighnesh Birodkar
We propose a new segmentation process, Text + Click segmentation, where a model takes as input an image, a text phrase describing a class to segment, and a single foreground click specifying the instance to segment.
no code implementations • 1 Jun 2018 • Aneeq Zia, Andrew Hung, Irfan Essa, Anthony Jarc
Adverse surgical outcomes are costly to patients and hospitals.
no code implementations • 22 Jan 2018 • Unaiza Ahsan, Chen Sun, Irfan Essa
We propose an action recognition framework using Generative Adversarial Networks.
no code implementations • 22 Dec 2017 • Aneeq Zia, Irfan Essa
In this paper, we explore the usage of different holistic features for automated skill assessment using only robot kinematic data and propose a weighted feature fusion technique for improving score prediction performance.
no code implementations • 22 Jul 2017 • Steven Hickson, Nick Dufour, Avneesh Sud, Vivek Kwatra, Irfan Essa
One of the main challenges of social interaction in virtual reality settings is that head-mounted displays occlude a large portion of the face, blocking facial expressions and thereby restricting social engagement cues among users.
no code implementations • 24 Feb 2017 • Aneeq Zia, Yachna Sharma, Vinay Bettadapura, Eric L. Sarin, Irfan Essa
Methods: We conduct the largest study, to the best of our knowledge, for basic surgical skills assessment on a dataset that contained video and accelerometer data for suturing and knot-tying tasks.
no code implementations • 17 Jan 2017 • Unaiza Ahsan, Chen Sun, James Hays, Irfan Essa
We propose to leverage concept-level representations for complex event recognition in photographs given limited training examples.
no code implementations • 18 Jan 2016 • Vinay Bettadapura, Daniel Castro, Irfan Essa
We present an approach for identifying picturesque highlights from large amounts of egocentric video data.
no code implementations • 25 Oct 2015 • S. Hussain Raza, Ahmad Humayun, Matthias Grundmann, David Anderson, Irfan Essa
Our proposed framework provides an efficient approach for finding temporally consistent occlusion boundaries in video by utilizing causality, redundancy in videos, and semantic layout of the scene.
no code implementations • CVPR 2013 • S. Hussain Raza, Matthias Grundmann, Irfan Essa
We present a novel algorithm for estimating the broad 3D geometric structure of outdoor video scenes.
no code implementations • 25 Oct 2015 • S. Hussain Raza, Omar Javed, Aveek Das, Harpreet Sawhney, Hui Cheng, Irfan Essa
We propose to learn and infer depth in videos from appearance, motion, occlusion boundaries, and geometric context of the scene.
no code implementations • 7 Oct 2015 • Vinay Bettadapura, Edison Thomaz, Aman Parnami, Gregory Abowd, Irfan Essa
The pervasiveness of mobile cameras has resulted in a dramatic increase in food photos, which are pictures reflecting what people eat.
no code implementations • 7 Oct 2015 • Vinay Bettadapura, Irfan Essa, Caroline Pantofaru
We present a technique that uses images, videos and sensor data taken from first-person point-of-view devices to perform egocentric field-of-view (FOV) localization.
no code implementations • CVPR 2013 • Vinay Bettadapura, Grant Schindler, Thomas Plötz, Irfan Essa
We present data-driven techniques to augment Bag of Words (BoW) models, which allow for more robust modeling and recognition of complex long-term activities, especially when the structure and topology of the activities are not known a priori.
no code implementations • 6 Oct 2015 • Daniel Castro, Steven Hickson, Vinay Bettadapura, Edison Thomaz, Gregory Abowd, Henrik Christensen, Irfan Essa
We collected a dataset of 40,103 egocentric images over a 6-month period with 19 activity classes and demonstrate the benefit of state-of-the-art deep learning techniques for learning and predicting daily activities.
no code implementations • 8 Feb 2012 • Seungyeon Kim, Fuxin Li, Guy Lebanon, Irfan Essa
Sentiment analysis predicts the presence of positive or negative emotions in a text document.
no code implementations • 22 Aug 2018 • Unaiza Ahsan, Rishi Madhok, Irfan Essa
We propose a self-supervised learning method to jointly reason about spatial and temporal context for video recognition.
no code implementations • 11 Sep 2018 • Jonathan C Balloch, Varun Agrawal, Irfan Essa, Sonia Chernova
We show that pretraining real-time segmentation architectures with synthetic segmentation data instead of ImageNet improves fine-tuning performance by reducing the bias learned in pretraining and closing the "transfer gap" as a result.
no code implementations • CVPR 2019 • Erik Wijmans, Samyak Datta, Oleksandr Maksymets, Abhishek Das, Georgia Gkioxari, Stefan Lee, Irfan Essa, Devi Parikh, Dhruv Batra
To help bridge the gap between internet vision-style problems and the goal of vision for embodied perception we instantiate a large-scale navigation task -- Embodied Question Answering [1] in photo-realistic environments (Matterport 3D).
no code implementations • 3 Jul 2019 • Aneeq Zia, Liheng Guo, Linlin Zhou, Irfan Essa, Anthony Jarc
Conclusions: We demonstrate that metrics-based evaluation of surgical activity recognition models is a viable approach to determine when models can be used to quantify surgical efficiencies.
no code implementations • 9 Jul 2019 • K. Niranjan Kumar, Irfan Essa, Sehoon Ha, C. Karen Liu
Using our method, we train a robotic arm to estimate the mass distribution of an object with moving parts (e.g., an articulated rigid body system) by pushing it on a surface with unknown friction properties.
no code implementations • ECCV 2020 • Hsin-Ying Lee, Lu Jiang, Irfan Essa, Phuong B Le, Haifeng Gong, Ming-Hsuan Yang, Weilong Yang
The first module predicts a graph with complete relations from a graph with user-specified relations.
no code implementations • ICLR 2020 • Erik Wijmans, Julian Straub, Irfan Essa, Dhruv Batra, Judy Hoffman, Ari Morcos
Surprisingly, we find that slight differences in task have no measurable effect on the visual representation for both SqueezeNet and ResNet architectures.
no code implementations • 12 Mar 2020 • Erik Wijmans, Julian Straub, Dhruv Batra, Irfan Essa, Judy Hoffman, Ari Morcos
Recent advances in deep reinforcement learning require a large amount of training data and generally result in representations that are over-specialized to the target task.
no code implementations • 9 Dec 2020 • Harish Haresamudram, Irfan Essa, Thomas Ploetz
Our work focuses on effective use of small amounts of labeled data and the opportunistic exploitation of unlabeled data that are straightforward to collect in mobile and ubiquitous computing scenarios.
no code implementations • 11 Dec 2020 • Erik Wijmans, Irfan Essa, Dhruv Batra
PointGoal navigation has seen significant recent interest and progress, spurred on by the Habitat platform and associated challenge.
no code implementations • 29 Mar 2021 • Dan Scarafoni, Irfan Essa, Thomas Ploetz
Action prediction focuses on anticipating actions before they happen.
no code implementations • 14 May 2021 • Nathan Frey, Peggy Chi, Weilong Yang, Irfan Essa
We propose an automatic approach that extracts editing styles in a source video and applies the edits to matched footage for video creation.
no code implementations • 7 Jun 2021 • AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo, Irfan Essa
In this paper we address the problem of automatically discovering atomic actions in an unsupervised manner from instructional videos, which are rarely annotated with atomic actions.
no code implementations • 28 Jun 2021 • AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo, Irfan Essa
In this paper we address the problem of automatically discovering atomic actions in an unsupervised manner from instructional videos.
no code implementations • 29 Sep 2021 • Karan Samel, Zelin Zhao, Binghong Chen, Shuang Li, Dharmashankar Subramanian, Irfan Essa, Le Song
Events across a timeline are a common data representation, seen in different temporal modalities.
no code implementations • 20 Nov 2021 • Apoorva Beedu, Zhile Ren, Varun Agrawal, Irfan Essa
We introduce a simple yet effective algorithm that uses convolutional neural networks to directly estimate object poses from videos.
no code implementations • 11 Feb 2022 • Karan Samel, Zelin Zhao, Binghong Chen, Shuang Li, Dharmashankar Subramanian, Irfan Essa, Le Song
Events across a timeline are a common data representation, seen in different temporal modalities.
no code implementations • 22 Feb 2022 • Harish Haresamudram, Irfan Essa, Thomas Plötz
As such, self-supervision, i.e., the paradigm of 'pretrain-then-finetune', has the potential to become a strong alternative to the predominant end-to-end training approaches, let alone hand-crafted features for the classic activity recognition chain.
no code implementations • 13 Oct 2022 • Daniel Scarafoni, Irfan Essa, Thomas Ploetz
We address dense action forecasting: the problem of predicting future action sequences over long durations from partial observations.
no code implementations • 26 Oct 2022 • Huda Alamri, Anthony Bilic, Michael Hu, Apoorva Beedu, Irfan Essa
Video-based dialog is a challenging multimodal learning task that has received increasing attention over the past few years, with state-of-the-art models setting new performance records.
no code implementations • 8 Nov 2022 • Hyeongju Choi, Apoorva Beedu, Harish Haresamudram, Irfan Essa
In this work, we propose a multi-modal framework that learns to effectively combine features from RGB Video and IMU sensors, and show its robustness for MMAct and UTD-MHAD datasets.
no code implementations • 17 Dec 2022 • K. Niranjan Kumar, Irfan Essa, Sehoon Ha
Real-world autonomous missions often require rich interaction with nearby objects, such as doors or switches, along with effective navigation.
no code implementations • 30 Jan 2023 • Erik Wijmans, Manolis Savva, Irfan Essa, Stefan Lee, Ari S. Morcos, Dhruv Batra
A positive answer to this question would (a) explain the surprising phenomenon in recent literature of ostensibly map-free neural-networks achieving strong performance, and (b) strengthen the evidence of mapping as a fundamental mechanism for navigation by intelligent embodied agents, whether they be biological or artificial.
no code implementations • 1 Jun 2023 • Kihyuk Sohn, Albert Shaw, Yuan Hao, Han Zhang, Luisa Polania, Huiwen Chang, Lu Jiang, Irfan Essa
We study domain-adaptive image synthesis, the problem of teaching pretrained image generative models a new style or concept from as few as one image to synthesize novel images, in order to better understand compositional image synthesis.
no code implementations • 1 Jun 2023 • Harish Haresamudram, Irfan Essa, Thomas Ploetz
Based on an extensive experimental evaluation on a suite of wearables-based benchmark HAR tasks, we demonstrate the potential of our learned discretization scheme and discuss how discretized sensor data analysis can lead to substantial changes in HAR.
no code implementations • 3 Sep 2023 • Hyeongju Choi, Apoorva Beedu, Irfan Essa
However, a major component of successful contrastive learning is the selection of good positive and negative samples.
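Why sample selection matters can be seen directly in the standard InfoNCE objective used in contrastive learning; below is a numpy sketch with illustrative embeddings, where the "bad" positive is deliberately anti-aligned with the anchor:

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Contrastive loss: pull the anchor toward its positive and push it
    away from the negatives. All inputs are L2-normalized vectors."""
    logits = np.array([anchor @ positive] + [anchor @ n for n in negatives])
    logits /= temperature
    # Cross-entropy with the positive as the correct class (index 0).
    return np.log(np.sum(np.exp(logits))) - logits[0]

def normalize(v):
    return v / np.linalg.norm(v)

rng = np.random.default_rng(1)
anchor = normalize(rng.normal(size=8))
negatives = [normalize(rng.normal(size=8)) for _ in range(5)]

good_positive = anchor                  # perfectly aligned positive pair
bad_positive = -anchor                  # a pathologically bad positive

loss_good = info_nce(anchor, good_positive, negatives)
loss_bad = info_nce(anchor, bad_positive, negatives)
```

With a well-chosen positive the loss is near zero; with a badly chosen one it is large regardless of the model, which is the failure mode that careful positive/negative selection is meant to avoid.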
no code implementations • 9 Oct 2023 • Lijun Yu, José Lezama, Nitesh B. Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, Alexander G. Hauptmann, Boqing Gong, Ming-Hsuan Yang, Irfan Essa, David A. Ross, Lu Jiang
While Large Language Models (LLMs) are the dominant models for generative tasks in language, they do not perform as well as diffusion models on image and video generation.
Ranked #2 on Video Prediction on Kinetics-600 12 frames, 64x64
no code implementations • 16 Oct 2023 • Tianle Huang, Nitish Sontakke, K. Niranjan Kumar, Irfan Essa, Stefanos Nikolaidis, Dennis W. Hong, Sehoon Ha
Domain randomization (DR), which entails training a policy with randomized dynamics, has proven to be a simple yet effective algorithm for reducing the gap between simulation and the real world.
no code implementations • NeurIPS 2023 • Lijun Yu, Yong Cheng, Zhiruo Wang, Vivek Kumar, Wolfgang Macherey, Yanping Huang, David A. Ross, Irfan Essa, Yonatan Bisk, Ming-Hsuan Yang, Kevin Murphy, Alexander G. Hauptmann, Lu Jiang
In this work, we introduce Semantic Pyramid AutoEncoder (SPAE) for enabling frozen LLMs to perform both understanding and generation tasks involving non-linguistic modalities such as images or videos.
no code implementations • 11 Dec 2023 • Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei, Irfan Essa, Lu Jiang, José Lezama
We present W.A.L.T, a transformer-based approach for photorealistic video generation via diffusion modeling.
Ranked #1 on Video Prediction on Kinetics-600 12 frames, 64x64
no code implementations • 21 Dec 2023 • Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, Krishna Somandepalli, Hassan Akbari, Yair Alon, Yong Cheng, Josh Dillon, Agrim Gupta, Meera Hahn, Anja Hauth, David Hendon, Alonso Martinez, David Minnen, Mikhail Sirotenko, Kihyuk Sohn, Xuan Yang, Hartwig Adam, Ming-Hsuan Yang, Irfan Essa, Huisheng Wang, David A. Ross, Bryan Seybold, Lu Jiang
We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals.
Ranked #3 on Text-to-Video Generation on MSR-VTT
no code implementations • 11 Jan 2024 • Seung Hyun Lee, Yinxiao Li, Junjie Ke, Innfarn Yoo, Han Zhang, Jiahui Yu, Qifei Wang, Fei Deng, Glenn Entis, Junfeng He, Gang Li, Sangpil Kim, Irfan Essa, Feng Yang
Additionally, Parrot employs a joint optimization approach for the T2I model and the prompt expansion network, facilitating the generation of quality-aware text prompts, thus further enhancing the final image quality.
no code implementations • 23 Jan 2024 • Apoorva Beedu, Karan Samel, Irfan Essa
Compared to existing methods, MAT has the advantage of learning additional environmental context from two kinds of text inputs: action descriptions during the pre-training stage, and the text inputs for detected objects and actions during modality feature fusion.
no code implementations • 19 Mar 2024 • Vincent Cartillier, Neha Jain, Irfan Essa
Its task is to detect and re-identify objects in 3D - e.g., a "sofa" moved from location A to B, a new "chair" in the second layout at location C, or a "lamp" from location D in the first layout missing in the second.