Search Results for author: Irfan Essa

Found 73 papers, 24 papers with code

DD-PPO: Learning Near-Perfect PointGoal Navigators from 2.5 Billion Frames

8 code implementations ICLR 2020 Erik Wijmans, Abhishek Kadian, Ari Morcos, Stefan Lee, Irfan Essa, Devi Parikh, Manolis Savva, Dhruv Batra

We leverage this scaling to train an agent for 2.5 Billion steps of experience (the equivalent of 80 years of human experience) -- over 6 months of GPU-time training in under 3 days of wall-clock time with 64 GPUs.

Autonomous Navigation Navigate +2

MaskSketch: Unpaired Structure-guided Masked Image Generation

2 code implementations CVPR 2023 Dina Bashkirova, Jose Lezama, Kihyuk Sohn, Kate Saenko, Irfan Essa

We show that intermediate self-attention maps of a masked generative transformer encode important structural information of the input image, such as scene layout and object shape, and we propose a novel sampling method based on this observation to enable structure-guided generation.

Conditional Image Generation Image-to-Image Translation +2

Improved Masked Image Generation with Token-Critic

1 code implementation 9 Sep 2022 José Lezama, Huiwen Chang, Lu Jiang, Irfan Essa

Given a masked-and-reconstructed real image, the Token-Critic model is trained to distinguish which visual tokens belong to the original image and which were sampled by the generative transformer.

Image Generation

Prompt-Free Diffusion: Taking "Text" out of Text-to-Image Diffusion Models

1 code implementation 25 May 2023 Xingqian Xu, Jiayi Guo, Zhangyang Wang, Gao Huang, Irfan Essa, Humphrey Shi

Text-to-image (T2I) research has grown explosively in the past year, owing to the large-scale pre-trained diffusion models and many emerging personalization and editing approaches.

Conditional Text-to-Image Synthesis Image Generation +3

Semantic MapNet: Building Allocentric Semantic Maps and Representations from Egocentric Views

1 code implementation 2 Oct 2020 Vincent Cartillier, Zhile Ren, Neha Jain, Stefan Lee, Irfan Essa, Dhruv Batra

We study the task of semantic mapping - specifically, an embodied agent (a robot or an egocentric AI assistant) is given a tour of a new environment and asked to build an allocentric top-down semantic map ("what is where?")

Representation Learning

Let's Dance: Learning From Online Dance Videos

1 code implementation 23 Jan 2018 Daniel Castro, Steven Hickson, Patsorn Sangkloy, Bhavishya Mittal, Sean Dai, James Hays, Irfan Essa

We present a comparison of numerous state-of-the-art techniques on our dataset using three different representations (video, optical flow and multi-person pose data) in order to analyze these approaches.

Action Recognition Optical Flow Estimation +1

Visual Prompt Tuning for Generative Transfer Learning

1 code implementation CVPR 2023 Kihyuk Sohn, Yuan Hao, José Lezama, Luisa Polania, Huiwen Chang, Han Zhang, Irfan Essa, Lu Jiang

We base our framework on state-of-the-art generative vision transformers that represent an image as a sequence of visual tokens fed to autoregressive or non-autoregressive transformers.

Image Generation Transfer Learning +1

Efficient Hierarchical Graph-Based Segmentation of RGBD Videos

1 code implementation CVPR 2014 Steven Hickson, Stan Birchfield, Irfan Essa, Henrik Christensen

We present an efficient and scalable algorithm for segmenting 3D RGBD point clouds by combining depth, color, and temporal information using a multistage, hierarchical graph-based approach.

Clustering Graph Matching +2

Video based Object 6D Pose Estimation using Transformers

1 code implementation 24 Oct 2022 Apoorva Beedu, Huda Alamri, Irfan Essa

We introduce a Transformer based 6D Object Pose Estimation framework VideoPose, comprising an end-to-end attention based modelling architecture, that attends to previous frames in order to estimate accurate 6D Object Poses in videos.

6D Pose Estimation 6D Pose Estimation using RGB +1

Floors are Flat: Leveraging Semantics for Real-Time Surface Normal Prediction

1 code implementation 16 Jun 2019 Steven Hickson, Karthik Raveendran, Alireza Fathi, Kevin Murphy, Irfan Essa

We propose 4 insights that help to significantly improve the performance of deep learning models that predict surface normals and semantic labels from a single RGB image.

Ranked #1 on Semantic Segmentation on ScanNetV2 (Pixel Accuracy metric)

Semantic Segmentation Surface Normals Estimation +1

Text as Neural Operator: Image Manipulation by Text Instruction

1 code implementation 11 Aug 2020 Tianhao Zhang, Hung-Yu Tseng, Lu Jiang, Weilong Yang, Honglak Lee, Irfan Essa

In recent years, text-guided image manipulation has gained increasing attention in the multimedia and computer vision community.

Conditional Image Generation Image Captioning +2

BLT: Bidirectional Layout Transformer for Controllable Layout Generation

1 code implementation 9 Dec 2021 Xiang Kong, Lu Jiang, Huiwen Chang, Han Zhang, Yuan Hao, Haifeng Gong, Irfan Essa

During inference, BLT first generates a draft layout from the input and then iteratively refines it into a high-quality layout by masking out low-confidence attributes.

Object category learning and retrieval with weak supervision

1 code implementation 26 Jan 2018 Steven Hickson, Anelia Angelova, Irfan Essa, Rahul Sukthankar

We consider the problem of retrieving objects from image data and learning to classify them into meaningful semantic categories with minimal supervision.

Clustering Deep Clustering +2

Semantic Instance Labeling Leveraging Hierarchical Segmentation

1 code implementation 2 Aug 2017 Steven Hickson, Irfan Essa, Henrik Christensen

Most of the approaches for indoor RGBD semantic labeling focus on using pixels or superpixels to train a classifier.

Segmentation Superpixels

Investigating Enhancements to Contrastive Predictive Coding for Human Activity Recognition

1 code implementation 11 Nov 2022 Harish Haresamudram, Irfan Essa, Thomas Ploetz

The dichotomy between the challenging nature of obtaining annotations for activities, and the more straightforward nature of data collection from wearables, has resulted in significant interest in the development of techniques that utilize large quantities of unlabeled data for learning representations.

Human Activity Recognition Time Series +1

Text and Click inputs for unambiguous open vocabulary instance segmentation

1 code implementation 24 Nov 2023 Nikolai Warner, Meera Hahn, Jonathan Huang, Irfan Essa, Vighnesh Birodkar

We propose a new segmentation process, Text + Click segmentation, where a model takes as input an image, a text phrase describing a class to segment, and a single foreground click specifying the instance to segment.

Instance Segmentation Segmentation +1

Automated Surgical Skill Assessment in RMIS Training

no code implementations 22 Dec 2017 Aneeq Zia, Irfan Essa

In this paper, we explore the usage of different holistic features for automated skill assessment using only robot kinematic data and propose a weighted feature fusion technique for improving score prediction performance.

General Classification

Eyemotion: Classifying facial expressions in VR using eye-tracking cameras

no code implementations 22 Jul 2017 Steven Hickson, Nick Dufour, Avneesh Sud, Vivek Kwatra, Irfan Essa

One of the main challenges of social interaction in virtual reality settings is that head-mounted displays occlude a large portion of the face, blocking facial expressions and thereby restricting social engagement cues among users.

Blocking

Video and Accelerometer-Based Motion Analysis for Automated Surgical Skills Assessment

no code implementations 24 Feb 2017 Aneeq Zia, Yachna Sharma, Vinay Bettadapura, Eric L. Sarin, Irfan Essa

Methods: We conduct the largest study, to the best of our knowledge, for basic surgical skills assessment on a dataset that contained video and accelerometer data for suturing and knot-tying tasks.

Skills Assessment Time Series Analysis

Complex Event Recognition from Images with Few Training Examples

no code implementations 17 Jan 2017 Unaiza Ahsan, Chen Sun, James Hays, Irfan Essa

We propose to leverage concept-level representations for complex event recognition in photographs given limited training examples.

Discovering Picturesque Highlights from Egocentric Vacation Videos

no code implementations 18 Jan 2016 Vinay Bettadapura, Daniel Castro, Irfan Essa

We present an approach for identifying picturesque highlights from large amounts of egocentric video data.

Highlight Detection

Finding Temporally Consistent Occlusion Boundaries in Videos using Geometric Context

no code implementations 25 Oct 2015 S. Hussain Raza, Ahmad Humayun, Matthias Grundmann, David Anderson, Irfan Essa

Our proposed framework provides an efficient approach for finding temporally consistent occlusion boundaries in video by utilizing causality, redundancy in videos, and semantic layout of the scene.

Geometric Context from Videos

no code implementations CVPR 2013 S. Hussain Raza, Matthias Grundmann, Irfan Essa

We present a novel algorithm for estimating the broad 3D geometric structure of outdoor video scenes.

Segmentation Video Segmentation +1

Depth Extraction from Videos Using Geometric Context and Occlusion Boundaries

no code implementations 25 Oct 2015 S. Hussain Raza, Omar Javed, Aveek Das, Harpreet Sawhney, Hui Cheng, Irfan Essa

We propose to learn and infer depth in videos from appearance, motion, occlusion boundaries, and geometric context of the scene.

Depth Estimation Pose Estimation

Leveraging Context to Support Automated Food Recognition in Restaurants

no code implementations 7 Oct 2015 Vinay Bettadapura, Edison Thomaz, Aman Parnami, Gregory Abowd, Irfan Essa

The pervasiveness of mobile cameras has resulted in a dramatic increase in food photos, which are pictures reflecting what people eat.

Food Recognition

Egocentric Field-of-View Localization Using First-Person Point-of-View Devices

no code implementations 7 Oct 2015 Vinay Bettadapura, Irfan Essa, Caroline Pantofaru

We present a technique that uses images, videos and sensor data taken from first-person point-of-view devices to perform egocentric field-of-view (FOV) localization.

Augmenting Bag-of-Words: Data-Driven Discovery of Temporal and Structural Information for Activity Recognition

no code implementations CVPR 2013 Vinay Bettadapura, Grant Schindler, Thomaz Plotz, Irfan Essa

We present data-driven techniques to augment Bag of Words (BoW) models, which allow for more robust modeling and recognition of complex long-term activities, especially when the structure and topology of the activities are not known a priori.

Activity Recognition

Predicting Daily Activities From Egocentric Images Using Deep Learning

no code implementations 6 Oct 2015 Daniel Castro, Steven Hickson, Vinay Bettadapura, Edison Thomaz, Gregory Abowd, Henrik Christensen, Irfan Essa

We collected a dataset of 40,103 egocentric images over a 6-month period with 19 activity classes and demonstrate the benefit of state-of-the-art deep learning techniques for learning and predicting daily activities.

Classification General Classification

Beyond Sentiment: The Manifold of Human Emotions

no code implementations 8 Feb 2012 Seungyeon Kim, Fuxin Li, Guy Lebanon, Irfan Essa

Sentiment analysis predicts the presence of positive or negative emotions in a text document.

Sentiment Analysis

Video Jigsaw: Unsupervised Learning of Spatiotemporal Context for Video Action Recognition

no code implementations 22 Aug 2018 Unaiza Ahsan, Rishi Madhok, Irfan Essa

We propose a self-supervised learning method to jointly reason about spatial and temporal context for video recognition.

Action Recognition Optical Flow Estimation +5

Unbiasing Semantic Segmentation For Robot Perception using Synthetic Data Feature Transfer

no code implementations 11 Sep 2018 Jonathan C Balloch, Varun Agrawal, Irfan Essa, Sonia Chernova

We show that pretraining real-time segmentation architectures with synthetic segmentation data instead of ImageNet improves fine-tuning performance by reducing the bias learned in pretraining and closing the "transfer gap" as a result.

Image Segmentation Segmentation +1

Embodied Question Answering in Photorealistic Environments with Point Cloud Perception

no code implementations CVPR 2019 Erik Wijmans, Samyak Datta, Oleksandr Maksymets, Abhishek Das, Georgia Gkioxari, Stefan Lee, Irfan Essa, Devi Parikh, Dhruv Batra

To help bridge the gap between internet vision-style problems and the goal of vision for embodied perception we instantiate a large-scale navigation task -- Embodied Question Answering [1] in photo-realistic environments (Matterport 3D).

Embodied Question Answering Question Answering

Novel evaluation of surgical activity recognition models using task-based efficiency metrics

no code implementations 3 Jul 2019 Aneeq Zia, Liheng Guo, Linlin Zhou, Irfan Essa, Anthony Jarc

Conclusions: We demonstrate that metrics-based evaluation of surgical activity recognition models is a viable approach to determine when models can be used to quantify surgical efficiencies.

Activity Recognition

Estimating Mass Distribution of Articulated Objects using Non-prehensile Manipulation

no code implementations 9 Jul 2019 K. Niranjan Kumar, Irfan Essa, Sehoon Ha, C. Karen Liu

Using our method, we train a robotic arm to estimate the mass distribution of an object with moving parts (e.g., an articulated rigid body system) by pushing it on a surface with unknown friction properties.

Friction Object

Insights on Visual Representations for Embodied Navigation Tasks

no code implementations ICLR 2020 Erik Wijmans, Julian Straub, Irfan Essa, Dhruv Batra, Judy Hoffman, Ari Morcos

Surprisingly, we find that slight differences in task have no measurable effect on the visual representation for both SqueezeNet and ResNet architectures.

Analyzing Visual Representations in Embodied Navigation Tasks

no code implementations 12 Mar 2020 Erik Wijmans, Julian Straub, Dhruv Batra, Irfan Essa, Judy Hoffman, Ari Morcos

Recent advances in deep reinforcement learning require a large amount of training data and generally result in representations that are often over specialized to the target task.

Reinforcement Learning (RL)

Contrastive Predictive Coding for Human Activity Recognition

no code implementations 9 Dec 2020 Harish Haresamudram, Irfan Essa, Thomas Ploetz

Our work focuses on effective use of small amounts of labeled data and the opportunistic exploitation of unlabeled data that are straightforward to collect in mobile and ubiquitous computing scenarios.

Human Activity Recognition

How to Train PointGoal Navigation Agents on a (Sample and Compute) Budget

no code implementations 11 Dec 2020 Erik Wijmans, Irfan Essa, Dhruv Batra

PointGoal navigation has seen significant recent interest and progress, spurred on by the Habitat platform and associated challenge.

PointGoal Navigation

Automatic Non-Linear Video Editing Transfer

no code implementations 14 May 2021 Nathan Frey, Peggy Chi, Weilong Yang, Irfan Essa

We propose an automatic approach that extracts editing styles in a source video and applies the edits to matched footage for video creation.

Video Editing

Unsupervised Action Segmentation for Instructional Videos

no code implementations 7 Jun 2021 AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo, Irfan Essa

In this paper we address the problem of automatically discovering atomic actions in an unsupervised manner from instructional videos, which are rarely annotated with atomic actions.

Action Segmentation Segmentation

Unsupervised Discovery of Actions in Instructional Videos

no code implementations 28 Jun 2021 AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo, Irfan Essa

In this paper we address the problem of automatically discovering atomic actions in an unsupervised manner from instructional videos.

Neural Temporal Logic Programming

no code implementations 29 Sep 2021 Karan Samel, Zelin Zhao, Binghong Chen, Shuang Li, Dharmashankar Subramanian, Irfan Essa, Le Song

Events across a timeline are a common data representation, seen in different temporal modalities.

VideoPose: Estimating 6D object pose from videos

no code implementations 20 Nov 2021 Apoorva Beedu, Zhile Ren, Varun Agrawal, Irfan Essa

We introduce a simple yet effective algorithm that uses convolutional neural networks to directly estimate object poses from videos.

Object Pose Estimation

Learning Temporal Rules from Noisy Timeseries Data

no code implementations 11 Feb 2022 Karan Samel, Zelin Zhao, Binghong Chen, Shuang Li, Dharmashankar Subramanian, Irfan Essa, Le Song

Events across a timeline are a common data representation, seen in different temporal modalities.

Assessing the State of Self-Supervised Human Activity Recognition using Wearables

no code implementations 22 Feb 2022 Harish Haresamudram, Irfan Essa, Thomas Plötz

As such, self-supervision, i.e., the paradigm of 'pretrain-then-finetune', has the potential to become a strong alternative to the predominant end-to-end training approaches, let alone hand-crafted features for the classic activity recognition chain.

Domain Adaptation Human Activity Recognition +1

Finding Islands of Predictability in Action Forecasting

no code implementations 13 Oct 2022 Daniel Scarafoni, Irfan Essa, Thomas Ploetz

We address dense action forecasting: the problem of predicting a future action sequence over long durations based on partial observation.

End-to-End Multimodal Representation Learning for Video Dialog

no code implementations 26 Oct 2022 Huda Alamri, Anthony Bilic, Michael Hu, Apoorva Beedu, Irfan Essa

The video-based dialog task is a challenging multimodal learning task that has received increasing attention over the past few years, with state-of-the-art methods setting new performance records.

Representation Learning Retrieval

Multi-Stage Based Feature Fusion of Multi-Modal Data for Human Activity Recognition

no code implementations 8 Nov 2022 Hyeongju Choi, Apoorva Beedu, Harish Haresamudram, Irfan Essa

In this work, we propose a multi-modal framework that learns to effectively combine features from RGB Video and IMU sensors, and show its robustness on the MMAct and UTD-MHAD datasets.

Human Activity Recognition

Cascaded Compositional Residual Learning for Complex Interactive Behaviors

no code implementations 17 Dec 2022 K. Niranjan Kumar, Irfan Essa, Sehoon Ha

Real-world autonomous missions often require rich interaction with nearby objects, such as doors or switches, along with effective navigation.

Emergence of Maps in the Memories of Blind Navigation Agents

no code implementations 30 Jan 2023 Erik Wijmans, Manolis Savva, Irfan Essa, Stefan Lee, Ari S. Morcos, Dhruv Batra

A positive answer to this question would (a) explain the surprising phenomenon in recent literature of ostensibly map-free neural-networks achieving strong performance, and (b) strengthen the evidence of mapping as a fundamental mechanism for navigation by intelligent embodied agents, whether they be biological or artificial.

Inductive Bias PointGoal Navigation

Learning Disentangled Prompts for Compositional Image Synthesis

no code implementations 1 Jun 2023 Kihyuk Sohn, Albert Shaw, Yuan Hao, Han Zhang, Luisa Polania, Huiwen Chang, Lu Jiang, Irfan Essa

We study domain-adaptive image synthesis, the problem of teaching pretrained image generative models a new style or concept from as few as one image to synthesize novel images, to better understand compositional image synthesis.

Domain Adaptation Image Generation +1

Towards Learning Discrete Representations via Self-Supervision for Wearables-Based Human Activity Recognition

no code implementations 1 Jun 2023 Harish Haresamudram, Irfan Essa, Thomas Ploetz

Based on an extensive experimental evaluation on a suite of wearables-based benchmark HAR tasks, we demonstrate the potential of our learned discretization scheme and discuss how discretized sensor data analysis can lead to substantial changes in HAR.

Human Activity Recognition Quantization

BayRnTune: Adaptive Bayesian Domain Randomization via Strategic Fine-tuning

no code implementations 16 Oct 2023 Tianle Huang, Nitish Sontakke, K. Niranjan Kumar, Irfan Essa, Stefanos Nikolaidis, Dennis W. Hong, Sehoon Ha

Domain randomization (DR), which entails training a policy with randomized dynamics, has proven to be a simple yet effective algorithm for reducing the gap between simulation and the real world.

SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs

no code implementations NeurIPS 2023 Lijun Yu, Yong Cheng, Zhiruo Wang, Vivek Kumar, Wolfgang Macherey, Yanping Huang, David A. Ross, Irfan Essa, Yonatan Bisk, Ming-Hsuan Yang, Kevin Murphy, Alexander G. Hauptmann, Lu Jiang

In this work, we introduce Semantic Pyramid AutoEncoder (SPAE) for enabling frozen LLMs to perform both understanding and generation tasks involving non-linguistic modalities such as images or videos.

In-Context Learning multimodal generation

Parrot: Pareto-optimal Multi-Reward Reinforcement Learning Framework for Text-to-Image Generation

no code implementations 11 Jan 2024 Seung Hyun Lee, Yinxiao Li, Junjie Ke, Innfarn Yoo, Han Zhang, Jiahui Yu, Qifei Wang, Fei Deng, Glenn Entis, Junfeng He, Gang Li, Sangpil Kim, Irfan Essa, Feng Yang

Additionally, Parrot employs a joint optimization approach for the T2I model and the prompt expansion network, facilitating the generation of quality-aware text prompts, thus further enhancing the final image quality.

Reinforcement Learning (RL) Text-to-Image Generation

On the Efficacy of Text-Based Input Modalities for Action Anticipation

no code implementations 23 Jan 2024 Apoorva Beedu, Karan Samel, Irfan Essa

Compared to existing methods, MAT has the advantage of learning additional environmental context from two kinds of text inputs: action descriptions during the pre-training stage, and the text inputs for detected objects and actions during modality feature fusion.

Action Anticipation

3D Semantic MapNet: Building Maps for Multi-Object Re-Identification in 3D

no code implementations 19 Mar 2024 Vincent Cartillier, Neha Jain, Irfan Essa

Its task is to detect and re-identify objects in 3D - e.g., a "sofa" moved from location A to B, a new "chair" in the second layout at location C, or a "lamp" from location D in the first layout missing in the second.

Object
