Search Results for author: Rohit Girdhar

Found 38 papers, 24 papers with code

SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos

no code implementations • 8 Apr 2024 • Changan Chen, Kumar Ashutosh, Rohit Girdhar, David Harwath, Kristen Grauman

We propose a novel self-supervised embedding to learn how actions sound from narrated in-the-wild egocentric videos.

Paper
Add Code

InstanceDiffusion: Instance-level Control for Image Generation

1 code implementation • 5 Feb 2024 • Xudong Wang, Trevor Darrell, Sai Saketh Rambhatla, Rohit Girdhar, Ishan Misra

Text-to-image diffusion models produce high quality images but do not offer control over individual instances in the image.

Ranked #3 on Conditional Text-to-Image Synthesis on COCO-MIG

Conditional Text-to-Image Synthesis Image Generation +2

291

Paper
Code

Generating Illustrated Instructions

1 code implementation • 7 Dec 2023 • Sachit Menon, Ishan Misra, Rohit Girdhar

We introduce the new task of generating Illustrated Instructions, i. e., visual instructions customized to a user's needs.

Text-to-Image Generation

Paper
Code

Motion-Conditioned Image Animation for Video Editing

no code implementations • 30 Nov 2023 • Wilson Yan, Andrew Brown, Pieter Abbeel, Rohit Girdhar, Samaneh Azadi

We introduce MoCA, a Motion-Conditioned Image Animation approach for video editing.

Image Animation Video Editing

Paper
Add Code

Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning

no code implementations • 17 Nov 2023 • Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, Ishan Misra

We present Emu Video, a text-to-video generation model that factorizes the generation into two steps: first generating an image conditioned on the text, and then generating a video conditioned on the text and the generated image.

Text-to-Video Generation Video Generation

Paper
Add Code

VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation

1 code implementation • 28 Aug 2023 • Xudong Wang, Ishan Misra, Ziyun Zeng, Rohit Girdhar, Trevor Darrell

Existing approaches to unsupervised video instance segmentation typically rely on motion estimates and experience difficulties tracking small or divergent motions.

Instance Segmentation Optical Flow Estimation +5

864

Paper
Code

ImageBind: One Embedding Space To Bind Them All

1 code implementation • CVPR 2023 • Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra

We show that all combinations of paired data are not necessary to train such a joint embedding, and only image-paired data is sufficient to bind the modalities together.

Ranked #2 on Zero-shot Classification (unified classes) on LLVIP

Cross-Modal Retrieval Retrieval +7

7,853

Paper
Code

The effectiveness of MAE pre-pretraining for billion-scale pretraining

1 code implementation • ICCV 2023 • Mannat Singh, Quentin Duval, Kalyan Vasudev Alwala, Haoqi Fan, Vaibhav Aggarwal, Aaron Adcock, Armand Joulin, Piotr Dollár, Christoph Feichtenhofer, Ross Girshick, Rohit Girdhar, Ishan Misra

While MAE has only been shown to scale with the size of models, we find that it scales with the size of the training dataset as well.

Ranked #1 on Few-Shot Image Classification on ImageNet - 10-shot (using extra training data)

Action Classification Action Recognition +6

Paper
Code

Learning to Substitute Ingredients in Recipes

1 code implementation • 15 Feb 2023 • Bahare Fatemi, Quentin Duval, Rohit Girdhar, Michal Drozdzal, Adriana Romero-Soriano

Recipe personalization through ingredient substitution has the potential to help people meet their dietary needs and preferences, avoid potential allergens, and ease culinary exploration in everyone's kitchen.

Recipe Generation

Paper
Code

Cut and Learn for Unsupervised Object Detection and Instance Segmentation

2 code implementations • CVPR 2023 • Xudong Wang, Rohit Girdhar, Stella X. Yu, Ishan Misra

We propose Cut-and-LEaRn (CutLER), a simple approach for training unsupervised object detection and segmentation models.

Ranked #1 on Unsupervised Instance Segmentation on UVO

Instance Segmentation object-detection +5

864

Paper
Code

HierVL: Learning Hierarchical Video-Language Embeddings

no code implementations • CVPR 2023 • Kumar Ashutosh, Rohit Girdhar, Lorenzo Torresani, Kristen Grauman

Video-language embeddings are a promising avenue for injecting semantics into visual representations, but existing methods capture only short-term associations between seconds-long video clips and their accompanying text.

Ranked #3 on Action Recognition on Charades-Ego

Action Classification Action Recognition +3

Paper
Add Code

What You Say Is What You Show: Visual Narration Detection in Instructional Videos

no code implementations • 5 Jan 2023 • Kumar Ashutosh, Rohit Girdhar, Lorenzo Torresani, Kristen Grauman

Narrated ''how-to'' videos have emerged as a promising data source for a wide range of learning problems, from learning visual representations to training robot policies.

Paper
Add Code

Learning Video Representations from Large Language Models

2 code implementations • CVPR 2023 • Yue Zhao, Ishan Misra, Philipp Krähenbühl, Rohit Girdhar

We introduce LaViLa, a new approach to learning video-language representations by leveraging Large Language Models (LLMs).

Ranked #1 on Action Recognition on Charades-Ego

Action Classification Action Recognition +2

434

Paper
Code

OmniMAE: Single Model Masked Pretraining on Images and Videos

1 code implementation • CVPR 2023 • Rohit Girdhar, Alaaeldin El-Nouby, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra

Furthermore, this model can be learned by dropping 90% of the image and 95% of the video patches, enabling extremely fast training of huge model architectures.

544

Paper
Code

Omnivore: A Single Model for Many Visual Modalities

2 code implementations • CVPR 2022 • Rohit Girdhar, Mannat Singh, Nikhila Ravi, Laurens van der Maaten, Armand Joulin, Ishan Misra

Prior work has studied different visual modalities in isolation and developed separate architectures for recognition of images, videos, and 3D data.

Ranked #1 on Scene Recognition on SUN-RGBD (using extra training data)

Action Classification Action Recognition +3

2,972

Paper
Code

Detecting Twenty-thousand Classes using Image-level Supervision

1 code implementation • 7 Jan 2022 • Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp Krähenbühl, Ishan Misra

For the first time, we train a detector with all the twenty-one-thousand classes of the ImageNet dataset and show that it generalizes to new datasets without finetuning.

Ranked #2 on Open Vocabulary Object Detection on OpenImages-v4

Image Classification Open Vocabulary Object Detection

1,762

Paper
Code

Mask2Former for Video Instance Segmentation

5 code implementations • 20 Dec 2021 • Bowen Cheng, Anwesa Choudhuri, Ishan Misra, Alexander Kirillov, Rohit Girdhar, Alexander G. Schwing

We find Mask2Former also achieves state-of-the-art performance on video instance segmentation without modifying the architecture, the loss or even the training pipeline.

Ranked #14 on Video Instance Segmentation on YouTube-VIS validation

Image Segmentation Instance Segmentation +5

124,527

Paper
Code

Masked-attention Mask Transformer for Universal Image Segmentation

6 code implementations • CVPR 2022 • Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar

While only the semantics of each task differ, current research focuses on designing specialized architectures for each task.

Ranked #3 on Semantic Segmentation on Mapillary val

Image Segmentation Instance Segmentation +3

124,527

Paper
Code

Ego4D: Around the World in 3,000 Hours of Egocentric Video

6 code implementations • CVPR 2022 • Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, Sean Crane, Tien Do, Morrie Doulaty, Akshay Erapalli, Christoph Feichtenhofer, Adriano Fragomeni, Qichen Fu, Abrham Gebreselasie, Cristina Gonzalez, James Hillis, Xuhua Huang, Yifei HUANG, Wenqi Jia, Weslie Khoo, Jachym Kolar, Satwik Kottur, Anurag Kumar, Federico Landini, Chao Li, Yanghao Li, Zhenqiang Li, Karttikeya Mangalam, Raghava Modhugu, Jonathan Munro, Tullie Murrell, Takumi Nishiyasu, Will Price, Paola Ruiz Puentes, Merey Ramazanova, Leda Sari, Kiran Somasundaram, Audrey Southerland, Yusuke Sugano, Ruijie Tao, Minh Vo, Yuchen Wang, Xindi Wu, Takuma Yagi, Ziwei Zhao, Yunyi Zhu, Pablo Arbelaez, David Crandall, Dima Damen, Giovanni Maria Farinella, Christian Fuegen, Bernard Ghanem, Vamsi Krishna Ithapu, C. V. Jawahar, Hanbyul Joo, Kris Kitani, Haizhou Li, Richard Newcombe, Aude Oliva, Hyun Soo Park, James M. Rehg, Yoichi Sato, Jianbo Shi, Mike Zheng Shou, Antonio Torralba, Lorenzo Torresani, Mingfei Yan, Jitendra Malik

We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite.

De-identification Ethics

4,978

Paper
Code

An End-to-End Transformer Model for 3D Object Detection

1 code implementation • ICCV 2021 • Ishan Misra, Rohit Girdhar, Armand Joulin

We propose 3DETR, an end-to-end Transformer based object detection model for 3D point clouds.

Ranked #18 on 3D Object Detection on ScanNetV2

3D Object Detection object-detection

587

Paper
Code

Anticipative Video Transformer

1 code implementation • ICCV 2021 • Rohit Girdhar, Kristen Grauman

We propose Anticipative Video Transformer (AVT), an end-to-end attention-based video modeling architecture that attends to the previously observed video in order to anticipate future actions.

Ranked #2 on Action Anticipation on EPIC-KITCHENS-100 (test) (using extra training data)

Action Anticipation

152

Paper
Code

3D Spatial Recognition without Spatially Labeled 3D

1 code implementation • CVPR 2021 • Zhongzheng Ren, Ishan Misra, Alexander G. Schwing, Rohit Girdhar

We introduce WyPR, a Weakly-supervised framework for Point cloud Recognition, requiring only scene-level class tags as supervision.

3D Object Detection Multiple Instance Learning +5

Paper
Code

Physical Reasoning Using Dynamics-Aware Models

1 code implementation • 20 Feb 2021 • Eltayeb Ahmed, Anton Bakhtin, Laurens van der Maaten, Rohit Girdhar

A common approach to solving physical reasoning tasks is to train a value learner on example tasks.

Ranked #1 on Visual Reasoning on PHYRE-1B-Within

Visual Reasoning

Paper
Code

Self-Supervised Pretraining of 3D Features on any Point-Cloud

1 code implementation • ICCV 2021 • Zaiwei Zhang, Rohit Girdhar, Armand Joulin, Ishan Misra

Pretraining on large labeled datasets is a prerequisite to achieve good performance in many computer vision tasks like 2D object recognition, video classification etc.

Object object-detection +4

263

Paper
Code

Forward Prediction for Physical Reasoning

1 code implementation • 18 Jun 2020 • Rohit Girdhar, Laura Gustafson, Aaron Adcock, Laurens van der Maaten

Physical reasoning requires forward prediction: the ability to forecast what will happen next given some initial world state.

Ranked #2 on Visual Reasoning on PHYRE-1B-Within

Visual Reasoning

Paper
Code

Video Understanding as Machine Translation

no code implementations • 12 Jun 2020 • Bruno Korbar, Fabio Petroni, Rohit Girdhar, Lorenzo Torresani

With the advent of large-scale multimodal video datasets, especially sequences with audio or transcribed speech, there has been a growing interest in self-supervised learning of video representations.

Machine Translation Metric Learning +6

Paper
Add Code

CATER: A diagnostic dataset for Compositional Actions & TEmporal Reasoning

no code implementations • ICLR 2020 • Rohit Girdhar, Deva Ramanan

In this work, we build a video dataset with fully observable and controllable object and scene bias, and which truly requires spatiotemporal understanding in order to be solved.

Object Video Understanding

Paper
Add Code

Are we asking the right questions in MovieQA?

no code implementations • 8 Nov 2019 • Bhavan Jasani, Rohit Girdhar, Deva Ramanan

Joint vision and language tasks like visual question answering are fascinating because they explore high-level understanding, but at the same time, can be more prone to language biases.

Question Answering Visual Question Answering

Paper
Add Code

MetaPix: Few-Shot Video Retargeting

no code implementations • ICLR 2020 • Jessica Lee, Deva Ramanan, Rohit Girdhar

We address the task of unsupervised retargeting of human actions from one video to another.

Meta-Learning

Paper
Add Code

CATER: A diagnostic dataset for Compositional Actions and TEmporal Reasoning

2 code implementations • 10 Oct 2019 • Rohit Girdhar, Deva Ramanan

In this work, we build a video dataset with fully observable and controllable object and scene bias, and which truly requires spatiotemporal understanding in order to be solved.

Object Video Object Tracking +1

103

Paper
Code

DistInit: Learning Video Representations Without a Single Labeled Video

no code implementations • ICCV 2019 • Rohit Girdhar, Du Tran, Lorenzo Torresani, Deva Ramanan

In this work, we propose an alternative approach to learning video representations that require no semantically labeled videos and instead leverages the years of effort in collecting and labeling large and clean still-image datasets.

Ranked #72 on Action Recognition on HMDB-51 (using extra training data)

Action Recognition Temporal Action Localization +1

Paper
Add Code

Video Action Transformer Network

no code implementations • CVPR 2019 • Rohit Girdhar, João Carreira, Carl Doersch, Andrew Zisserman

We introduce the Action Transformer model for recognizing and localizing human actions in video clips.

Ranked #6 on Action Recognition on AVA v2.1

Action Recognition Recognizing And Localizing Human Actions

Paper
Add Code

A Better Baseline for AVA

no code implementations • 26 Jul 2018 • Rohit Girdhar, João Carreira, Carl Doersch, Andrew Zisserman

We introduce a simple baseline for action localization on the AVA dataset.

Ranked #12 on Action Recognition on AVA v2.1

Action Localization Action Recognition

Paper
Add Code

Binge Watching: Scaling Affordance Learning from Sitcoms

no code implementations • CVPR 2017 • Xiaolong Wang, Rohit Girdhar, Abhinav Gupta

In this paper, we tackle the challenge of creating one of the biggest dataset for learning affordances.

Paper
Add Code

Detect-and-Track: Efficient Pose Estimation in Videos

1 code implementation • CVPR 2018 • Rohit Girdhar, Georgia Gkioxari, Lorenzo Torresani, Manohar Paluri, Du Tran

This paper addresses the problem of estimating and tracking human body keypoints in complex, multi-person video.

Ranked #8 on Pose Tracking on PoseTrack2017 (using extra training data)

Human Detection Keypoint Estimation +4

1,002

Paper
Code

Attentional Pooling for Action Recognition

1 code implementation • NeurIPS 2017 • Rohit Girdhar, Deva Ramanan

We introduce a simple yet surprisingly powerful model to incorporate attention in action recognition and human object interaction tasks.

Ranked #7 on Human-Object Interaction Detection on HICO

Action Recognition Human-Object Interaction Detection +1

258

Paper
Code

ActionVLAD: Learning spatio-temporal aggregation for action classification

no code implementations • CVPR 2017 • Rohit Girdhar, Deva Ramanan, Abhinav Gupta, Josef Sivic, Bryan Russell

In this work, we introduce a new video representation for action classification that aggregates local convolutional features across the entire spatio-temporal extent of the video.

Ranked #8 on Long-video Activity Recognition on Breakfast

Action Classification Classification +3

Paper
Add Code

Learning a Predictable and Generative Vector Representation for Objects

2 code implementations • 29 Mar 2016 • Rohit Girdhar, David F. Fouhey, Mikel Rodriguez, Abhinav Gupta

The network consists of two components: (a) an autoencoder that ensures the representation is generative; and (b) a convolutional network that ensures the representation is predictable.

Retrieval

Paper
Code

Cannot find the paper you are looking for? You can Submit a new open access paper.