1 code implementation • CVPR 2023 • Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra
We show that not all combinations of paired data are necessary to train such a joint embedding; image-paired data alone is sufficient to bind the modalities together.
Ranked #6 on Zero-Shot Video Retrieval on MSR-VTT
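The binding idea is easy to sketch: each non-image modality is trained contrastively against image embeddings only, and cross-modal alignment between the non-image modalities emerges for free. Below is a minimal, hypothetical PyTorch illustration, not the authors' released code; all names (`img_enc`, `audio_enc`, `depth_enc`, the batch keys) are assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of paired embeddings."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature                  # (B, B) similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def training_step(img_enc, audio_enc, depth_enc, batch):
    # Only (image, audio) and (image, depth) pairs are used; audio and
    # depth are never paired directly, yet both are bound through the
    # shared image embedding space.
    loss = info_nce(img_enc(batch["images_a"]), audio_enc(batch["audio"]))
    loss = loss + info_nce(img_enc(batch["images_d"]), depth_enc(batch["depth"]))
    return loss
```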
no code implementations • 23 Mar 2023 • Mannat Singh, Quentin Duval, Kalyan Vasudev Alwala, Haoqi Fan, Vaibhav Aggarwal, Aaron Adcock, Armand Joulin, Piotr Dollár, Christoph Feichtenhofer, Ross Girshick, Rohit Girdhar, Ishan Misra
While MAE has only been shown to scale with the size of models, we find that it scales with the size of the training dataset as well.
Ranked #1 on Zero-Shot Transfer Image Classification on Food-101 (using extra training data)
1 code implementation • 15 Feb 2023 • Bahare Fatemi, Quentin Duval, Rohit Girdhar, Michal Drozdzal, Adriana Romero-Soriano
Recipe personalization through ingredient substitution has the potential to help people meet their dietary needs and preferences, avoid potential allergens, and ease culinary exploration in everyone's kitchen.
1 code implementation • CVPR 2023 • Xudong Wang, Rohit Girdhar, Stella X. Yu, Ishan Misra
We propose Cut-and-LEaRn (CutLER), a simple approach for training unsupervised object detection and segmentation models.
Ranked #1 on Unsupervised Instance Segmentation on UVO
no code implementations • 5 Jan 2023 • Kumar Ashutosh, Rohit Girdhar, Lorenzo Torresani, Kristen Grauman
Narrated "how-to" videos have emerged as a promising data source for a wide range of learning problems, from learning visual representations to training robot policies.
no code implementations • CVPR 2023 • Kumar Ashutosh, Rohit Girdhar, Lorenzo Torresani, Kristen Grauman
Video-language embeddings are a promising avenue for injecting semantics into visual representations, but existing methods capture only short-term associations between seconds-long video clips and their accompanying text.
Ranked #1 on Action Recognition on Charades-Ego
1 code implementation • CVPR 2023 • Yue Zhao, Ishan Misra, Philipp Krähenbühl, Rohit Girdhar
We introduce LaViLa, a new approach to learning video-language representations by leveraging Large Language Models (LLMs).
1 code implementation • CVPR 2023 • Rohit Girdhar, Alaaeldin El-Nouby, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra
Furthermore, this model can be learned by dropping 90% of the image and 95% of the video patches, enabling extremely fast training of huge model architectures.
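Dropping most patches speeds training because only the small visible subset ever passes through the heavy encoder. A minimal sketch of random patch masking follows, assuming a generic ViT-style tokenization; it is an illustration of the masking step, not the paper's implementation.

```python
import torch

def random_masking(patches, mask_ratio=0.9):
    """Keep a random (1 - mask_ratio) fraction of patch tokens.

    patches: (B, N, D) patch embeddings. With mask_ratio=0.9 the encoder
    sees only ~10% of tokens, cutting its compute roughly tenfold.
    """
    B, N, D = patches.shape
    n_keep = max(1, int(N * (1 - mask_ratio)))
    noise = torch.rand(B, N, device=patches.device)   # per-token random scores
    keep_idx = noise.argsort(dim=1)[:, :n_keep]       # lowest scores survive
    visible = torch.gather(
        patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    return visible, keep_idx                          # indices later tell the
                                                      # decoder where tokens go
```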
2 code implementations • CVPR 2022 • Rohit Girdhar, Mannat Singh, Nikhila Ravi, Laurens van der Maaten, Armand Joulin, Ishan Misra
Prior work has studied different visual modalities in isolation and developed separate architectures for recognition of images, videos, and 3D data.
Ranked #1 on Scene Recognition on SUN-RGBD (using extra training data)
1 code implementation • 7 Jan 2022 • Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp Krähenbühl, Ishan Misra
For the first time, we train a detector with all twenty-one thousand classes of the ImageNet dataset and show that it generalizes to new datasets without finetuning.
Ranked #2 on Open Vocabulary Object Detection on OpenImages-v4
4 code implementations • 20 Dec 2021 • Bowen Cheng, Anwesa Choudhuri, Ishan Misra, Alexander Kirillov, Rohit Girdhar, Alexander G. Schwing
We find Mask2Former also achieves state-of-the-art performance on video instance segmentation without modifying the architecture, the loss or even the training pipeline.
5 code implementations • CVPR 2022 • Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar
While only the semantics of each task differ, current research focuses on designing specialized architectures for each task.
Ranked #2 on Semantic Segmentation on Mapillary val
3 code implementations • CVPR 2022 • Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, Sean Crane, Tien Do, Morrie Doulaty, Akshay Erapalli, Christoph Feichtenhofer, Adriano Fragomeni, Qichen Fu, Abrham Gebreselasie, Cristina Gonzalez, James Hillis, Xuhua Huang, Yifei HUANG, Wenqi Jia, Weslie Khoo, Jachym Kolar, Satwik Kottur, Anurag Kumar, Federico Landini, Chao Li, Yanghao Li, Zhenqiang Li, Karttikeya Mangalam, Raghava Modhugu, Jonathan Munro, Tullie Murrell, Takumi Nishiyasu, Will Price, Paola Ruiz Puentes, Merey Ramazanova, Leda Sari, Kiran Somasundaram, Audrey Southerland, Yusuke Sugano, Ruijie Tao, Minh Vo, Yuchen Wang, Xindi Wu, Takuma Yagi, Ziwei Zhao, Yunyi Zhu, Pablo Arbelaez, David Crandall, Dima Damen, Giovanni Maria Farinella, Christian Fuegen, Bernard Ghanem, Vamsi Krishna Ithapu, C. V. Jawahar, Hanbyul Joo, Kris Kitani, Haizhou Li, Richard Newcombe, Aude Oliva, Hyun Soo Park, James M. Rehg, Yoichi Sato, Jianbo Shi, Mike Zheng Shou, Antonio Torralba, Lorenzo Torresani, Mingfei Yan, Jitendra Malik
We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite.
1 code implementation • ICCV 2021 • Ishan Misra, Rohit Girdhar, Armand Joulin
We propose 3DETR, an end-to-end Transformer-based object detection model for 3D point clouds.
Ranked #13 on 3D Object Detection on ScanNetV2
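The DETR-style recipe translates to points roughly as follows: a Transformer encoder over point tokens, a decoder attending from object queries, and per-query box and class heads. The sketch below is a toy version under those assumptions (all dimensions, names, and the plain linear point embedding are ours, not the released model):

```python
import torch
import torch.nn as nn

class MiniPointDetr(nn.Module):
    """Toy 3DETR-style detector sketch, not the authors' architecture."""
    def __init__(self, d=256, n_queries=128, n_classes=18):
        super().__init__()
        self.point_embed = nn.Linear(3, d)            # xyz -> token features
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, nhead=8, batch_first=True), 3)
        self.queries = nn.Embedding(n_queries, d)     # learned object queries
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d, nhead=8, batch_first=True), 3)
        self.box_head = nn.Linear(d, 6)               # center + size per query
        self.cls_head = nn.Linear(d, n_classes + 1)   # classes + "no object"

    def forward(self, xyz):                           # xyz: (B, N, 3)
        mem = self.encoder(self.point_embed(xyz))
        q = self.queries.weight.unsqueeze(0).expand(xyz.size(0), -1, -1)
        h = self.decoder(q, mem)
        return self.box_head(h), self.cls_head(h)
```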
1 code implementation • ICCV 2021 • Rohit Girdhar, Kristen Grauman
We propose Anticipative Video Transformer (AVT), an end-to-end attention-based video modeling architecture that attends to the previously observed video in order to anticipate future actions.
Ranked #1 on Action Anticipation on EPIC-KITCHENS-100 (test) (using extra training data)
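Anticipation with attention comes down to a causal mask: each timestep may attend only to the past, and the final state predicts what happens next. A minimal sketch under those assumptions (generic per-frame features, invented names, not the AVT architecture):

```python
import torch
import torch.nn as nn

class CausalAnticipator(nn.Module):
    """Sketch of attention-based anticipation: attend only to the past."""
    def __init__(self, d=512, n_actions=100):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d, n_actions)

    def forward(self, frame_feats):           # frame_feats: (B, T, d)
        T = frame_feats.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(T)
        h = self.backbone(frame_feats, mask=causal.to(frame_feats.device))
        return self.head(h[:, -1])            # last state predicts next action
```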
no code implementations • CVPR 2021 • Zhongzheng Ren, Ishan Misra, Alexander G. Schwing, Rohit Girdhar
We introduce WyPR, a Weakly-supervised framework for Point cloud Recognition, requiring only scene-level class tags as supervision.
1 code implementation • 20 Feb 2021 • Eltayeb Ahmed, Anton Bakhtin, Laurens van der Maaten, Rohit Girdhar
A common approach to solving physical reasoning tasks is to train a value learner on example tasks.
Ranked #1 on Visual Reasoning on PHYRE-1B-Within
1 code implementation • ICCV 2021 • Zaiwei Zhang, Rohit Girdhar, Armand Joulin, Ishan Misra
Pretraining on large labeled datasets is a prerequisite for good performance in many computer vision tasks, such as 2D object recognition and video classification.
1 code implementation • 18 Jun 2020 • Rohit Girdhar, Laura Gustafson, Aaron Adcock, Laurens van der Maaten
Physical reasoning requires forward prediction: the ability to forecast what will happen next given some initial world state.
Ranked #2 on Visual Reasoning on PHYRE-1B-Within
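The forward-prediction framing can be caricatured in a few lines: encode the initial world state, roll a learned one-step dynamics model forward, then judge the rollout. The sketch below is a loose latent-space caricature under our own assumptions, not the paper's object- or pixel-based forward models:

```python
import torch
import torch.nn as nn

class ForwardPredictor(nn.Module):
    """Sketch: roll a learned one-step dynamics model forward in latent
    space, then classify whether the rollout solves the task."""
    def __init__(self, d=128):
        super().__init__()
        self.step = nn.Linear(d, d)        # hypothetical one-step forward model
        self.classifier = nn.Linear(d, 1)

    def forward(self, state, n_steps=10):  # state: (B, d) encoded initial state
        for _ in range(n_steps):           # forecast successive future states
            state = torch.tanh(self.step(state))
        return self.classifier(state).squeeze(-1)   # "task solved" logit
```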
no code implementations • 12 Jun 2020 • Bruno Korbar, Fabio Petroni, Rohit Girdhar, Lorenzo Torresani
With the advent of large-scale multimodal video datasets, especially sequences with audio or transcribed speech, there has been a growing interest in self-supervised learning of video representations.
no code implementations • ICLR 2020 • Rohit Girdhar, Deva Ramanan
In this work, we build a video dataset with fully observable and controllable object and scene bias, one that truly requires spatiotemporal understanding to be solved.
no code implementations • 8 Nov 2019 • Bhavan Jasani, Rohit Girdhar, Deva Ramanan
Joint vision and language tasks like visual question answering are fascinating because they explore high-level understanding, but at the same time, can be more prone to language biases.
no code implementations • ICLR 2020 • Jessica Lee, Deva Ramanan, Rohit Girdhar
We address the task of unsupervised retargeting of human actions from one video to another.
1 code implementation • 10 Oct 2019 • Rohit Girdhar, Deva Ramanan
In this work, we build a video dataset with fully observable and controllable object and scene bias, one that truly requires spatiotemporal understanding to be solved.
no code implementations • ICCV 2019 • Rohit Girdhar, Du Tran, Lorenzo Torresani, Deva Ramanan
In this work, we propose an alternative approach to learning video representations that requires no semantically labeled videos and instead leverages the years of effort spent collecting and labeling large, clean still-image datasets.
Ranked #70 on Action Recognition on HMDB-51 (using extra training data)
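One natural way to transfer image-dataset effort to video, and the flavor of loss the sentence above suggests, is distillation: match a video model's predictions to a frozen image model's per-frame predictions. A hedged sketch, with the temperature and names as our assumptions:

```python
import torch.nn.functional as F

def distillation_loss(video_student_logits, image_teacher_logits, T=4.0):
    """Match a video student's predictions to a frozen image teacher's
    per-frame predictions (a sketch of image-to-video distillation)."""
    s = F.log_softmax(video_student_logits / T, dim=-1)
    t = F.softmax(image_teacher_logits / T, dim=-1)
    # Standard temperature-scaled KL; the T*T factor keeps gradient
    # magnitudes comparable across temperatures.
    return F.kl_div(s, t, reduction="batchmean") * T * T
```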
no code implementations • CVPR 2019 • Rohit Girdhar, João Carreira, Carl Doersch, Andrew Zisserman
We introduce the Action Transformer model for recognizing and localizing human actions in video clips.
Ranked #5 on Action Recognition on AVA v2.1
no code implementations • 26 Jul 2018 • Rohit Girdhar, João Carreira, Carl Doersch, Andrew Zisserman
We introduce a simple baseline for action localization on the AVA dataset.
Ranked #11 on Action Recognition on AVA v2.1
no code implementations • CVPR 2017 • Xiaolong Wang, Rohit Girdhar, Abhinav Gupta
In this paper, we tackle the challenge of creating one of the biggest datasets for learning affordances.
1 code implementation • CVPR 2018 • Rohit Girdhar, Georgia Gkioxari, Lorenzo Torresani, Manohar Paluri, Du Tran
This paper addresses the problem of estimating and tracking human body keypoints in complex, multi-person video.
Ranked #7 on Keypoint Detection on COCO test-challenge
1 code implementation • NeurIPS 2017 • Rohit Girdhar, Deva Ramanan
We introduce a simple yet surprisingly powerful model to incorporate attention in action recognition and human object interaction tasks.
Ranked #7 on Human-Object Interaction Detection on HICO
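The simplest form of attention for recognition is a learned weighting over spatial locations before pooling. A minimal sketch in that spirit (our own simplified variant, not the paper's low-rank second-order formulation):

```python
import torch
import torch.nn as nn

class AttentionalPooling(nn.Module):
    """Sketch of attention-weighted pooling over spatial conv features."""
    def __init__(self, d, n_classes):
        super().__init__()
        self.attn = nn.Linear(d, 1)          # one attention score per location
        self.cls = nn.Linear(d, n_classes)

    def forward(self, feats):                # feats: (B, HW, d) conv features
        w = torch.softmax(self.attn(feats), dim=1)   # (B, HW, 1) weights
        pooled = (w * feats).sum(dim=1)      # attention-weighted average
        return self.cls(pooled)
```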
no code implementations • CVPR 2017 • Rohit Girdhar, Deva Ramanan, Abhinav Gupta, Josef Sivic, Bryan Russell
In this work, we introduce a new video representation for action classification that aggregates local convolutional features across the entire spatio-temporal extent of the video.
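Aggregation across the full spatio-temporal extent can be sketched as VLAD-style pooling: softly assign every local feature to learned cluster centers and sum the residuals over all space-time locations. A minimal sketch under those assumptions (names and sizes are ours):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VladAggregation(nn.Module):
    """Sketch of VLAD-style aggregation of local features over space-time."""
    def __init__(self, d=512, n_clusters=64):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(n_clusters, d) * 0.01)
        self.assign = nn.Linear(d, n_clusters)

    def forward(self, feats):                       # (B, N, d), N = T*H*W
        a = F.softmax(self.assign(feats), dim=-1)   # soft cluster assignments
        resid = feats.unsqueeze(2) - self.centers   # (B, N, K, d) residuals
        v = (a.unsqueeze(-1) * resid).sum(dim=1)    # aggregate over locations
        return F.normalize(v.flatten(1), dim=-1)    # (B, K*d) video descriptor
```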
2 code implementations • 29 Mar 2016 • Rohit Girdhar, David F. Fouhey, Mikel Rodriguez, Abhinav Gupta
The network consists of two components: (a) an autoencoder that ensures the representation is generative; and (b) a convolutional network that ensures the representation is predictable.
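The two components translate directly into a two-part objective: the embedding must reconstruct the input (generative) and be regressable from an image (predictable). A hedged sketch of that joint loss, with all module names, the detach, and the loss choices as our assumptions rather than the paper's exact training setup:

```python
import torch.nn.functional as F

def tl_embedding_loss(voxels, image, encoder, decoder, convnet):
    """Sketch of the two-part objective: the embedding must both
    reconstruct the 3D shape (generative) and be predictable from
    a 2D image. All module names here are hypothetical."""
    z = encoder(voxels)                     # 3D shape -> embedding
    # Assumes the decoder ends in a sigmoid over occupancy in [0, 1].
    recon_loss = F.binary_cross_entropy(decoder(z), voxels)
    # The image network chases the (detached) shape embedding.
    pred_loss = F.mse_loss(convnet(image), z.detach())
    return recon_loss + pred_loss
```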