1 code implementation • 28 Aug 2023 • Xudong Wang, Ishan Misra, Ziyun Zeng, Rohit Girdhar, Trevor Darrell
Existing approaches to unsupervised video instance segmentation typically rely on motion estimates and experience difficulties tracking small or divergent motions.
no code implementations • CVPR 2023 • Sagar Vaze, Nicolas Carion, Ishan Misra
In this paper, we propose the GeneCIS ('genesis') benchmark, which measures models' ability to adapt to a range of similarity conditions.
1 code implementation • CVPR 2023 • Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra
We show that not all combinations of paired data are necessary to train such a joint embedding; image-paired data alone is sufficient to bind the modalities together.
Ranked #7 on Zero-Shot Video Retrieval on MSR-VTT
5 code implementations • 14 Apr 2023 • Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, Piotr Bojanowski
The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision.
Ranked #1 on Image Retrieval on AmsterTime (using extra training data)
no code implementations • ICCV 2023 • Sai Saketh Rambhatla, Ishan Misra, Rama Chellappa, Abhinav Shrivastava
In this work, we present Multiple Object localization with Self-supervised Transformers (MOST) that uses features of transformers trained using self-supervised learning to localize multiple objects in real world images.
no code implementations • ICCV 2023 • Mannat Singh, Quentin Duval, Kalyan Vasudev Alwala, Haoqi Fan, Vaibhav Aggarwal, Aaron Adcock, Armand Joulin, Piotr Dollár, Christoph Feichtenhofer, Ross Girshick, Rohit Girdhar, Ishan Misra
While MAE has only been shown to scale with the size of models, we find that it scales with the size of the training dataset as well.
Ranked #1 on Zero-Shot Transfer Image Classification on Food-101 (using extra training data)
1 code implementation • 28 Feb 2023 • Sangwoo Mo, Jong-Chyi Su, Chih-Yao Ma, Mido Assran, Ishan Misra, Licheng Yu, Sean Bell
Semi-supervised learning aims to train a model using limited labels.
1 code implementation • CVPR 2023 • Xudong Wang, Rohit Girdhar, Stella X. Yu, Ishan Misra
We propose Cut-and-LEaRn (CutLER), a simple approach for training unsupervised object detection and segmentation models.
Ranked #1 on Unsupervised Instance Segmentation on UVO
no code implementations • 26 Jan 2023 • Melissa Hall, Laura Gustafson, Aaron Adcock, Ishan Misra, Candace Ross
With these capabilities in mind, we ask: Do vision-language models exhibit gender bias when performing zero-shot image classification, object detection and semantic segmentation?
no code implementations • 23 Jan 2023 • Quentin Duval, Ishan Misra, Nicolas Ballas
Our main insight is that existing joint-embedding based SSL methods can be repurposed for knowledge distillation from a large self-supervised teacher to a small student model.
3 code implementations • CVPR 2023 • Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann Lecun, Nicolas Ballas
This paper demonstrates an approach for learning highly semantic image representations without relying on hand-crafted data-augmentations.
1 code implementation • CVPR 2023 • Yue Zhao, Ishan Misra, Philipp Krähenbühl, Rohit Girdhar
We introduce LaViLa, a new approach to learning video-language representations by leveraging Large Language Models (LLMs).
Ranked #1 on Action Recognition on Charades-Ego
no code implementations • 13 Oct 2022 • Yang Fu, Ishan Misra, Xiaolong Wang
We propose a generalizable neural radiance field, MonoNeRF, that can be trained on large-scale monocular videos captured while moving through static scenes, without any ground-truth annotations of depth and camera poses.
no code implementations • 13 Oct 2022 • Mahmoud Assran, Randall Balestriero, Quentin Duval, Florian Bordes, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Nicolas Ballas
A successful paradigm in representation learning is to perform self-supervised pretraining using tasks based on mini-batch statistics (e.g., SimCLR, VICReg, SwAV, MSN).
1 code implementation • CVPR 2023 • Rohit Girdhar, Alaaeldin El-Nouby, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra
Furthermore, this model can be learned by dropping 90% of the image and 95% of the video patches, enabling extremely fast training of huge model architectures.
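The patch dropping mentioned in this excerpt can be illustrated with a minimal sketch. It only shows uniform random selection of the patches that are kept; the function name `keep_indices`, the 14x14 patch grid, and the 8-frame clip are hypothetical choices made for illustration and are not taken from the paper.

```python
import random

def keep_indices(num_patches, drop_ratio, seed=None):
    """Return sorted indices of the patches that survive random dropping."""
    rng = random.Random(seed)
    # Keep at least one patch so the encoder always has some input.
    num_keep = max(1, round(num_patches * (1.0 - drop_ratio)))
    return sorted(rng.sample(range(num_patches), num_keep))

# Hypothetical sizes: a 14x14 patch grid per image, 8 such grids per video clip.
image_kept = keep_indices(14 * 14, 0.90, seed=0)
video_kept = keep_indices(14 * 14 * 8, 0.95, seed=0)
print(len(image_kept), "of", 14 * 14, "image patches kept")
print(len(video_kept), "of", 14 * 14 * 8, "video patches kept")
```

Under these assumptions the encoder would process only the kept indices, which is where the training speed-up would come from.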
2 code implementations • 14 Apr 2022 • Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas
We propose Masked Siamese Networks (MSN), a self-supervised learning framework for learning image representations.
Self-Supervised Image Classification
Self-Supervised Learning
no code implementations • 16 Feb 2022 • Randall Balestriero, Ishan Misra, Yann Lecun
We show that for a training loss to be stable under DA sampling, the model's saliency map (gradient of the loss with respect to the model's input) must align with the smallest eigenvector of the sample variance under the considered DA augmentation, hinting at a possible explanation on why models tend to shift their focus from edges to textures.
1 code implementation • 16 Feb 2022 • Priya Goyal, Quentin Duval, Isaac Seessel, Mathilde Caron, Ishan Misra, Levent Sagun, Armand Joulin, Piotr Bojanowski
Discriminative self-supervised learning allows training models on any random group of internet images, and possibly recovering salient information that helps differentiate between the images.
Ranked #1 on Out-of-Distribution Generalization on ImageNet-W (using extra training data)
2 code implementations • CVPR 2022 • Rohit Girdhar, Mannat Singh, Nikhila Ravi, Laurens van der Maaten, Armand Joulin, Ishan Misra
Prior work has studied different visual modalities in isolation and developed separate architectures for recognition of images, videos, and 3D data.
Ranked #1 on Scene Recognition on SUN-RGBD (using extra training data)
1 code implementation • 7 Jan 2022 • Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp Krähenbühl, Ishan Misra
For the first time, we train a detector with all the twenty-one-thousand classes of the ImageNet dataset and show that it generalizes to new datasets without finetuning.
Ranked #2 on Open Vocabulary Object Detection on OpenImages-v4
4 code implementations • 20 Dec 2021 • Bowen Cheng, Anwesa Choudhuri, Ishan Misra, Alexander Kirillov, Rohit Girdhar, Alexander G. Schwing
We find Mask2Former also achieves state-of-the-art performance on video instance segmentation without modifying the architecture, the loss or even the training pipeline.
Ranked #10 on Video Instance Segmentation on YouTube-VIS validation
5 code implementations • CVPR 2022 • Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar
While only the semantics of each task differ, current research focuses on designing specialized architectures for each task.
Ranked #2 on Semantic Segmentation on Mapillary val
no code implementations • ICLR 2022 • Omri Puny, Matan Atzmon, Heli Ben-Hamu, Ishan Misra, Aditya Grover, Edward J. Smith, Yaron Lipman
For example, Euclidean motion invariant/equivariant graph or point cloud neural networks.
1 code implementation • ICCV 2021 • Ishan Misra, Rohit Girdhar, Armand Joulin
We propose 3DETR, an end-to-end Transformer based object detection model for 3D point clouds.
Ranked #14 on 3D Object Detection on ScanNetV2
2 code implementations • NeurIPS 2021 • Mandela Patrick, Dylan Campbell, Yuki M. Asano, Ishan Misra, Florian Metze, Christoph Feichtenhofer, Andrea Vedaldi, João F. Henriques
In video transformers, the time dimension is often treated in the same way as the two spatial dimensions.
Ranked #14 on Action Recognition on EPIC-KITCHENS-100 (using extra training data)
no code implementations • CVPR 2021 • Zhongzheng Ren, Ishan Misra, Alexander G. Schwing, Rohit Girdhar
We introduce WyPR, a Weakly-supervised framework for Point cloud Recognition, requiring only scene-level class tags as supervision.
23 code implementations • ICCV 2021 • Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, Armand Joulin
In this paper, we question if self-supervised learning provides new properties to Vision Transformer (ViT) that stand out compared to convolutional networks (convnets).
Ranked #2 on Visual Place Recognition on Laurel Caverns
4 code implementations • ICCV 2021 • Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Armand Joulin, Nicolas Ballas, Michael Rabbat
This paper proposes a novel method of learning by predicting view assignments with support samples (PAWS).
2 code implementations • 26 Apr 2021 • Aishwarya Kamath, Mannat Singh, Yann Lecun, Gabriel Synnaeve, Ishan Misra, Nicolas Carion
We also investigate the utility of our model as an object detector on a given label set when fine-tuned in a few-shot setting.
Ranked #1 on Visual Question Answering (VQA) on CLEVR-Humans
Generalized Referring Expression Comprehension
Phrase Grounding
no code implementations • CVPR 2021 • Pedro Morgado, Ishan Misra, Nuno Vasconcelos
Second, since self-supervised contrastive learning relies on random sampling of negative instances, instances that are semantically similar to the base instance can be used as faulty negatives.
1 code implementation • ICCV 2021 • Mandela Patrick, Yuki M. Asano, Bernie Huang, Ishan Misra, Florian Metze, Joao Henriques, Andrea Vedaldi
First, for space, we show that spatial augmentations such as cropping do work well for videos too, but that previous implementations, due to the high processing and memory cost, could not do this at a scale sufficient for it to work well.
23 code implementations • 4 Mar 2021 • Jure Zbontar, Li Jing, Ishan Misra, Yann Lecun, Stéphane Deny
This causes the embedding vectors of distorted versions of a sample to be similar, while minimizing the redundancy between the components of these vectors.
Ranked #11 on Image Classification on Places205
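The objective described in this excerpt, similar embeddings for two distorted views with low redundancy between embedding components, can be sketched as a loss on the cross-correlation matrix of the two views. This is a minimal pure-Python illustration, not the authors' implementation; the function name `barlow_twins_loss` and the default weight `lam=0.005` are assumptions made for this example.

```python
def barlow_twins_loss(z_a, z_b, lam=0.005):
    """z_a, z_b: lists of N embedding vectors (each of length D), one per view."""
    n, d = len(z_a), len(z_a[0])

    def standardize(z):
        # Zero mean and unit variance per embedding dimension across the batch.
        out = [[0.0] * d for _ in range(n)]
        for j in range(d):
            col = [row[j] for row in z]
            mean = sum(col) / n
            std = (sum((v - mean) ** 2 for v in col) / n) ** 0.5 or 1.0
            for i in range(n):
                out[i][j] = (z[i][j] - mean) / std
        return out

    a, b = standardize(z_a), standardize(z_b)
    loss = 0.0
    for i in range(d):
        for j in range(d):
            # Entry (i, j) of the cross-correlation matrix between the two views.
            c_ij = sum(a[k][i] * b[k][j] for k in range(n)) / n
            if i == j:
                loss += (1.0 - c_ij) ** 2  # invariance: diagonal pushed toward 1
            else:
                loss += lam * c_ij ** 2    # redundancy reduction: off-diagonal toward 0
    return loss

# Toy batch: 4 samples, 2 decorrelated embedding dimensions.
views = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
print(barlow_twins_loss(views, views))
```

When both views carry identical, decorrelated embeddings, as in the toy batch, the cross-correlation matrix is the identity and the loss vanishes.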
1 code implementation • 2 Mar 2021 • Priya Goyal, Mathilde Caron, Benjamin Lefaudeux, Min Xu, Pengchao Wang, Vivek Pai, Mannat Singh, Vitaliy Liptchinsky, Ishan Misra, Armand Joulin, Piotr Bojanowski
Recently, self-supervised learning methods like MoCo, SimCLR, BYOL and SwAV have reduced the gap with supervised methods.
Ranked #6 on Image Classification on Places205
Self-Supervised Image Classification
Self-Supervised Learning
1 code implementation • ICCV 2021 • Zaiwei Zhang, Rohit Girdhar, Armand Joulin, Ishan Misra
Pretraining on large labeled datasets is a prerequisite to achieve good performance in many computer vision tasks like 2D object recognition, video classification etc.
1 code implementation • ICCV 2021 • Aishwarya Kamath, Mannat Singh, Yann Lecun, Gabriel Synnaeve, Ishan Misra, Nicolas Carion
We also investigate the utility of our model as an object detector on a given label set when fine-tuned in a few-shot setting.
Ranked #2 on Referring Expression Comprehension on Talk2Car (using extra training data)
no code implementations • 25 Nov 2020 • Yutong Bai, Haoqi Fan, Ishan Misra, Ganesh Venkatesh, Yongyi Lu, Yuyin Zhou, Qihang Yu, Vikas Chandra, Alan Yuille
To this end, we present Temporal-aware Contrastive self-supervised learning (TaCo), as a general paradigm to enhance video CSL.
15 code implementations • NeurIPS 2020 • Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, Armand Joulin
In addition, we also propose a new data augmentation strategy, multi-crop, that uses a mix of views with different resolutions in place of two full-resolution views, without increasing the memory or compute requirements much.
Ranked #1 on Contrastive Learning on imagenet-1k
1 code implementation • CVPR 2021 • Pedro Morgado, Nuno Vasconcelos, Ishan Misra
Our method uses contrastive learning for cross-modal discrimination of video from audio and vice-versa.
Ranked #3 on Self-Supervised Audio Classification on ESC-50
2 code implementations • CVPR 2020 • Huaizu Jiang, Ishan Misra, Marcus Rohrbach, Erik Learned-Miller, Xinlei Chen
Popularized as 'bottom-up' attention, bounding box (or region) based visual features have recently surpassed vanilla grid-based convolutional features as the de facto standard for vision and language tasks like visual question answering (VQA).
Ranked #18 on Visual Question Answering (VQA) on VQA v2 test-std
1 code implementation • CVPR 2020 • Xueting Yan, Ishan Misra, Abhinav Gupta, Deepti Ghadiyaram, Dhruv Mahajan
Pre-training convolutional neural networks with weakly-supervised and self-supervised strategies is becoming increasingly popular for several computer vision tasks.
Ranked #52 on Image Classification on iNaturalist 2018
7 code implementations • CVPR 2020 • Ishan Misra, Laurens van der Maaten
The goal of self-supervised learning from images is to construct image representations that are semantically meaningful via pretext tasks that do not require semantic annotations for a large training set of images.
Ranked #7 on Contrastive Learning on imagenet-1k
no code implementations • 6 Jun 2019 • Terrance DeVries, Ishan Misra, Changhan Wang, Laurens van der Maaten
The paper analyzes the accuracy of publicly available object-recognition systems on a geographically diverse dataset.
no code implementations • ICCV 2019 • Nilesh Kulkarni, Ishan Misra, Shubham Tulsiani, Abhinav Gupta
We propose an approach to predict the 3D shape and pose for the objects present in a scene.
2 code implementations • ICCV 2019 • Priya Goyal, Dhruv Mahajan, Abhinav Gupta, Ishan Misra
Self-supervised learning aims to learn representations from the data itself without explicit manual supervision.
no code implementations • 19 Jan 2019 • Hexiang Hu, Ishan Misra, Laurens van der Maaten
Providing systems the ability to relate linguistic and visual content is one of the hallmarks of computer vision.
no code implementations • CVPR 2018 • Ishan Misra, Ross Girshick, Rob Fergus, Martial Hebert, Abhinav Gupta, Laurens van der Maaten
We also show that our model asks questions that generalize to state-of-the-art VQA models and to novel test time distributions.
6 code implementations • ICCV 2017 • Debidatta Dwibedi, Ishan Misra, Martial Hebert
In this paper, we propose a simple approach to generate large annotated instance datasets with minimal effort.
no code implementations • CVPR 2017 • Ishan Misra, Abhinav Gupta, Martial Hebert
In this paper, we present a simple method that respects contextuality in order to compose classifiers of known visual concepts.
1 code implementation • NAACL 2016 • Ting-Hao Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Aishwarya Agrawal, Jacob Devlin, Ross Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh, Lucy Vanderwende, Michel Galley, Margaret Mitchell
We introduce the first dataset for sequential vision-to-language, and explore how this data may be used for the task of visual storytelling.
1 code implementation • CVPR 2016 • Ishan Misra, Abhinav Shrivastava, Abhinav Gupta, Martial Hebert
In this paper, we propose a principled approach to learn shared representations in ConvNets using multi-task learning.
Ranked #98 on Semantic Segmentation on NYU Depth v2
no code implementations • 28 Mar 2016 • Ishan Misra, C. Lawrence Zitnick, Martial Hebert
With this simple task and no semantic labels, we learn a powerful visual representation using a Convolutional Neural Network (CNN).
Ranked #48 on Self-Supervised Action Recognition on HMDB51
2 code implementations • ACL 2016 • Nasrin Mostafazadeh, Ishan Misra, Jacob Devlin, Margaret Mitchell, Xiaodong He, Lucy Vanderwende
There has been an explosion of work in the vision & language community during the past few years from image captioning to video transcription, and answering questions about images.
no code implementations • CVPR 2016 • Ishan Misra, C. Lawrence Zitnick, Margaret Mitchell, Ross Girshick
When human annotators are given a choice about what to label in an image, they apply their own subjective judgments on what to ignore and what to mention.
no code implementations • CVPR 2015 • Ishan Misra, Abhinav Shrivastava, Martial Hebert
We present a semi-supervised approach that localizes multiple unknown object instances in long videos.
no code implementations • 21 May 2015 • Ishan Misra, Abhinav Shrivastava, Martial Hebert
We present a semi-supervised approach that localizes multiple unknown object instances in long videos.