Search Results for author: Ishan Misra

Found 61 papers, 38 papers with code

Generating Illustrated Instructions

no code implementations 7 Dec 2023 Sachit Menon, Ishan Misra, Rohit Girdhar

We introduce the new task of generating Illustrated Instructions, i.e., visual instructions customized to a user's needs.

Text-to-Image Generation

On Bringing Robots Home

1 code implementation 27 Nov 2023 Nur Muhammad Mahi Shafiullah, Anant Rai, Haritheja Etukuru, Yiqian Liu, Ishan Misra, Soumith Chintala, Lerrel Pinto

We use the Stick to collect 13 hours of data in 22 homes of New York City, and train Home Pretrained Representations (HPR).

SelfEval: Leveraging the discriminative nature of generative models for evaluation

no code implementations 17 Nov 2023 Sai Saketh Rambhatla, Ishan Misra

In this work, we show that text-to-image generative models can be 'inverted' to assess their own text-image understanding capabilities in a completely automated manner.

Attribute Visual Reasoning

Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning

no code implementations 17 Nov 2023 Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, Ishan Misra

We present Emu Video, a text-to-video generation model that factorizes the generation into two steps: first generating an image conditioned on the text, and then generating a video conditioned on the text and the generated image.

Text-to-Video Generation Video Generation

VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation

1 code implementation 28 Aug 2023 Xudong Wang, Ishan Misra, Ziyun Zeng, Rohit Girdhar, Trevor Darrell

Existing approaches to unsupervised video instance segmentation typically rely on motion estimates and experience difficulties tracking small or divergent motions.

Instance Segmentation Optical Flow Estimation +5

GeneCIS: A Benchmark for General Conditional Image Similarity

no code implementations CVPR 2023 Sagar Vaze, Nicolas Carion, Ishan Misra

In this paper, we propose the GeneCIS ('genesis') benchmark, which measures models' ability to adapt to a range of similarity conditions.

Image Retrieval Representation Learning

ImageBind: One Embedding Space To Bind Them All

1 code implementation CVPR 2023 Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra

We show that all combinations of paired data are not necessary to train such a joint embedding, and only image-paired data is sufficient to bind the modalities together.

Cross-Modal Retrieval Retrieval +7
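The claim that image-paired data alone suffices to bind modalities can be illustrated with a toy linear model: text and audio are each aligned to a shared image space using only their own image pairings, and text-audio retrieval emerges without any text-audio pairs. The dimensions, linear maps, and data here are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a shared 8-d "image" space; text and audio each observe it
# through their own (unknown) linear map, and we only ever see pairs
# that involve images -- never text-audio pairs.
d = 8
img = rng.normal(size=(64, d))            # image embeddings
text = img @ rng.normal(size=(d, d))      # text paired with these images
audio = img @ rng.normal(size=(d, d))     # audio paired with the same images

# "Bind" each modality to the image space by solving its pairing alone.
W_text = np.linalg.lstsq(text, img, rcond=None)[0]
W_audio = np.linalg.lstsq(audio, img, rcond=None)[0]

# Emergent alignment: text and audio were never trained together, yet
# mapping both into the shared image space makes matching items
# nearest neighbors of each other.
queries = text[:5] @ W_text
keys = audio @ W_audio
nn = np.argmin(np.linalg.norm(queries[:, None] - keys[None], axis=-1), axis=1)
print(nn)  # [0 1 2 3 4]
```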

MOST: Multiple Object localization with Self-supervised Transformers for object discovery

no code implementations ICCV 2023 Sai Saketh Rambhatla, Ishan Misra, Rama Chellappa, Abhinav Shrivastava

In this work, we present Multiple Object localization with Self-supervised Transformers (MOST) that uses features of transformers trained using self-supervised learning to localize multiple objects in real world images.

Object object-detection +6

Vision-Language Models Performing Zero-Shot Tasks Exhibit Gender-based Disparities

no code implementations 26 Jan 2023 Melissa Hall, Laura Gustafson, Aaron Adcock, Ishan Misra, Candace Ross

With these capabilities in mind, we ask: Do vision-language models exhibit gender bias when performing zero-shot image classification, object detection and semantic segmentation?

Image Classification object-detection +4

A Simple Recipe for Competitive Low-compute Self supervised Vision Models

no code implementations 23 Jan 2023 Quentin Duval, Ishan Misra, Nicolas Ballas

Our main insight is that existing joint-embedding based SSL methods can be repurposed for knowledge distillation from a large self-supervised teacher to a small student model.

Knowledge Distillation
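The distillation recipe this insight points at can be sketched with a minimal embedding-matching loss between a frozen large teacher and a small student that embed the same images. The cosine-alignment loss below is a toy stand-in for the joint-embedding SSL objectives the paper actually repurposes, and the data is assumed:

```python
import numpy as np

def distill_loss(student_emb, teacher_emb):
    """Cosine-alignment distillation: push the student's embedding of
    each image toward the frozen teacher's embedding of that image."""
    s = student_emb / np.linalg.norm(student_emb, axis=1, keepdims=True)
    t = teacher_emb / np.linalg.norm(teacher_emb, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(s * t, axis=1)))

rng = np.random.default_rng(0)
teacher = rng.normal(size=(8, 16))
perfect = distill_loss(teacher.copy(), teacher)       # student matches teacher
random = distill_loss(rng.normal(size=(8, 16)), teacher)  # untrained student
assert perfect < 1e-9 < random
```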

The Hidden Uniform Cluster Prior in Self-Supervised Learning

no code implementations 13 Oct 2022 Mahmoud Assran, Randall Balestriero, Quentin Duval, Florian Bordes, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Nicolas Ballas

A successful paradigm in representation learning is to perform self-supervised pretraining using tasks based on mini-batch statistics (e.g., SimCLR, VICReg, SwAV, MSN).

Clustering Representation Learning +1

MonoNeRF: Learning Generalizable NeRFs from Monocular Videos without Camera Pose

no code implementations 13 Oct 2022 Yang Fu, Ishan Misra, Xiaolong Wang

We propose a generalizable neural radiance field, MonoNeRF, which can be trained on large-scale monocular videos of camera motion in static scenes without any ground-truth annotations of depth or camera poses.

Depth Estimation Disentanglement +2

OmniMAE: Single Model Masked Pretraining on Images and Videos

1 code implementation CVPR 2023 Rohit Girdhar, Alaaeldin El-Nouby, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra

Furthermore, this model can be learned by dropping 90% of the image and 95% of the video patches, enabling extremely fast training of huge model architectures.
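The patch-dropping step described above is simple to sketch: sample a random subset of patch indices to keep and encode only those visible patches. The patch counts below (196 for a 224x224 image with 16x16 patches, 1568 for a 16-frame video) are illustrative assumptions:

```python
import numpy as np

def random_patch_mask(num_patches, drop_ratio, rng):
    """Indices of the patches kept after randomly dropping `drop_ratio`
    of them; only these visible patches are fed to the encoder."""
    num_keep = int(round(num_patches * (1.0 - drop_ratio)))
    return np.sort(rng.permutation(num_patches)[:num_keep])

rng = np.random.default_rng(0)
# Assumed tokenizations: 224x224 image with 16x16 patches -> 196 patches;
# a 16-frame video tokenized into 1568 spatio-temporal patches.
image_visible = random_patch_mask(196, 0.90, rng)   # keep 10% -> 20 patches
video_visible = random_patch_mask(1568, 0.95, rng)  # keep  5% -> 78 patches
print(len(image_visible), len(video_visible))  # 20 78
```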

Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision

1 code implementation 16 Feb 2022 Priya Goyal, Quentin Duval, Isaac Seessel, Mathilde Caron, Ishan Misra, Levent Sagun, Armand Joulin, Piotr Bojanowski

Discriminative self-supervised learning allows training models on any random group of internet images, possibly recovering salient information that helps differentiate between them.

 Ranked #1 on Copy Detection on Copydays strong subset (using extra training data)

Action Classification Action Recognition +12

A Data-Augmentation Is Worth A Thousand Samples: Exact Quantification From Analytical Augmented Sample Moments

no code implementations 16 Feb 2022 Randall Balestriero, Ishan Misra, Yann LeCun

We show that for a training loss to be stable under DA sampling, the model's saliency map (gradient of the loss with respect to the model's input) must align with the smallest eigenvector of the sample variance under the considered DA augmentation, hinting at a possible explanation on why models tend to shift their focus from edges to textures.

Data Augmentation
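The quantity at the heart of this claim, the smallest eigenvector of the sample variance under a given augmentation, can be computed directly in a toy setting: augment one input many times with noise that barely perturbs a single dimension, and that dimension dominates the smallest eigenvector. This illustrates the geometric object only, not the paper's stability analysis; the dimensions and noise scales are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
# Augment one 5-d sample many times; the noise barely perturbs the last
# dimension, so that direction has the least augmentation variance.
x = rng.normal(size=5)
noise_scale = np.array([1.0, 1.0, 1.0, 1.0, 0.01])
augmented = x + rng.normal(size=(10000, 5)) * noise_scale

# Smallest eigenvector of the sample variance under this augmentation:
cov = np.cov(augmented, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
v_min = eigvecs[:, 0]
# It points along the least-augmented input dimension -- the direction
# the saliency map must align with for the training loss to be stable.
print(np.argmax(np.abs(v_min)))  # 4
```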

Omnivore: A Single Model for Many Visual Modalities

2 code implementations CVPR 2022 Rohit Girdhar, Mannat Singh, Nikhila Ravi, Laurens van der Maaten, Armand Joulin, Ishan Misra

Prior work has studied different visual modalities in isolation and developed separate architectures for recognition of images, videos, and 3D data.

 Ranked #1 on Scene Recognition on SUN-RGBD (using extra training data)

Action Classification Action Recognition +3

Detecting Twenty-thousand Classes using Image-level Supervision

1 code implementation 7 Jan 2022 Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp Krähenbühl, Ishan Misra

For the first time, we train a detector with all the twenty-one-thousand classes of the ImageNet dataset and show that it generalizes to new datasets without finetuning.

Image Classification Open Vocabulary Object Detection

Mask2Former for Video Instance Segmentation

5 code implementations 20 Dec 2021 Bowen Cheng, Anwesa Choudhuri, Ishan Misra, Alexander Kirillov, Rohit Girdhar, Alexander G. Schwing

We find Mask2Former also achieves state-of-the-art performance on video instance segmentation without modifying the architecture, the loss or even the training pipeline.

Image Segmentation Instance Segmentation +5

3D Spatial Recognition without Spatially Labeled 3D

1 code implementation CVPR 2021 Zhongzheng Ren, Ishan Misra, Alexander G. Schwing, Rohit Girdhar

We introduce WyPR, a Weakly-supervised framework for Point cloud Recognition, requiring only scene-level class tags as supervision.

3D Object Detection Multiple Instance Learning +5

Emerging Properties in Self-Supervised Vision Transformers

26 code implementations ICCV 2021 Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, Armand Joulin

In this paper, we question if self-supervised learning provides new properties to Vision Transformer (ViT) that stand out compared to convolutional networks (convnets).

Copy Detection Image Retrieval +7

Robust Audio-Visual Instance Discrimination

no code implementations CVPR 2021 Pedro Morgado, Ishan Misra, Nuno Vasconcelos

Second, since self-supervised contrastive learning relies on random sampling of negative instances, instances that are semantically similar to the base instance can be used as faulty negatives.

Action Recognition Contrastive Learning +2

Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning

1 code implementation ICCV 2021 Mandela Patrick, Yuki M. Asano, Bernie Huang, Ishan Misra, Florian Metze, Joao Henriques, Andrea Vedaldi

First, for space, we show that spatial augmentations such as cropping do work well for videos too, but that previous implementations, due to the high processing and memory cost, could not do this at a scale sufficient for it to work well.

Representation Learning Self-Supervised Learning

Barlow Twins: Self-Supervised Learning via Redundancy Reduction

24 code implementations 4 Mar 2021 Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, Stéphane Deny

This causes the embedding vectors of distorted versions of a sample to be similar, while minimizing the redundancy between the components of these vectors.

General Classification Object Detection +3
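The objective described above reduces to pushing a cross-correlation matrix toward the identity: the on-diagonal term makes embeddings of distorted views similar, while the off-diagonal term removes redundancy between embedding components. A minimal NumPy sketch of the loss (the per-dimension batch normalization and λ weighting follow the paper; the toy data is an assumption):

```python
import numpy as np

def barlow_twins_loss(z_a, z_b, lam=5e-3):
    """z_a, z_b: (N, D) embeddings of two distorted views of the same N
    images. Normalizes each dimension over the batch, then penalizes the
    cross-correlation matrix's distance from the identity."""
    n = z_a.shape[0]
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + 1e-8)
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + 1e-8)
    c = (z_a.T @ z_b) / n                        # (D, D) cross-correlation
    on_diag = ((np.diag(c) - 1.0) ** 2).sum()    # invariance term
    off_diag = (c ** 2).sum() - (np.diag(c) ** 2).sum()  # redundancy term
    return on_diag + lam * off_diag

rng = np.random.default_rng(0)
z = rng.normal(size=(256, 32))
# Identical views: cross-correlation is (almost) the identity -> low loss.
# Unrelated views: on-diagonal correlations collapse to ~0 -> high loss.
assert barlow_twins_loss(z, z) < barlow_twins_loss(z, rng.normal(size=(256, 32)))
```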

Self-Supervised Pretraining of 3D Features on any Point-Cloud

1 code implementation ICCV 2021 Zaiwei Zhang, Rohit Girdhar, Armand Joulin, Ishan Misra

Pretraining on large labeled datasets is a prerequisite to achieve good performance in many computer vision tasks like 2D object recognition, video classification, etc.

Object object-detection +4

Unsupervised Learning of Visual Features by Contrasting Cluster Assignments

16 code implementations NeurIPS 2020 Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, Armand Joulin

In addition, we also propose a new data augmentation strategy, multi-crop, that uses a mix of views with different resolutions in place of two full-resolution views, without increasing the memory or compute requirements much.

Contrastive Learning Data Augmentation +2
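The multi-crop strategy can be sketched as sampling a few large crops plus several small low-resolution crops from each image, in place of two full-resolution views. The crop counts and sizes below are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def multi_crop(image, rng, num_global=2, num_local=4,
               global_size=160, local_size=96):
    """Sample a mix of large and small random crops from one image,
    replacing the usual two full-resolution views."""
    h, w = image.shape[:2]
    crops = []
    for size in [global_size] * num_global + [local_size] * num_local:
        y = rng.integers(0, h - size + 1)  # high bound is exclusive
        x = rng.integers(0, w - size + 1)
        crops.append(image[y:y + size, x:x + size])
    return crops

rng = np.random.default_rng(0)
views = multi_crop(np.zeros((224, 224, 3)), rng)
print([v.shape[0] for v in views])  # [160, 160, 96, 96, 96, 96]
```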

In Defense of Grid Features for Visual Question Answering

2 code implementations CVPR 2020 Huaizu Jiang, Ishan Misra, Marcus Rohrbach, Erik Learned-Miller, Xinlei Chen

Popularized as 'bottom-up' attention, bounding box (or region) based visual features have recently surpassed vanilla grid-based convolutional features as the de facto standard for vision and language tasks like visual question answering (VQA).

Image Captioning Question Answering +1

ClusterFit: Improving Generalization of Visual Representations

1 code implementation CVPR 2020 Xueting Yan, Ishan Misra, Abhinav Gupta, Deepti Ghadiyaram, Dhruv Mahajan

Pre-training convolutional neural networks with weakly-supervised and self-supervised strategies is becoming increasingly popular for several computer vision tasks.

Action Classification Clustering +2

Self-Supervised Learning of Pretext-Invariant Representations

7 code implementations CVPR 2020 Ishan Misra, Laurens van der Maaten

The goal of self-supervised learning from images is to construct image representations that are semantically meaningful via pretext tasks that do not require semantic annotations for a large training set of images.

Contrastive Learning object-detection +5

Does Object Recognition Work for Everyone?

no code implementations 6 Jun 2019 Terrance DeVries, Ishan Misra, Changhan Wang, Laurens van der Maaten

The paper analyzes the accuracy of publicly available object-recognition systems on a geographically diverse dataset.

Object Object Recognition

Evaluating Text-to-Image Matching using Binary Image Selection (BISON)

no code implementations 19 Jan 2019 Hexiang Hu, Ishan Misra, Laurens van der Maaten

Providing systems the ability to relate linguistic and visual content is one of the hallmarks of computer vision.

Image Captioning Image Retrieval +1

Learning by Asking Questions

no code implementations CVPR 2018 Ishan Misra, Ross Girshick, Rob Fergus, Martial Hebert, Abhinav Gupta, Laurens van der Maaten

We also show that our model asks questions that generalize to state-of-the-art VQA models and to novel test time distributions.

Question Answering Visual Question Answering

From Red Wine to Red Tomato: Composition With Context

no code implementations CVPR 2017 Ishan Misra, Abhinav Gupta, Martial Hebert

In this paper, we present a simple method that respects contextuality in order to compose classifiers of known visual concepts.

Generating Natural Questions About an Image

2 code implementations ACL 2016 Nasrin Mostafazadeh, Ishan Misra, Jacob Devlin, Margaret Mitchell, Xiaodong He, Lucy Vanderwende

There has been an explosion of work in the vision & language community during the past few years from image captioning to video transcription, and answering questions about images.

Image Captioning Natural Questions +3

Seeing through the Human Reporting Bias: Visual Classifiers from Noisy Human-Centric Labels

no code implementations CVPR 2016 Ishan Misra, C. Lawrence Zitnick, Margaret Mitchell, Ross Girshick

When human annotators are given a choice about what to label in an image, they apply their own subjective judgments on what to ignore and what to mention.

Image Captioning Image Classification

Watch and Learn: Semi-Supervised Learning of Object Detectors from Videos

no code implementations 21 May 2015 Ishan Misra, Abhinav Shrivastava, Martial Hebert

We present a semi-supervised approach that localizes multiple unknown object instances in long videos.

Object object-detection +1
