Search Results for author: Devi Parikh

Found 145 papers, 67 papers with code

Understanding the Intrinsic Memorability of Images

no code implementations NeurIPS 2011 Phillip Isola, Devi Parikh, Antonio Torralba, Aude Oliva

Artists, advertisers, and photographers are routinely presented with the task of creating an image that a viewer will remember.

feature selection

Analyzing Semantic Segmentation Using Hybrid Human-Machine CRFs

no code implementations CVPR 2013 Roozbeh Mottaghi, Sanja Fidler, Jian Yao, Raquel Urtasun, Devi Parikh

Recent trends in semantic image segmentation have pushed for holistic scene understanding models that jointly reason about various tasks such as object detection, scene recognition, shape analysis, and contextual reasoning.

Image Segmentation object-detection +5

Bringing Semantics into Focus Using Visual Abstraction

no code implementations CVPR 2013 C. L. Zitnick, Devi Parikh

Importantly, abstract images also make it possible to generate sets of semantically similar scenes.

Attribute Semantic Similarity +1

Predicting Failures of Vision Systems

no code implementations CVPR 2014 Peng Zhang, Jiuling Wang, Ali Farhadi, Martial Hebert, Devi Parikh

We show that a surprisingly straightforward and general approach, that we call ALERT, can predict the likely accuracy (or failure) of a variety of computer vision systems – semantic segmentation, vanishing point and camera parameter estimation, and image memorability prediction – on individual input images.

Attribute Semantic Segmentation +1

Human-Machine CRFs for Identifying Bottlenecks in Holistic Scene Understanding

no code implementations16 Jun 2014 Roozbeh Mottaghi, Sanja Fidler, Alan Yuille, Raquel Urtasun, Devi Parikh

Recent trends in image understanding have pushed for holistic scene understanding models that jointly reason about various tasks such as object detection, scene recognition, shape analysis, contextual reasoning, and local appearance based classifiers.

Object object-detection +4

Image Specificity

no code implementations CVPR 2015 Mainak Jas, Devi Parikh

For some images, descriptions written by multiple people are consistent with each other.

Image Retrieval Retrieval +1

Don't Just Listen, Use Your Imagination: Leveraging Visual Common Sense for Non-Visual Tasks

no code implementations CVPR 2015 Xiao Lin, Devi Parikh

But much of common sense knowledge is unwritten - partly because it tends not to be interesting enough to talk about, and partly because some common sense is unnatural to articulate in text.

Common Sense Reasoning

Understanding Image Virality

no code implementations CVPR 2015 Arturo Deza, Devi Parikh

We train classifiers with state-of-the-art image features to predict virality of individual images, relative virality in pairs of images, and the dominant topic of a viral image.

Attribute Marketing

WhittleSearch: Interactive Image Search with Relative Attribute Feedback

no code implementations15 May 2015 Adriana Kovashka, Devi Parikh, Kristen Grauman

We propose a novel mode of feedback for image search, where a user describes which properties of exemplar images should be adjusted in order to more closely match his/her mental model of the image sought.

Attribute Image Retrieval
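
The feedback mode described above boils down, at query time, to filtering the database by relative attribute scores: keep only candidates that have more (or less) of the named attribute than the exemplar. A minimal sketch of that filtering step, assuming precomputed per-image attribute scores (the array contents and hard-threshold rule are illustrative, not the paper's exact formulation):

```python
import numpy as np

# Hypothetical precomputed attribute scores: rows = images, columns = attributes
# (say, column 0 = "formal", column 1 = "shiny").
scores = np.array([[0.2, 0.9],
                   [0.7, 0.1],
                   [0.8, 0.6],
                   [0.4, 0.4]])

def whittle(candidates, ref_img, attr, direction):
    """Keep candidates with more (+1) or less (-1) of `attr` than the reference image."""
    if direction > 0:
        keep = scores[candidates, attr] > scores[ref_img, attr]
    else:
        keep = scores[candidates, attr] < scores[ref_img, attr]
    return candidates[keep]

# Feedback: "more formal than image 0" and "less shiny than image 2".
pool = np.arange(len(scores))
pool = whittle(pool, ref_img=0, attr=0, direction=+1)
pool = whittle(pool, ref_img=2, attr=1, direction=-1)
print(pool)  # indices of images still consistent with the feedback
```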

Visual Word2Vec (vis-w2v): Learning Visually Grounded Word Embeddings Using Abstract Scenes

1 code implementation CVPR 2016 Satwik Kottur, Ramakrishna Vedantam, José M. F. Moura, Devi Parikh

While word embeddings trained using text have been extremely successful, they cannot uncover notions of semantic relatedness implicit in our visual world.

Common Sense Reasoning Image Retrieval +3

Counting Everyday Objects in Everyday Scenes

1 code implementation CVPR 2017 Prithvijit Chattopadhyay, Ramakrishna Vedantam, Ramprasaath R. Selvaraju, Dhruv Batra, Devi Parikh

In this work, we build dedicated models for counting designed to tackle the large variance in counts, appearances, and scales of objects found in natural scenes.

Object Object Counting +4

Joint Unsupervised Learning of Deep Representations and Image Clusters

3 code implementations CVPR 2016 Jianwei Yang, Devi Parikh, Dhruv Batra

In this paper, we propose a recurrent framework for Joint Unsupervised LEarning (JULE) of deep representations and image clusters.

Clustering Image Clustering +1
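
A toy sketch of the alternation this points at: cluster the current embeddings, then update the representation using the cluster assignments as pseudo-labels, and repeat. JULE itself unrolls this in a recurrent formulation with agglomerative clustering; the k-means and small MLP below are stand-ins chosen only to keep the illustration short.

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

x = torch.randn(200, 32)                       # toy "image features"
enc = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 8))
head = nn.Linear(8, 5)                         # classifier over 5 pseudo-clusters
opt = torch.optim.Adam(list(enc.parameters()) + list(head.parameters()), lr=1e-3)

for outer in range(10):
    # (1) cluster the current embeddings
    with torch.no_grad():
        z = enc(x).numpy()
    labels = torch.tensor(KMeans(n_clusters=5, n_init=10).fit_predict(z), dtype=torch.long)
    # (2) refine the representation with the cluster assignments as pseudo-labels
    for _ in range(20):
        loss = nn.functional.cross_entropy(head(enc(x)), labels)
        opt.zero_grad(); loss.backward(); opt.step()
```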

Leveraging Visual Question Answering for Image-Caption Ranking

no code implementations4 May 2016 Xiao Lin, Devi Parikh

This allows the model to interpret images and captions from a wide variety of perspectives.

Image Retrieval Question Answering +2

Hierarchical Question-Image Co-Attention for Visual Question Answering

9 code implementations NeurIPS 2016 Jiasen Lu, Jianwei Yang, Dhruv Batra, Devi Parikh

In addition, our model reasons about the question (and consequently the image via the co-attention mechanism) in a hierarchical fashion via a novel 1-dimensional convolutional neural network (CNN).

Visual Dialog Visual Question Answering
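
The phrase level of that hierarchy is built with 1-D convolutions of window sizes 1, 2, and 3 over the word embeddings, taking the maximum response at each position. A hedged sketch of just that step (dimensions are illustrative, and the co-attention itself is omitted):

```python
import torch
import torch.nn as nn

emb_dim, seq_len = 64, 12
words = torch.randn(1, emb_dim, seq_len)            # (batch, channels, words)

conv1 = nn.Conv1d(emb_dim, emb_dim, kernel_size=1)              # unigram
conv2 = nn.Conv1d(emb_dim, emb_dim, kernel_size=2, padding=1)   # bigram
conv3 = nn.Conv1d(emb_dim, emb_dim, kernel_size=3, padding=1)   # trigram

u = conv1(words)
b = conv2(words)[:, :, :seq_len]   # trim the extra position introduced by padding
t = conv3(words)

# Phrase-level feature at each word position: max over the n-gram responses.
phrase = torch.max(torch.stack([u, b, t], dim=0), dim=0).values
print(phrase.shape)                # torch.Size([1, 64, 12])
```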

Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions?

no code implementations EMNLP 2016 Abhishek Das, Harsh Agrawal, C. Lawrence Zitnick, Devi Parikh, Dhruv Batra

We conduct large-scale studies on 'human attention' in Visual Question Answering (VQA) to understand where humans choose to look to answer questions about images.

Question Answering Visual Question Answering

Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions?

no code implementations17 Jun 2016 Abhishek Das, Harsh Agrawal, C. Lawrence Zitnick, Devi Parikh, Dhruv Batra

We conduct large-scale studies on 'human attention' in Visual Question Answering (VQA) to understand where humans choose to look to answer questions about images.

Question Answering Visual Question Answering

Analyzing the Behavior of Visual Question Answering Models

1 code implementation EMNLP 2016 Aishwarya Agrawal, Dhruv Batra, Devi Parikh

Recently, a number of deep-learning based models have been proposed for the task of Visual Question Answering (VQA).

Question Answering Visual Question Answering

Deep Learning the City : Quantifying Urban Perception At A Global Scale

no code implementations5 Aug 2016 Abhimanyu Dubey, Nikhil Naik, Devi Parikh, Ramesh Raskar, César A. Hidalgo

Computer vision methods that quantify the perception of urban environment are increasingly being used to study the relationship between a city's physical appearance and the behavior and health of its residents.

General Classification

Grad-CAM: Why did you say that?

2 code implementations22 Nov 2016 Ramprasaath R. Selvaraju, Abhishek Das, Ramakrishna Vedantam, Michael Cogswell, Devi Parikh, Dhruv Batra

We propose a technique for making Convolutional Neural Network (CNN)-based models more transparent by visualizing input regions that are 'important' for predictions -- or visual explanations.

Image Captioning Visual Question Answering
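
The mechanism is compact enough to sketch end to end: back-propagate a class score to the last convolutional feature maps, global-average-pool those gradients into per-channel weights, and take the ReLU of the weighted sum of the maps as the heatmap. A minimal self-contained version on a toy CNN (not the authors' released implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

features = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(8, 16, 3, padding=1), nn.ReLU())
classifier = nn.Linear(16, 10)

img = torch.randn(1, 3, 32, 32)
fmap = features(img)                         # (1, 16, 32, 32) conv feature maps
fmap.retain_grad()                           # keep gradients on this intermediate tensor
logits = classifier(fmap.mean(dim=(2, 3)))   # global average pool + linear head

target = logits.argmax()                     # explain the predicted class
logits[0, target].backward()                 # d(class score) / d(feature maps)

weights = fmap.grad.mean(dim=(2, 3), keepdim=True)   # pooled gradients -> channel weights
cam = F.relu((weights * fmap).sum(dim=1))            # weighted sum over channels, then ReLU
cam = cam / (cam.max() + 1e-8)                       # normalize for visualization
print(cam.shape)                                     # torch.Size([1, 32, 32])
```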

Visual Dialog

11 code implementations CVPR 2017 Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M. F. Moura, Devi Parikh, Dhruv Batra

We introduce the task of Visual Dialog, which requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content.

Chatbot Retrieval +1

Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning

7 code implementations CVPR 2017 Jiasen Lu, Caiming Xiong, Devi Parikh, Richard Socher

The model decides whether to attend to the image and where, in order to extract meaningful information for sequential word generation.

Image Captioning Language Modelling

Context-aware Captions from Context-agnostic Supervision

1 code implementation CVPR 2017 Ramakrishna Vedantam, Samy Bengio, Kevin Murphy, Devi Parikh, Gal Chechik

We introduce an inference technique to produce discriminative context-aware image captions (captions that describe differences between images or visual concepts) using only generic context-agnostic training data (captions that describe a concept or an image in isolation).

Image Captioning Language Modelling

LR-GAN: Layered Recursive Generative Adversarial Networks for Image Generation

1 code implementation5 Mar 2017 Jianwei Yang, Anitha Kannan, Dhruv Batra, Devi Parikh

We present LR-GAN: an adversarial image generation model which takes scene structure and context into account.

Image Generation

Sound-Word2Vec: Learning Word Representations Grounded in Sounds

no code implementations EMNLP 2017 Ashwin K. Vijayakumar, Ramakrishna Vedantam, Devi Parikh

In this work, we treat sound as a first-class citizen, studying downstream textual tasks which require aural grounding.

Retrieval Word Embeddings

It Takes Two to Tango: Towards Theory of AI's Mind

no code implementations3 Apr 2017 Arjun Chandrasekaran, Deshraj Yadav, Prithvijit Chattopadhyay, Viraj Prabhu, Devi Parikh

Surprisingly, we find that having access to the model's internal states - its confidence in its top-k predictions, and explicit or implicit attention maps which highlight regions in the image (and words in the question) the model is looking at (and listening to) while answering a question about an image - does not help people better predict its behavior.

Attribute Question Answering +2

C-VQA: A Compositional Split of the Visual Question Answering (VQA) v1.0 Dataset

no code implementations26 Apr 2017 Aishwarya Agrawal, Aniruddha Kembhavi, Dhruv Batra, Devi Parikh

Finally, we evaluate several existing VQA models under this new setting and show that the performances of these models degrade by a significant amount compared to the original VQA setting.

Question Answering Visual Question Answering

Punny Captions: Witty Wordplay in Image Descriptions

1 code implementation NAACL 2018 Arjun Chandrasekaran, Devi Parikh, Mohit Bansal

Wit is a form of rich interaction that is often grounded in a specific situation (e.g., a comment in response to an event).

Cooperative Learning with Visual Attributes

no code implementations16 May 2017 Tanmay Batra, Devi Parikh

Learning paradigms involving varying levels of supervision have received a lot of interest within the computer vision and machine learning communities.

ParlAI: A Dialog Research Software Platform

22 code implementations EMNLP 2017 Alexander H. Miller, Will Feng, Adam Fisch, Jiasen Lu, Dhruv Batra, Antoine Bordes, Devi Parikh, Jason Weston

We introduce ParlAI (pronounced "par-lay"), an open-source software platform for dialog research implemented in Python, available at http://parl.ai.

reinforcement-learning Reinforcement Learning (RL) +1

Best of Both Worlds: Transferring Knowledge from Discriminative Learning to a Generative Visual Dialog Model

1 code implementation NeurIPS 2017 Jiasen Lu, Anitha Kannan, Jianwei Yang, Devi Parikh, Dhruv Batra

In contrast, discriminative dialog models (D) that are trained to rank a list of candidate human responses outperform their generative counterparts in terms of automatic metrics, diversity, and informativeness of the responses.

Informativeness Metric Learning +2

Deal or No Deal? End-to-End Learning for Negotiation Dialogues

2 code implementations16 Jun 2017 Mike Lewis, Denis Yarats, Yann N. Dauphin, Devi Parikh, Dhruv Batra

Much of human dialogue occurs in semi-cooperative settings, where agents with different goals attempt to agree on common decisions.

Deal or No Deal? End-to-End Learning of Negotiation Dialogues

no code implementations EMNLP 2017 Mike Lewis, Denis Yarats, Yann Dauphin, Devi Parikh, Dhruv Batra

Much of human dialogue occurs in semi-cooperative settings, where agents with different goals attempt to agree on common decisions.

Active Learning for Visual Question Answering: An Empirical Study

1 code implementation6 Nov 2017 Xiao Lin, Devi Parikh

We present an empirical study of active learning for Visual Question Answering, where a deep VQA model selects informative question-image pairs from a pool and queries an oracle for answers to maximally improve its performance under a limited query budget.

Active Learning Visual Question Answering
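
One standard way to score how 'informative' a pool item is in this setting is the entropy of the model's predictive answer distribution; the paper compares several acquisition functions, so treat the entropy criterion below as a generic illustration rather than its best-performing choice.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the VQA model's answer distributions over 10 answers
# for 500 unlabeled question-image pairs in the pool.
pool_probs = rng.dirichlet(np.ones(10), size=500)

def entropy(p, eps=1e-12):
    return -(p * np.log(p + eps)).sum(axis=-1)

budget = 32
scores = entropy(pool_probs)
query_ids = np.argsort(-scores)[:budget]   # most uncertain pairs go to the oracle
print(query_ids[:5], scores[query_ids[:5]].round(3))
```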

Embodied Question Answering

4 code implementations CVPR 2018 Abhishek Das, Samyak Datta, Georgia Gkioxari, Stefan Lee, Devi Parikh, Dhruv Batra

We present a new AI task -- Embodied Question Answering (EmbodiedQA) -- where an agent is spawned at a random location in a 3D environment and asked a question ("What color is the car?").

Embodied Question Answering Navigate +3

Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering

1 code implementation CVPR 2018 Aishwarya Agrawal, Dhruv Batra, Devi Parikh, Aniruddha Kembhavi

Specifically, we present new splits of the VQA v1 and VQA v2 datasets, which we call Visual Question Answering under Changing Priors (VQA-CP v1 and VQA-CP v2 respectively).

Question Answering Visual Question Answering

Neural Baby Talk

1 code implementation CVPR 2018 Jiasen Lu, Jianwei Yang, Dhruv Batra, Devi Parikh

We introduce a novel framework for image captioning that can produce natural language explicitly grounded in entities that object detectors find in the image.

Image Captioning Object +3

Talk the Walk: Navigating New York City through Grounded Dialogue

1 code implementation9 Jul 2018 Harm de Vries, Kurt Shuster, Dhruv Batra, Devi Parikh, Jason Weston, Douwe Kiela

We introduce "Talk The Walk", the first large-scale dialogue dataset grounded in action and perception.

Navigate

Pythia v0.1: the Winning Entry to the VQA Challenge 2018

9 code implementations26 Jul 2018 Yu Jiang, Vivek Natarajan, Xinlei Chen, Marcus Rohrbach, Dhruv Batra, Devi Parikh

We demonstrate that by making subtle but important changes to the model architecture and the learning rate schedule, fine-tuning image features, and adding data augmentation, we can significantly improve the performance of the up-down model on the VQA v2.0 dataset -- from 65.67% to 70.22%.

Data Augmentation Visual Question Answering (VQA)
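
The learning-rate change referred to is, in spirit, a warm-up followed by staged decay; the sketch below shows that general pattern, with step counts and factors that are placeholders rather than Pythia's actual configuration.

```python
def lr_at(step, base_lr=0.01, warmup_steps=2000, decay_steps=(14000, 18000), gamma=0.1):
    """Linear warm-up to base_lr, then multiply by gamma at each decay boundary."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    lr = base_lr
    for boundary in decay_steps:
        if step >= boundary:
            lr *= gamma
    return lr

for s in (0, 1000, 2000, 15000, 19000):
    print(s, round(lr_at(s), 5))
```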

Graph R-CNN for Scene Graph Generation

3 code implementations ECCV 2018 Jianwei Yang, Jiasen Lu, Stefan Lee, Dhruv Batra, Devi Parikh

We propose a novel scene graph generation model called Graph R-CNN, which is both effective and efficient at detecting objects and their relations in images.

Graph Generation Scene Graph Generation

Choose Your Neuron: Incorporating Domain Knowledge through Neuron-Importance

1 code implementation ECCV 2018 Ramprasaath R. Selvaraju, Prithvijit Chattopadhyay, Mohamed Elhoseiny, Tilak Sharma, Dhruv Batra, Devi Parikh, Stefan Lee

Our approach, which we call Neuron Importance-Aware Weight Transfer (NIWT), learns to map domain knowledge about novel "unseen" classes onto this dictionary of learned concepts and then optimizes for network parameters that can effectively combine these concepts - essentially learning classifiers by discovering and composing learned semantic concepts in deep networks.

Generalized Zero-Shot Learning

Visual Curiosity: Learning to Ask Questions to Learn Visual Recognition

no code implementations1 Oct 2018 Jianwei Yang, Jiasen Lu, Stefan Lee, Dhruv Batra, Devi Parikh

Our question generation policy generalizes to new environments and a new pair of eyes, i.e., a new visual system.

Question Generation Question-Generation

TarMAC: Targeted Multi-Agent Communication

no code implementations ICLR 2019 Abhishek Das, Théophile Gervet, Joshua Romoff, Dhruv Batra, Devi Parikh, Michael Rabbat, Joelle Pineau

We propose a targeted communication architecture for multi-agent reinforcement learning, where agents learn both what messages to send and whom to address them to while performing cooperative tasks in partially-observable environments.

Multi-agent Reinforcement Learning
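
The targeting is soft attention over messages: each sender broadcasts a signature (key) and a content vector (value), each receiver forms a query, and query-signature attention weights decide how much of each message a receiver absorbs. A small numpy sketch under those assumptions (single communication round, made-up dimensions):

```python
import numpy as np

rng = np.random.default_rng(1)
n_agents, d_key, d_val = 4, 8, 16

keys    = rng.normal(size=(n_agents, d_key))   # per-sender message signatures
values  = rng.normal(size=(n_agents, d_val))   # per-sender message contents
queries = rng.normal(size=(n_agents, d_key))   # per-receiver queries

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

attn = softmax(queries @ keys.T / np.sqrt(d_key))  # (receiver, sender) weights
incoming = attn @ values                           # aggregated message per receiver
print(attn.round(2))
print(incoming.shape)                              # (4, 16)
```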

Neural Modular Control for Embodied Question Answering

2 code implementations26 Oct 2018 Abhishek Das, Georgia Gkioxari, Stefan Lee, Devi Parikh, Dhruv Batra

We use imitation learning to warm-start policies at each level of the hierarchy, dramatically increasing sample efficiency, followed by reinforcement learning.

Embodied Question Answering Imitation Learning +3

Do Explanations make VQA Models more Predictable to a Human?

no code implementations EMNLP 2018 Arjun Chandrasekaran, Viraj Prabhu, Deshraj Yadav, Prithvijit Chattopadhyay, Devi Parikh

A rich line of research attempts to make deep neural networks more transparent by generating human-interpretable 'explanations' of their decision process, especially for interactive tasks like Visual Question Answering (VQA).

Question Answering Visual Question Answering

nocaps: novel object captioning at scale

2 code implementations ICCV 2019 Harsh Agrawal, Karan Desai, YuFei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, Peter Anderson

To encourage the development of image captioning models that can learn visual concepts from alternative data sources, such as object detection datasets, we present the first large-scale benchmark for this task.

Image Captioning Object +2

Response to "Visual Dialogue without Vision or Dialogue" (Massiceti et al., 2018)

no code implementations16 Jan 2019 Abhishek Das, Devi Parikh, Dhruv Batra

In a recent workshop paper, Massiceti et al. presented a baseline model and subsequent critique of Visual Dialog (Das et al., CVPR 2017) that raises what we believe to be unfounded concerns about the dataset and evaluation.

Visual Dialog

Embodied Multimodal Multitask Learning

no code implementations4 Feb 2019 Devendra Singh Chaplot, Lisa Lee, Ruslan Salakhutdinov, Devi Parikh, Dhruv Batra

In this paper, we propose a multitask model capable of jointly learning these multimodal tasks, and transferring knowledge of words and their grounding in visual objects across the tasks.

Disentanglement Embodied Question Answering +3

Taking a HINT: Leveraging Explanations to Make Vision and Language Models More Grounded

no code implementations ICCV 2019 Ramprasaath R. Selvaraju, Stefan Lee, Yilin Shen, Hongxia Jin, Shalini Ghosh, Larry Heck, Dhruv Batra, Devi Parikh

Many vision and language models suffer from poor visual grounding - often falling back on easy-to-learn language priors rather than basing their decisions on visual concepts in the image.

Image Captioning Question Answering +2

Cycle-Consistency for Robust Visual Question Answering

no code implementations CVPR 2019 Meet Shah, Xinlei Chen, Marcus Rohrbach, Devi Parikh

Despite significant progress in Visual Question Answering over the years, the robustness of today's VQA models leaves much to be desired.

Question Answering Question Generation +2

Lemotif: An Affective Visual Journal Using Deep Neural Networks

1 code implementation18 Mar 2019 X. Alice Li, Devi Parikh

We present Lemotif, an integrated natural language processing and image generation system that uses machine learning to (1) parse a text-based input journal entry describing the user's day for salient themes and emotions and (2) visualize the detected themes and emotions in creative and appealing image motifs.

Image Generation

Trick or TReAT: Thematic Reinforcement for Artistic Typography

1 code implementation19 Mar 2019 Purva Tendulkar, Kalpesh Krishna, Ramprasaath R. Selvaraju, Devi Parikh

An approach to make text visually appealing and memorable is semantic reinforcement - the use of visual cues alluding to the context or theme in which the word is being used to reinforce the message (e.g., Google Doodles).

Embodied Question Answering in Photorealistic Environments with Point Cloud Perception

no code implementations CVPR 2019 Erik Wijmans, Samyak Datta, Oleksandr Maksymets, Abhishek Das, Georgia Gkioxari, Stefan Lee, Irfan Essa, Devi Parikh, Dhruv Batra

To help bridge the gap between internet vision-style problems and the goal of vision for embodied perception, we instantiate a large-scale navigation task -- Embodied Question Answering [1] in photo-realistic environments (Matterport 3D).

Embodied Question Answering Question Answering

Embodied Visual Recognition

no code implementations9 Apr 2019 Jianwei Yang, Zhile Ren, Mingze Xu, Xinlei Chen, David Crandall, Devi Parikh, Dhruv Batra

Passive visual systems typically fail to recognize objects in the amodal setting where they are heavily occluded.

Object Object Localization +1

Counterfactual Visual Explanations

1 code implementation16 Apr 2019 Yash Goyal, Ziyan Wu, Jan Ernst, Dhruv Batra, Devi Parikh, Stefan Lee

In this work, we develop a technique to produce counterfactual visual explanations.

counterfactual General Classification +1

Fashion++: Minimal Edits for Outfit Improvement

no code implementations ICCV 2019 Wei-Lin Hsiao, Isay Katsman, Chao-yuan Wu, Devi Parikh, Kristen Grauman

We introduce Fashion++, an approach that proposes minimal adjustments to a full-body clothing outfit that will have maximal impact on its fashionability.

Image Generation

Emergence of Compositional Language with Deep Generational Transmission

1 code implementation ICLR 2020 Michael Cogswell, Jiasen Lu, Stefan Lee, Devi Parikh, Dhruv Batra

In this paper, we introduce these cultural evolutionary dynamics into language emergence by periodically replacing agents in a population to create a knowledge gap, implicitly inducing cultural transmission of language.

Reinforcement Learning (RL)

Cross-Task Knowledge Transfer for Visually-Grounded Navigation

no code implementations ICLR 2019 Devendra Singh Chaplot, Lisa Lee, Ruslan Salakhutdinov, Devi Parikh, Dhruv Batra

Recent efforts on training visual navigation agents conditioned on language using deep reinforcement learning have been successful in learning policies for two different tasks: learning to follow navigational instructions and embodied question answering.

Disentanglement Embodied Question Answering +3

Improving Generative Visual Dialog by Answering Diverse Questions

1 code implementation IJCNLP 2019 Vishvak Murahari, Prithvijit Chattopadhyay, Dhruv Batra, Devi Parikh, Abhishek Das

Prior work on training generative Visual Dialog models with reinforcement learning (Das et al.) has explored a Qbot-Abot image-guessing game and shown that this 'self-talk' approach can lead to improved performance at the downstream dialog-conditioned image-guessing task.

Representation Learning Visual Dialog

DS-VIC: Unsupervised Discovery of Decision States for Transfer in RL

no code implementations25 Sep 2019 Nirbhay Modhe, Prithvijit Chattopadhyay, Mohit Sharma, Abhishek Das, Devi Parikh, Dhruv Batra, Ramakrishna Vedantam

We learn to identify decision states, namely the parsimonious set of states where decisions meaningfully affect the future states an agent can reach in an environment.

DD-PPO: Learning Near-Perfect PointGoal Navigators from 2.5 Billion Frames

8 code implementations ICLR 2020 Erik Wijmans, Abhishek Kadian, Ari Morcos, Stefan Lee, Irfan Essa, Devi Parikh, Manolis Savva, Dhruv Batra

We leverage this scaling to train an agent for 2.5 billion steps of experience (the equivalent of 80 years of human experience) -- over 6 months of GPU-time training in under 3 days of wall-clock time with 64 GPUs.

Autonomous Navigation Navigate +2
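
The recipe behind that scaling is synchronous and decentralized: every worker collects its own rollouts and computes a local PPO gradient, gradients are averaged across workers (an all-reduce) before each update, and stragglers are preempted so the synchronous step is not held hostage by the slowest worker. A toy simulation of just the gradient-averaging step, with the distributed all-reduce stubbed out by a numpy mean:

```python
import numpy as np

rng = np.random.default_rng(2)
n_workers, n_params = 8, 5
params = np.zeros(n_params)

def local_gradient(worker_id, params):
    # Stand-in for "collect rollouts, run PPO locally, return the gradient".
    return params - rng.normal(loc=worker_id % 2, size=params.shape)

for update in range(3):
    grads = np.stack([local_gradient(w, params) for w in range(n_workers)])
    avg_grad = grads.mean(axis=0)      # in DD-PPO this is a distributed all-reduce
    params -= 0.1 * avg_grad
    print(update, params.round(3))
```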

Cross-channel Communication Networks

1 code implementation NeurIPS 2019 Jianwei Yang, Zhile Ren, Chuang Gan, Hongyuan Zhu, Devi Parikh

Convolutional neural networks process input data by sending channel-wise feature response maps to subsequent layers.

12-in-1: Multi-Task Vision and Language Representation Learning

5 code implementations CVPR 2020 Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, Stefan Lee

Much of vision-and-language research focuses on a small but diverse set of independent tasks and supporting datasets often studied in isolation; however, the visually-grounded language understanding skills required for success at these tasks overlap significantly.

Image Retrieval Question Answering +3

Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline

2 code implementations ECCV 2020 Vishvak Murahari, Dhruv Batra, Devi Parikh, Abhishek Das

Next, we find that additional finetuning using "dense" annotations in VisDial leads to even higher NDCG -- more than 10% over our base model -- but hurts MRR -- more than 17% below our base model!

Language Modelling Representation Learning +2
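
Because the trade-off above is expressed in NDCG and MRR, a worked computation of both metrics on a toy ranking may help make the contrast concrete (standard definitions; the relevance scores are made up):

```python
import numpy as np

# Relevance of candidate responses in the order the model ranked them.
ranked_relevance = np.array([0.0, 1.0, 0.0, 0.5, 1.0])

def dcg(rel):
    return np.sum(rel / np.log2(np.arange(2, len(rel) + 2)))

ndcg = dcg(ranked_relevance) / dcg(np.sort(ranked_relevance)[::-1])

# Reciprocal rank uses only the position of the first fully relevant candidate (rank 2 here);
# MRR averages this quantity over many examples.
first_hit = int(np.argmax(ranked_relevance == 1.0)) + 1
rr = 1.0 / first_hit

print(f"NDCG = {ndcg:.3f}, reciprocal rank = {rr:.3f}")
```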

SQuINTing at VQA Models: Introspecting VQA Models with Sub-Questions

no code implementations CVPR 2020 Ramprasaath R. Selvaraju, Purva Tendulkar, Devi Parikh, Eric Horvitz, Marco Ribeiro, Besmira Nushi, Ece Kamar

We quantify the extent to which this phenomenon occurs by creating a new Reasoning split of the VQA dataset and collecting VQA-introspect, a new dataset which consists of 238K new perception questions that serve as sub-questions corresponding to the set of perceptual tasks needed to effectively answer the complex reasoning questions in the Reasoning split.

Visual Question Answering (VQA)

Predicting A Creator's Preferences In, and From, Interactive Generative Art

no code implementations3 Mar 2020 Devi Parikh

These preferences could be in the specific generative art form (e.g., color palettes, density of the piece, thickness or curvatures of any lines in the piece); predicting them could lead to a smarter interactive tool.

Are we pretraining it right? Digging deeper into visio-linguistic pretraining

no code implementations19 Apr 2020 Amanpreet Singh, Vedanuj Goswami, Devi Parikh

Surprisingly, we show that automatically generated data in a domain closer to the downstream task (e.g., VQA v2) is a better choice for pretraining than "natural" data from a slightly different domain (e.g., Conceptual Captions).

Visual Question Answering (VQA)

Improving Vision-and-Language Navigation with Image-Text Pairs from the Web

1 code implementation ECCV 2020 Arjun Majumdar, Ayush Shrivastava, Stefan Lee, Peter Anderson, Devi Parikh, Dhruv Batra

Following a navigation instruction such as 'Walk down the stairs and stop at the brown sofa' requires embodied AI agents to ground scene elements referenced via language (e.g., 'stairs') to visual content in the environment (pixels corresponding to 'stairs').

Vision and Language Navigation

Exploring Crowd Co-creation Scenarios for Sketches

no code implementations15 May 2020 Devi Parikh, C. Lawrence Zitnick

As a first step towards studying the ability of human crowds and machines to effectively co-create, we explore several human-only collaborative co-creation scenarios.

Extended Abstract: Improving Vision-and-Language Navigation with Image-Text Pairs from the Web

no code implementations ICML Workshop LaReL 2020 Arjun Majumdar, Ayush Shrivastava, Stefan Lee, Peter Anderson, Devi Parikh, Dhruv Batra

Following a navigation instruction such as 'Walk down the stairs and stop near the sofa' requires an agent to ground scene elements referenced via language (e.g., 'stairs') to visual content in the environment (pixels corresponding to 'stairs').

Vision and Language Navigation

Feel The Music: Automatically Generating A Dance For An Input Song

1 code implementation21 Jun 2020 Purva Tendulkar, Abhishek Das, Aniruddha Kembhavi, Devi Parikh

We encode intuitive, flexible heuristics for what a 'good' dance is: the structure of the dance should align with the structure of the music.

Neuro-Symbolic Generative Art: A Preliminary Study

no code implementations4 Jul 2020 Gunjan Aggarwal, Devi Parikh

There are two classes of generative art approaches: neural, where a deep model is trained to generate samples from a data distribution, and symbolic or algorithmic, where an artist designs the primary parameters and an autonomous system generates samples within these constraints.

Integrating Egocentric Localization for More Realistic Point-Goal Navigation Agents

no code implementations7 Sep 2020 Samyak Datta, Oleksandr Maksymets, Judy Hoffman, Stefan Lee, Dhruv Batra, Devi Parikh

This enables a seamless adaptation to changing dynamics (a different robot or floor type) by simply re-calibrating the visual odometry model -- circumventing the expense of re-training the navigation policy.

Navigate Robot Navigation +1

Contrast and Classify: Training Robust VQA Models

1 code implementation ICCV 2021 Yash Kant, Abhinav Moudgil, Dhruv Batra, Devi Parikh, Harsh Agrawal

Recent Visual Question Answering (VQA) models have shown impressive performance on the VQA benchmark but remain sensitive to small linguistic variations in input questions.

Contrastive Learning Data Augmentation +4

The Open Catalyst 2020 (OC20) Dataset and Community Challenges

5 code implementations20 Oct 2020 Lowik Chanussot, Abhishek Das, Siddharth Goyal, Thibaut Lavril, Muhammed Shuaibi, Morgane Riviere, Kevin Tran, Javier Heras-Domingo, Caleb Ho, Weihua Hu, Aini Palizhati, Anuroop Sriram, Brandon Wood, Junwoong Yoon, Devi Parikh, C. Lawrence Zitnick, Zachary Ulissi

Catalyst discovery and optimization is key to solving many societal and energy challenges including solar fuels synthesis, long-term energy storage, and renewable fertilizer production.

SOrT-ing VQA Models : Contrastive Gradient Learning for Improved Consistency

1 code implementation NAACL 2021 Sameer Dharur, Purva Tendulkar, Dhruv Batra, Devi Parikh, Ramprasaath R. Selvaraju

Recent research in Visual Question Answering (VQA) has revealed state-of-the-art models to be inconsistent in their understanding of the world -- they answer seemingly difficult questions requiring reasoning correctly but get simpler associated sub-questions wrong.

Question Answering Visual Grounding +1

Sim-to-Real Transfer for Vision-and-Language Navigation

1 code implementation7 Nov 2020 Peter Anderson, Ayush Shrivastava, Joanne Truong, Arjun Majumdar, Devi Parikh, Dhruv Batra, Stefan Lee

We study the challenging problem of releasing a robot in a previously unseen environment, and having it follow unconstrained natural language navigation instructions.

Vision and Language Navigation

Where Are You? Localization from Embodied Dialog

2 code implementations EMNLP 2020 Meera Hahn, Jacob Krantz, Dhruv Batra, Devi Parikh, James M. Rehg, Stefan Lee, Peter Anderson

In this paper, we focus on the LED task -- providing a strong baseline model with detailed ablations characterizing both dataset biases and the importance of various modeling choices.

Navigate Visual Dialog

Object-Centric Diagnosis of Visual Reasoning

no code implementations21 Dec 2020 Jianwei Yang, Jiayuan Mao, Jiajun Wu, Devi Parikh, David D. Cox, Joshua B. Tenenbaum, Chuang Gan

In contrast, symbolic and modular models have a relatively better grounding and robustness, though at the cost of accuracy.

Object Question Answering +2

ForceNet: A Graph Neural Network for Large-Scale Quantum Chemistry Simulation

no code implementations1 Jan 2021 Weihua Hu, Muhammed Shuaibi, Abhishek Das, Siddharth Goyal, Anuroop Sriram, Jure Leskovec, Devi Parikh, Larry Zitnick

We use ForceNet to perform quantum chemistry simulations, where ForceNet is able to achieve a 4x higher success rate than existing ML models.

Atomic Forces

VISITRON: Visual Semantics-Aligned Interactively Trained Object-Navigator

1 code implementation Findings (ACL) 2022 Ayush Shrivastava, Karthik Gopalakrishnan, Yang Liu, Robinson Piramuthu, Gokhan Tür, Devi Parikh, Dilek Hakkani-Tür

Interactive robots navigating photo-realistic environments need to be trained to effectively leverage and handle the dynamic nature of dialogue in addition to the challenges underlying vision-and-language navigation (VLN).

Binary Classification Imitation Learning +3

Human-Adversarial Visual Question Answering

no code implementations NeurIPS 2021 Sasha Sheng, Amanpreet Singh, Vedanuj Goswami, Jose Alberto Lopez Magana, Wojciech Galuba, Devi Parikh, Douwe Kiela

Human subjects interact with a state-of-the-art VQA model, and for each image in the dataset, attempt to find a question where the model's predicted answer is incorrect.

Question Answering Visual Question Answering

Building Bridges: Generative Artworks to Explore AI Ethics

no code implementations25 Jun 2021 Ramya Srinivasan, Devi Parikh

In recent years, there has been an increased emphasis on understanding and mitigating adverse impacts of artificial intelligence (AI) technologies on society.

Ethics

Visual Conceptual Blending with Large-scale Language and Vision Models

no code implementations27 Jun 2021 Songwei Ge, Devi Parikh

We ask the question: to what extent can recent large-scale language and image generation models blend visual concepts?

Image Generation Language Modelling +2

Telling Creative Stories Using Generative Visual Aids

no code implementations27 Oct 2021 Safinah Ali, Devi Parikh

Can visual artworks created using generative visual algorithms inspire human creativity in storytelling?

Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors

1 code implementation24 Mar 2022 Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, Yaniv Taigman

Recent text-to-image generation methods provide a simple yet exciting conversion capability between text and image domains.

Ranked #20 on Text-to-Image Generation on MS COCO (using extra training data)

Semantic Segmentation Text-to-Image Generation

Episodic Memory Question Answering

no code implementations CVPR 2022 Samyak Datta, Sameer Dharur, Vincent Cartillier, Ruta Desai, Mukul Khanna, Dhruv Batra, Devi Parikh

Towards that end, we introduce (1) a new task - Episodic Memory Question Answering (EMQA) wherein an egocentric AI assistant is provided with a video sequence (the tour) and a question as an input and is asked to localize its answer to the question within the tour, (2) a dataset of grounded questions designed to probe the agent's spatio-temporal understanding of the tour, and (3) a model for the task that encodes the scene as an allocentric, top-down semantic feature map and grounds the question into the map to localize the answer.

Question Answering

Make-A-Video: Text-to-Video Generation without Text-Video Data

2 code implementations29 Sep 2022 Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, Yaniv Taigman

We propose Make-A-Video -- an approach for directly translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V).

Ranked #3 on Text-to-Video Generation on MSR-VTT (CLIP-FID metric)

Image Generation Super-Resolution +2

AudioGen: Textually Guided Audio Generation

1 code implementation30 Sep 2022 Felix Kreuk, Gabriel Synnaeve, Adam Polyak, Uriel Singer, Alexandre Défossez, Jade Copet, Devi Parikh, Yaniv Taigman, Yossi Adi

Finally, we explore the ability of the proposed method to generate audio continuation conditionally and unconditionally.

Audio Generation Descriptive

SpaText: Spatio-Textual Representation for Controllable Image Generation

no code implementations CVPR 2023 Omri Avrahami, Thomas Hayes, Oran Gafni, Sonal Gupta, Yaniv Taigman, Devi Parikh, Dani Lischinski, Ohad Fried, Xi Yin

Due to the lack of large-scale datasets that have a detailed textual description for each region in the image, we choose to leverage the current large-scale text-to-image datasets and base our approach on a novel CLIP-based spatio-textual representation, and show its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-based.

Text-to-Image Generation

Text-To-4D Dynamic Scene Generation

no code implementations26 Jan 2023 Uriel Singer, Shelly Sheynin, Adam Polyak, Oron Ashual, Iurii Makarov, Filippos Kokkinos, Naman Goyal, Andrea Vedaldi, Devi Parikh, Justin Johnson, Yaniv Taigman

We present MAV3D (Make-A-Video3D), a method for generating three-dimensional dynamic scenes from text descriptions.

Scene Generation

Text-Conditional Contextualized Avatars For Zero-Shot Personalization

no code implementations14 Apr 2023 Samaneh Azadi, Thomas Hayes, Akbar Shah, Guan Pang, Devi Parikh, Sonal Gupta

Recent large-scale text-to-image generation models have made significant improvements in the quality, realism, and diversity of the synthesized images and enable users to control the created content through language.

Text to 3D Text-to-Image Generation

Make-An-Animation: Large-Scale Text-conditional 3D Human Motion Generation

no code implementations ICCV 2023 Samaneh Azadi, Akbar Shah, Thomas Hayes, Devi Parikh, Sonal Gupta

However, existing approaches are limited by their reliance on relatively small-scale motion capture data, leading to poor performance on more diverse, in-the-wild prompts.

Motion Synthesis Text-to-Video Generation +1

Emu Edit: Precise Image Editing via Recognition and Generation Tasks

no code implementations16 Nov 2023 Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, Yaniv Taigman

Lastly, to facilitate a more rigorous and informed assessment of instructable image editing models, we release a new challenging and versatile benchmark that includes seven different image editing tasks.

Image Inpainting Multi-Task Learning +1

Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning

no code implementations17 Nov 2023 Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, Ishan Misra

We present Emu Video, a text-to-video generation model that factorizes the generation into two steps: first generating an image conditioned on the text, and then generating a video conditioned on the text and the generated image.

Text-to-Video Generation Video Generation

Video Editing via Factorized Diffusion Distillation

no code implementations14 Mar 2024 Uriel Singer, Amit Zohar, Yuval Kirstain, Shelly Sheynin, Adam Polyak, Devi Parikh, Yaniv Taigman

We introduce Emu Video Edit (EVE), a model that establishes a new state of the art in video editing without relying on any supervised video editing data.

Video Editing Video Generation
