1 code implementation • ACL 2022 • Xingyu Fu, Ben Zhou, Ishaan Chandratreya, Carl Vondrick, Dan Roth
To human eyes, images often carry more significance than their pixels alone, as we can infer, associate, and reason with contextual information from other sources to establish a more complete picture.
1 code implementation • NAACL (ACL) 2022 • Xinya Du, Zixuan Zhang, Sha Li, Pengfei Yu, Hongwei Wang, Tuan Lai, Xudong Lin, Ziqi Wang, Iris Liu, Ben Zhou, Haoyang Wen, Manling Li, Darryl Hannan, Jie Lei, Hyounghun Kim, Rotem Dror, Haoyu Wang, Michael Regan, Qi Zeng, Qing Lyu, Charles Yu, Carl Edwards, Xiaomeng Jin, Yizhu Jiao, Ghazaleh Kazeminejad, Zhenhailong Wang, Chris Callison-Burch, Mohit Bansal, Carl Vondrick, Jiawei Han, Dan Roth, Shih-Fu Chang, Martha Palmer, Heng Ji
We introduce RESIN-11, a new schema-guided event extraction and prediction framework that can be applied to a large variety of newsworthy scenarios.
1 code implementation • 24 May 2023 • Rundi Wu, Ruoshi Liu, Carl Vondrick, Changxi Zheng
Specifically, we encode the input 3D textured shape into triplane feature maps that represent the signed distance and texture fields of the input.
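As a rough illustration of the idea, here is a minimal sketch of how such a triplane representation can be queried: three axis-aligned feature planes are bilinearly sampled at each 3D point, and a small MLP decodes the pooled features into a signed distance and a texture value. The resolution, feature size, and decoder below are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn.functional as F

# Minimal triplane query: three axis-aligned feature planes (XY, XZ, YZ)
# are bilinearly sampled at each 3D point; the summed features are decoded
# by a small MLP into a signed distance and an RGB texture value.
C, R = 32, 128                      # feature channels, plane resolution
planes = torch.randn(3, C, R, R)    # stand-in for the encoder's output

def query_triplane(points):         # points: (N, 3) in [-1, 1]^3
    xy, xz, yz = points[:, [0, 1]], points[:, [0, 2]], points[:, [1, 2]]
    feats = 0
    for plane, coords in zip(planes, (xy, xz, yz)):
        grid = coords.view(1, -1, 1, 2)                # (1, N, 1, 2)
        sampled = F.grid_sample(plane[None], grid,     # bilinear lookup
                                align_corners=True)    # (1, C, N, 1)
        feats = feats + sampled[0, :, :, 0].t()        # (N, C)
    return feats

decoder = torch.nn.Sequential(      # tiny MLP: features -> (sdf, r, g, b)
    torch.nn.Linear(C, 64), torch.nn.ReLU(), torch.nn.Linear(64, 4))

pts = torch.rand(1024, 3) * 2 - 1
out = decoder(query_triplane(pts))
sdf, rgb = out[:, :1], torch.sigmoid(out[:, 1:])
```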
1 code implementation • CVPR 2023 • Basile Van Hoorick, Pavel Tokmakov, Simon Stent, Jie Li, Carl Vondrick
Tracking objects with persistence in cluttered and dynamic environments remains a difficult challenge for computer vision systems.
no code implementations • CVPR 2023 • Ruoshi Liu, Carl Vondrick
The relatively high temperature of the human body turns people into long-wave infrared light sources.
no code implementations • 13 Apr 2023 • Arjun Mani, Ishaan Preetam Chandratreya, Elliot Creager, Carl Vondrick, Richard Zemel
Modeling the mechanics of fluid in complex scenes is vital to applications in design, graphics, and robotics.
1 code implementation • 20 Mar 2023 • Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, Carl Vondrick
We introduce Zero-1-to-3, a framework for changing the camera viewpoint of an object given just a single RGB image.
no code implementations • 14 Mar 2023 • Dídac Surís, Sachit Menon, Carl Vondrick
Answering visual queries is a complex task that requires both visual processing and reasoning.
Ranked #6 on Video Question Answering on NExT-QA
1 code implementation • 26 Jan 2023 • Scott Geng, Revant Teotia, Purva Tendulkar, Sachit Menon, Carl Vondrick
We introduce a video framework for modeling the association between verbal and non-verbal communication during dyadic conversation.
no code implementations • CVPR 2023 • Ruoshi Liu, Sachit Menon, Chengzhi Mao, Dennis Park, Simon Stent, Carl Vondrick
Experiments and visualizations show that the method is able to generate multiple possible solutions that are consistent with the observation of the shadow.
1 code implementation • 14 Dec 2022 • Chengzhi Mao, Scott Geng, Junfeng Yang, Xin Wang, Carl Vondrick
We apply this training loss to two adaptation methods, model finetuning and visual prompt tuning.
no code implementations • 13 Dec 2022 • Lingyu Zhang, Chengzhi Mao, Junfeng Yang, Carl Vondrick
Even under adaptive attacks where the adversary knows our defense, our algorithm is still effective.
no code implementations • 12 Dec 2022 • Chengzhi Mao, Lingyu Zhang, Abhishek Joshi, Junfeng Yang, Hao Wang, Carl Vondrick
In this paper, we introduce a framework that uses the dense intrinsic constraints in natural images to robustify inference.
1 code implementation • CVPR 2023 • Chengzhi Mao, Revant Teotia, Amrutha Sundar, Sachit Menon, Junfeng Yang, Xin Wang, Carl Vondrick
We propose a "doubly right" object recognition benchmark, where the metric requires the model to simultaneously produce both the right labels and the right rationales.
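A minimal sketch of what a "doubly right" metric could look like, assuming a hypothetical data format of (label, rationale) pairs rather than the benchmark's actual schema:

```python
# A prediction only counts as correct when BOTH the predicted label and the
# predicted rationale match the ground truth. The data format is hypothetical.
def doubly_right_accuracy(predictions, ground_truth):
    """predictions / ground_truth: lists of (label, rationale) pairs."""
    correct = sum(
        p_label == g_label and p_rat == g_rat
        for (p_label, p_rat), (g_label, g_rat) in zip(predictions, ground_truth)
    )
    return correct / len(ground_truth)

preds = [("dog", "floppy ears"), ("cat", "whiskers")]
truth = [("dog", "floppy ears"), ("cat", "retractable claws")]
print(doubly_right_accuracy(preds, truth))  # 0.5: right label, wrong rationale
```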
no code implementations • 8 Dec 2022 • Sachit Menon, Ishaan Preetam Chandratreya, Carl Vondrick
Incidental supervision from language has become a popular approach for learning generic visual representations that can be prompted to perform many recognition tasks in computer vision.
no code implementations • 5 Dec 2022 • Mia Chiquier, Carl Vondrick
The dataset consists of 12.5 hours of synchronized video and surface electromyography (sEMG) data of 10 subjects performing various exercises.
no code implementations • 2 Dec 2022 • Hui Lu, Mia Chiquier, Carl Vondrick
We introduce a framework for navigating through cluttered environments by connecting multiple cameras together while simultaneously preserving privacy.
no code implementations • CVPR 2023 • Purva Tendulkar, Dídac Surís, Carl Vondrick
Towards this goal, we address the task of generating a virtual human -- hands and full body -- grasping everyday objects.
no code implementations • 13 Oct 2022 • Sachit Menon, Carl Vondrick
By basing decisions on these descriptors, we can provide additional cues that encourage the model to use the features we want it to use.
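A minimal sketch of this classify-by-description idea, assuming a hypothetical CLIP-style encoder pair (`encode_image`, `encode_text`) and toy descriptors; the actual prompts and scoring in the paper may differ:

```python
import numpy as np

# Each class is represented by a set of language descriptors; the image is
# scored against every descriptor with an image-text similarity model, and
# per-class scores are averaged. Encoders here are hypothetical stand-ins.
def classify_by_description(image, class_descriptors, encode_image, encode_text):
    img = encode_image(image)                        # (d,) unit-norm embedding
    scores = {}
    for cls, descriptors in class_descriptors.items():
        sims = [float(img @ encode_text(f"{cls}, which has {d}"))
                for d in descriptors]                # cosine sim per descriptor
        scores[cls] = np.mean(sims)                  # average over descriptors
    return max(scores, key=scores.get), scores

class_descriptors = {
    "hen": ["a red comb", "two legs", "feathers"],
    "tiger": ["orange fur with black stripes", "four legs", "whiskers"],
}

rng = np.random.default_rng(0)
def toy_encode(_):                                   # hypothetical stand-in
    v = rng.standard_normal(64)
    return v / np.linalg.norm(v)

pred, scores = classify_by_description("img.jpg", class_descriptors,
                                       toy_encode, toy_encode)
```

A side benefit of this formulation is interpretability: the per-descriptor similarities double as the "cues" the abstract mentions, since one can inspect which descriptors drove the decision.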
no code implementations • 4 Oct 2022 • Dídac Surís, Carl Vondrick
We introduce a representation learning framework for spatial trajectories.
no code implementations • 19 Jul 2022 • Sachit Menon, David Blei, Carl Vondrick
Variational autoencoders (VAEs) suffer from posterior collapse, where the powerful neural networks used for modeling and inference optimize the objective without meaningfully using the latent representation.
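For reference, the evidence lower bound (ELBO) that VAEs optimize, and the collapse it permits, can be written as:

```latex
\mathcal{L}(\theta,\phi;x)
  = \mathbb{E}_{q_\phi(z\mid x)}\big[\log p_\theta(x\mid z)\big]
  - \mathrm{KL}\!\left(q_\phi(z\mid x)\,\|\,p(z)\right),
\qquad
\text{collapse: } q_\phi(z\mid x)\approx p(z)\ \ \forall x
\;\Rightarrow\; \mathrm{KL}\approx 0,\ \ I(x;z)\approx 0.
```

When the approximate posterior matches the prior for every input, the KL term vanishes, the decoder ignores z, and the mutual information between x and z drops to zero, which is exactly the failure mode described above.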
no code implementations • 17 Jun 2022 • Ruoshi Liu, Sachit Menon, Chengzhi Mao, Dennis Park, Simon Stent, Carl Vondrick
Experiments and visualizations show that the method is able to generate multiple possible solutions that are consistent with the observation of the shadow.
no code implementations • 17 Jun 2022 • Ruoshi Liu, Chengzhi Mao, Purva Tendulkar, Hao Wang, Carl Vondrick
Many machine learning methods operate by inverting a neural network at inference time, which has become a popular technique for solving inverse problems in computer vision, robotics, and graphics.
no code implementations • CVPR 2022 • Didac Suris, Carl Vondrick, Bryan Russell, Justin Salamon
In order to capture the high-level concepts that are required to solve the task, we propose modeling the long-term temporal context of both the video and the music signals, using Transformer networks for each modality.
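A minimal sketch of this setup, with one Transformer encoder per modality pooled into clip-level embeddings and a symmetric InfoNCE-style contrastive loss; all dimensions, depths, and the loss choice are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalEncoder(nn.Module):
    """One Transformer per modality to model long-term temporal context."""
    def __init__(self, in_dim, dim=256, depth=4, heads=8):
        super().__init__()
        self.proj = nn.Linear(in_dim, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, x):            # x: (batch, time, in_dim)
        h = self.encoder(self.proj(x))
        return F.normalize(h.mean(dim=1), dim=-1)   # pooled clip embedding

video_enc = TemporalEncoder(in_dim=512)   # e.g. per-frame visual features
music_enc = TemporalEncoder(in_dim=128)   # e.g. per-window audio features

def contrastive_loss(v, m, temperature=0.07):
    logits = v @ m.t() / temperature          # (batch, batch) similarities
    targets = torch.arange(len(v))            # matched pairs on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

v = video_enc(torch.randn(8, 64, 512))        # 8 clips x 64 frames
m = music_enc(torch.randn(8, 256, 128))       # 8 tracks x 256 windows
loss = contrastive_loss(v, m)
```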
1 code implementation • CVPR 2022 • Chengzhi Mao, Kevin Xia, James Wang, Hao Wang, Junfeng Yang, Elias Bareinboim, Carl Vondrick
Visual representations underlie object recognition tasks, but they often contain both robust and non-robust features.
no code implementations • CVPR 2022 • Basile Van Hoorick, Purva Tendulkar, Didac Suris, Dennis Park, Simon Stent, Carl Vondrick
For computer vision systems to operate in dynamic situations, they need to be able to represent and reason about object permanence.
1 code implementation • 1 Mar 2022 • Xingyu Fu, Ben Zhou, Ishaan Preetam Chandratreya, Carl Vondrick, Dan Roth
For example, in Figure 1, we can find a way to identify the news articles related to the picture through segment-wise understandings of the signs, the buildings, the crowds, and more.
1 code implementation • CVPR 2022 • Will Price, Carl Vondrick, Dima Damen
Our lives can be seen as a complex weaving of activities; we switch from one activity to another, to maximise our achievements or in reaction to demands placed upon us.
no code implementations • ICLR 2022 • Mia Chiquier, Chengzhi Mao, Carl Vondrick
Automatic speech recognition systems have created exciting possibilities for applications; however, they also enable opportunities for systematic eavesdropping.
Automatic Speech Recognition (ASR)
1 code implementation • ICLR 2022 • Chengzhi Mao, Lu Jiang, Mostafa Dehghani, Carl Vondrick, Rahul Sukthankar, Irfan Essa
Vision Transformer (ViT) is emerging as the state-of-the-art architecture for image recognition.
Ranked #3 on Domain Generalization on Stylized-ImageNet
1 code implementation • 11 Nov 2021 • Boyuan Chen, Robert Kwiatkowski, Carl Vondrick, Hod Lipson
Internal computational models of physical bodies are fundamental to the ability of robots and animals alike to plan and control their actions.
1 code implementation • CVPR 2021 • Dídac Surís, Ruoshi Liu, Carl Vondrick
We introduce a framework for learning from unlabeled video what is predictable in the future.
Representation Learning, Self-Supervised Action Recognition
1 code implementation • NAACL 2021 • Haoyang Wen, Ying Lin, Tuan Lai, Xiaoman Pan, Sha Li, Xudong Lin, Ben Zhou, Manling Li, Haoyu Wang, Hongming Zhang, Xiaodong Yu, Alexander Dong, Zhenhailong Wang, Yi Fung, Piyush Mishra, Qing Lyu, Dídac Surís, Brian Chen, Susan Windisch Brown, Martha Palmer, Chris Callison-Burch, Carl Vondrick, Jiawei Han, Dan Roth, Shih-Fu Chang, Heng Ji
We present a new information extraction system that can automatically construct temporal event graphs from a collection of news documents from multiple sources, multiple languages (English and Spanish for our experiment), and multiple data modalities (speech, text, image and video).
1 code implementation • 17 May 2021 • Boyuan Chen, Mia Chiquier, Hod Lipson, Carl Vondrick
Due to the many ways that robots use containers, we believe the box will have a number of applications in robotics.
1 code implementation • ICCV 2021 • Chengzhi Mao, Mia Chiquier, Hao Wang, Junfeng Yang, Carl Vondrick
We find that images contain intrinsic structure that enables the reversal of many adversarial attacks.
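The abstract does not spell out the mechanism, but one common instantiation of such a defense is to optimize a small, bounded correction to the input at test time so that a self-supervised consistency objective is restored before classification. The sketch below follows that assumption; the objective `ss_loss`, step sizes, and bound are hypothetical stand-ins, not the paper's exact procedure.

```python
import torch

def reverse_attack(x_adv, ss_loss, steps=20, lr=1e-2, eps=8 / 255):
    """Optimize a bounded correction so a self-supervised loss is restored."""
    delta = torch.zeros_like(x_adv, requires_grad=True)
    opt = torch.optim.SGD([delta], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        ss_loss(x_adv + delta).backward()   # restore intrinsic consistency
        opt.step()
        with torch.no_grad():               # stay close to the observed input
            delta.clamp_(-eps, eps)
    return (x_adv + delta).detach()         # purified input for the classifier
```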
no code implementations • 1 Jan 2021 • Bo Wu, Haoyu Qin, Alireza Zareian, Carl Vondrick, Shih-Fu Chang
Children acquire language subconsciously by observing the surrounding world and listening to descriptions.
1 code implementation • CVPR 2021 • Chengzhi Mao, Augustine Cha, Amogh Gupta, Hao Wang, Junfeng Yang, Carl Vondrick
We introduce a framework for learning robust visual representations that generalize to new viewpoints, backgrounds, and scene contexts.
Ranked #43 on Image Classification on ObjectNet (using extra training data)
1 code implementation • CVPR 2022 • Dídac Surís, Dave Epstein, Carl Vondrick
Machine translation between many languages at once is highly challenging, since training with ground truth requires supervision between all language pairs, which is difficult to obtain.
1 code implementation • ICCV 2021 • Basile Van Hoorick, Carl Vondrick
The elementary operation of cropping underpins nearly every computer vision system, ranging from data augmentation and translation invariance to computational photography and representation learning.
1 code implementation • NeurIPS 2020 • Ruilin Xu, Rundi Wu, Yuko Ishiwaka, Carl Vondrick, Changxi Zheng
We introduce a deep learning model for speech denoising, a long-standing challenge in audio analysis arising in numerous applications.
1 code implementation • ECCV 2020 • Alex Andonian, Camilo Fosco, Mathew Monfort, Allen Lee, Rogerio Feris, Carl Vondrick, Aude Oliva
This allows our model to perform cognitive tasks such as set abstraction (which general concept is in common among a set of videos?).
no code implementations • 22 Jul 2020 • Bo Wu, Haoyu Qin, Alireza Zareian, Carl Vondrick, Shih-Fu Chang
Children acquire language subconsciously by observing the surrounding world and listening to descriptions.
1 code implementation • ECCV 2020 • Chengzhi Mao, Amogh Gupta, Vikram Nitin, Baishakhi Ray, Shuran Song, Junfeng Yang, Carl Vondrick
Although deep networks achieve strong accuracy on a range of computer vision benchmarks, they remain vulnerable to adversarial attacks, where imperceptible input perturbations fool the network.
no code implementations • CVPR 2021 • Dave Epstein, Carl Vondrick
We introduce a framework that predicts the goals behind observable human action in video.
1 code implementation • ECCV 2020 • Dídac Surís, Dave Epstein, Heng Ji, Shih-Fu Chang, Carl Vondrick
Language acquisition is the process of learning words from the surrounding scene.
1 code implementation • CVPR 2020 • Dave Epstein, Boyuan Chen, Carl Vondrick
We train a supervised neural network as a baseline and analyze its performance compared to human consistency on the tasks.
no code implementations • 15 Oct 2019 • Boyuan Chen, Shuran Song, Hod Lipson, Carl Vondrick
We train embodied agents to play Visual Hide and Seek where a prey must navigate in a simulated environment in order to avoid capture from a predator.
1 code implementation • NeurIPS 2019 • Chengzhi Mao, Ziyuan Zhong, Junfeng Yang, Carl Vondrick, Baishakhi Ray
Deep networks are well-known to be fragile to adversarial attacks.
no code implementations • ICLR 2019 • Te-Lin Wu, Jaedong Hwang, Jingyun Yang, Shaofan Lai, Carl Vondrick, Joseph J. Lim
A noisy and diverse demonstration set may hinder the performance of an agent aiming to acquire certain skills via imitation learning.
no code implementations • CVPR 2019 • Chen Sun, Abhinav Shrivastava, Carl Vondrick, Rahul Sukthankar, Kevin Murphy, Cordelia Schmid
This paper focuses on multi-person action forecasting in videos.
3 code implementations • ICCV 2019 • Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, Cordelia Schmid
Self-supervised learning has become increasingly important to leverage the abundance of unlabeled data available on platforms like YouTube.
Ranked #1 on Action Classification on YouCook2
1 code implementation • CVPR 2019 • Hassan Akbari, Svebor Karaman, Surabhi Bhargava, Brian Chen, Carl Vondrick, Shih-Fu Chang
Following dedicated non-linear mappings for visual features at each level, word, and sentence embeddings, we obtain multiple instantiations of our common semantic space in which comparisons between any target text and the visual content are performed with cosine similarity (see the sketch below).
Ranked #1 on Phrase Grounding on ReferIt
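A minimal sketch of such a multi-level common semantic space, with dedicated two-layer mappings per visual level plus word and sentence mappings, and matching by averaged cosine similarity; all dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mapping(in_dim, out_dim=512):
    """Dedicated non-linear mapping into the shared semantic space."""
    return nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU(),
                         nn.Linear(out_dim, out_dim))

visual_maps = nn.ModuleList([mapping(d) for d in (256, 512, 1024)])  # 3 levels
word_map, sent_map = mapping(300), mapping(768)   # word / sentence embeddings

def similarity(visual_feats, text_emb, text_map):
    t = F.normalize(text_map(text_emb), dim=-1)
    sims = [F.normalize(m(v), dim=-1) @ t          # cosine sim per level
            for m, v in zip(visual_maps, visual_feats)]
    return torch.stack(sims).mean()                # average across levels

feats = [torch.randn(d) for d in (256, 512, 1024)]
score = similarity(feats, torch.randn(768), sent_map)
```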
1 code implementation • ECCV 2018 • Chen Sun, Abhinav Shrivastava, Carl Vondrick, Kevin Murphy, Rahul Sukthankar, Cordelia Schmid
A visualization of the learned relation features confirms that our approach is able to attend to the relevant relations for each action.
Ranked #14 on Action Recognition on AVA v2.1
1 code implementation • ECCV 2018 • Carl Vondrick, Abhinav Shrivastava, Alireza Fathi, Sergio Guadarrama, Kevin Murphy
We use large amounts of unlabeled video to learn models for visual tracking without manual human supervision.
2 code implementations • ECCV 2018 • Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh Mcdermott, Antonio Torralba
We introduce PixelPlayer, a system that, by leveraging large amounts of unlabeled videos, learns to locate image regions which produce sounds and separate the input sounds into a set of components that represents the sound from each pixel.
4 code implementations • 9 Jan 2018 • Mathew Monfort, Alex Andonian, Bolei Zhou, Kandan Ramakrishnan, Sarah Adel Bargal, Tom Yan, Lisa Brown, Quanfu Fan, Dan Gutfreund, Carl Vondrick, Aude Oliva
We present the Moments in Time Dataset, a large-scale human-annotated collection of one million short videos corresponding to dynamic events unfolding within three seconds.
no code implementations • ICLR 2018 • Deniz Oktay, Carl Vondrick, Antonio Torralba
However, when a layer is removed, the model learns to produce a different image that still looks natural to an adversary, which is possible by removing objects.
no code implementations • 5 Dec 2017 • Kexin Pei, Linjie Zhu, Yinzhi Cao, Junfeng Yang, Carl Vondrick, Suman Jana
Finally, we show that retraining using the safety violations detected by VeriVis can reduce the average number of violations by up to 60.2%.
no code implementations • ICCV 2017 • Adria Recasens, Carl Vondrick, Aditya Khosla, Antonio Torralba
In this paper, we present an approach for following gaze in video by predicting where a person (in the video) is looking even when the object is in a different frame.
no code implementations • CVPR 2017 • Carl Vondrick, Antonio Torralba
We present a model that generates the future by transforming pixels in the past.
1 code implementation • 3 Jun 2017 • Yusuf Aytar, Carl Vondrick, Antonio Torralba
We capitalize on large amounts of readily-available, synchronous data to learn deep discriminative representations shared across three major natural modalities: vision, sound, and language.
6 code implementations • CVPR 2018 • Chunhui Gu, Chen Sun, David A. Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, Jitendra Malik
The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.58M action labels with multiple labels per person occurring frequently.
Ranked #3 on Temporal Action Localization on UCF101-24
no code implementations • 9 Dec 2016 • Adrià Recasens, Carl Vondrick, Aditya Khosla, Antonio Torralba
In this paper, we present an approach for following gaze across views by predicting where a particular person is looking throughout a scene.
no code implementations • 4 Dec 2016 • Benjamin Eysenbach, Carl Vondrick, Antonio Torralba
We then create a representation of characters' beliefs for two tasks in human action understanding: predicting who is mistaken, and when they are mistaken.
6 code implementations • NeurIPS 2016 • Yusuf Aytar, Carl Vondrick, Antonio Torralba
We learn rich natural sound representations by capitalizing on large amounts of unlabeled sound data collected in the wild.
no code implementations • 27 Oct 2016 • Yusuf Aytar, Lluis Castrejon, Carl Vondrick, Hamed Pirsiavash, Antonio Torralba
Our experiments suggest that our scene representation can help transfer representations across modalities for retrieval.
no code implementations • NeurIPS 2016 • Carl Vondrick, Hamed Pirsiavash, Antonio Torralba
We capitalize on large amounts of unlabeled video in order to learn a model of scene dynamics for both video recognition tasks (e.g. action classification) and video generation tasks (e.g. future prediction).
no code implementations • CVPR 2016 • Lluis Castrejon, Yusuf Aytar, Carl Vondrick, Hamed Pirsiavash, Antonio Torralba
Our experiments suggest that our scene representation can help transfer representations across modalities for retrieval.
no code implementations • NeurIPS 2015 • Adria Recasens, Aditya Khosla, Carl Vondrick, Antonio Torralba
Humans have the remarkable ability to follow the gaze of other people to identify what they are looking at.
no code implementations • CVPR 2016 • Carl Vondrick, Hamed Pirsiavash, Antonio Torralba
The key idea behind our approach is that we can train deep networks to predict the visual representation of images in the future.
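A minimal sketch of this representation-forecasting idea: a forecaster regresses the features a fixed encoder would produce on a future frame, and recognition is done by classifying the predicted features. The encoder, shapes, and regressor below are stand-ins, not the paper's actual networks:

```python
import torch
import torch.nn as nn

# f regresses the feature vector that a fixed, pretrained encoder phi would
# produce on a frame several seconds in the future; no action labels are
# needed for this regression, only unlabeled video.
phi = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 512))  # stand-in encoder
f = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 1024),
                  nn.ReLU(), nn.Linear(1024, 512))               # forecaster

frames_now = torch.randn(16, 3, 64, 64)       # frames at time t
frames_future = torch.randn(16, 3, 64, 64)    # frames at time t + k (unlabeled)

with torch.no_grad():
    target = phi(frames_future)               # future representation as target
loss = nn.functional.mse_loss(f(frames_now), target)

classifier = nn.Linear(512, 10)               # trained on labeled phi features
action_logits = classifier(f(frames_now))     # recognize actions before they start
```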
no code implementations • 5 Mar 2015 • Xiangxin Zhu, Carl Vondrick, Charless Fowlkes, Deva Ramanan
Datasets for training object recognition systems are steadily increasing in size.
1 code implementation • 19 Feb 2015 • Carl Vondrick, Aditya Khosla, Hamed Pirsiavash, Tomasz Malisiewicz, Antonio Torralba
We introduce algorithms to visualize feature spaces used by object detectors.
no code implementations • NeurIPS 2015 • Carl Vondrick, Hamed Pirsiavash, Aude Oliva, Antonio Torralba
Although the human visual system can recognize many concepts under challenging conditions, it still has some biases.
no code implementations • CVPR 2016 • Carl Vondrick, Deniz Oktay, Hamed Pirsiavash, Antonio Torralba
In this paper, we introduce the problem of predicting why a person has performed an action in images.
no code implementations • 11 Dec 2012 • Carl Vondrick, Aditya Khosla, Tomasz Malisiewicz, Antonio Torralba
By visualizing feature spaces, we can gain a more intuitive understanding of our detection systems.
no code implementations • NeurIPS 2011 • Carl Vondrick, Deva Ramanan
We introduce a novel active learning framework for video annotation.