no code implementations • ECCV 2020 • Yuan-Ting Hu, Heng Wang, Nicolas Ballas, Kristen Grauman, Alexander G. Schwing
Video inpainting is an important technique for a wide variety of applications from video content editing to video restoration.
no code implementations • 17 Apr 2025 • Jang Hyun Cho, Andrea Madotto, Effrosyni Mavroudi, Triantafyllos Afouras, Tushar Nagarajan, Muhammad Maaz, Yale Song, Tengyu Ma, Shuming Hu, Suyog Jain, Miguel Martin, Huiyu Wang, Hanoona Rasheed, Peize Sun, Po-Yao Huang, Daniel Bolya, Nikhila Ravi, Shashank Jain, Tammy Stark, Shane Moon, Babak Damavandi, Vivian Lee, Andrew Westbury, Salman Khan, Philipp Krähenbühl, Piotr Dollár, Lorenzo Torresani, Kristen Grauman, Christoph Feichtenhofer
In this paper, we study building a Perception Language Model (PLM) in a fully open and reproducible framework for transparent research in image and video understanding.
no code implementations • 18 Mar 2025 • Chi Hsuan Wu, Kumar Ashutosh, Kristen Grauman
When obtaining visual illustrations from text descriptions, today's methods take a description with a single text context (a caption, or an action description) and retrieve or generate the matching visual context.
no code implementations • 24 Dec 2024 • Sagnik Majumder, Tushar Nagarajan, Ziad Al-Halah, Kristen Grauman
We introduce Switch-a-View, a model that learns to automatically select the viewpoint to display at each timepoint when creating a how-to video.
no code implementations • 3 Dec 2024 • Zihui Xue, Joungbin An, Xitong Yang, Kristen Grauman
The results demonstrate that ProgressCaptioner significantly surpasses leading captioning models, producing precise captions that accurately capture action progression and set a new standard for temporal precision in video captioning.
no code implementations • 1 Dec 2024 • Kumar Ashutosh, Georgios Pavlakos, Kristen Grauman
Anticipating how a person will interact with objects in an environment is essential for activity understanding, but existing methods are limited to the 2D space of video frames, capturing physically ungrounded predictions of 'what' and ignoring the 'where' and 'how'.
no code implementations • 13 Nov 2024 • Sagnik Majumder, Tushar Nagarajan, Ziad Al-Halah, Reina Pradhan, Kristen Grauman
Given a multi-view video, which viewpoint is most informative for a human observer?
no code implementations • 17 Oct 2024 • Bolin Lai, Sam Toyer, Tushar Nagarajan, Rohit Girdhar, Shengxin Zha, James M. Rehg, Kris Kitani, Kristen Grauman, Ruta Desai, Miao Liu
Predicting future human behavior is an increasingly popular topic in computer vision, driven by the interest in applications such as autonomous vehicles, digital assistants and human-robot interactions.
no code implementations • 1 Aug 2024 • Kumar Ashutosh, Tushar Nagarajan, Georgios Pavlakos, Kris Kitani, Kristen Grauman
Our method takes a video demonstration and its accompanying 3D body pose and generates (1) free-form expert commentary describing what the person is doing well and what they could improve, and (2) a visual expert demonstration that incorporates the required corrections.
no code implementations • 13 Jun 2024 • Changan Chen, Puyuan Peng, Ami Baid, Zihui Xue, Wei-Ning Hsu, David Harwath, Kristen Grauman
We propose a novel ambient-aware audio generation model, AV-LDM.
no code implementations • 11 Jun 2024 • Zihui Xue, Mi Luo, Changan Chen, Kristen Grauman
We study the problem of precisely swapping objects in videos, with a focus on those interacted with by hands, given one user-provided reference object image.
no code implementations • 5 May 2024 • Changan Chen, Jordi Ramos, Anshul Tomar, Kristen Grauman
We propose the first treatment of sim2real for audio-visual navigation by disentangling it into acoustic field prediction (AFP) and waypoint navigation.
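To make the decomposition concrete, here is a minimal sketch (module and function names are illustrative, not the paper's code): a learned stage maps binaural spectrograms to a coarse top-down sound-intensity map, and a purely geometric stage picks the loudest cell as the next waypoint.

```python
import torch
import torch.nn as nn

class AcousticFieldPredictor(nn.Module):
    """Hypothetical AFP stage: binaural spectrograms -> coarse top-down intensity map."""
    def __init__(self, grid=16):
        super().__init__()
        self.grid = grid
        self.net = nn.Sequential(
            nn.Conv2d(2, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, grid * grid),
        )

    def forward(self, spec):                    # spec: (B, 2, freq, time)
        return self.net(spec).view(-1, self.grid, self.grid)

def choose_waypoint(field):
    """Waypoint-navigation stage: steer toward the loudest predicted cell."""
    idx = field.flatten(1).argmax(dim=1)
    return torch.stack((idx // field.shape[-1], idx % field.shape[-1]), dim=1)

afp = AcousticFieldPredictor()
field = afp(torch.randn(1, 2, 64, 32))          # toy binaural spectrogram
print(choose_waypoint(field))                   # e.g. tensor([[row, col]])
```

The appeal of such a split, presumably, is that only the audio-dependent prediction stage is exposed to the sim-to-real gap, so it can be adapted independently of the navigation policy.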
no code implementations • 24 Apr 2024 • Arjun Somayazulu, Sagnik Majumder, Changan Chen, Kristen Grauman
An environment acoustic model represents how sound is transformed by the physical characteristics of an indoor environment, for any given source/receiver location.
no code implementations • CVPR 2024 • Changan Chen, Kumar Ashutosh, Rohit Girdhar, David Harwath, Kristen Grauman
We propose a novel self-supervised embedding to learn how actions sound from narrated in-the-wild egocentric videos.
no code implementations • 11 Mar 2024 • Mi Luo, Zihui Xue, Alex Dimakis, Kristen Grauman
We investigate exocentric-to-egocentric cross-view translation, which aims to generate a first-person (egocentric) view of an actor based on a video recording that captures the actor from a third-person (exocentric) perspective.
no code implementations • CVPR 2024 • Kumar Ashutosh, Zihui Xue, Tushar Nagarajan, Kristen Grauman
We introduce the video detours problem for navigating instructional videos.
no code implementations • CVPR 2024 • Zihui Xue, Kumar Ashutosh, Kristen Grauman
Object State Changes (OSCs) are pivotal for video understanding.
2 code implementations • CVPR 2024 • Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mohaiminul Islam, Suyog Jain, Rawal Khirodkar, Devansh Kukreja, Kevin J Liang, Jia-Wei Liu, Sagnik Majumder, Yongsen Mao, Miguel Martin, Effrosyni Mavroudi, Tushar Nagarajan, Francesco Ragusa, Santhosh Kumar Ramakrishnan, Luigi Seminara, Arjun Somayazulu, Yale Song, Shan Su, Zihui Xue, Edward Zhang, Jinxu Zhang, Angela Castillo, Changan Chen, Xinzhu Fu, Ryosuke Furuta, Cristina Gonzalez, Prince Gupta, Jiabo Hu, Yifei HUANG, Yiming Huang, Weslie Khoo, Anush Kumar, Robert Kuo, Sach Lakhavani, Miao Liu, Mi Luo, Zhengyi Luo, Brighid Meredith, Austin Miller, Oluwatumininu Oguntola, Xiaqing Pan, Penny Peng, Shraman Pramanick, Merey Ramazanova, Fiona Ryan, Wei Shan, Kiran Somasundaram, Chenan Song, Audrey Southerland, Masatoshi Tateno, Huiyu Wang, Yuchen Wang, Takuma Yagi, Mingfei Yan, Xitong Yang, Zecheng Yu, Shengxin Cindy Zha, Chen Zhao, Ziwei Zhao, Zhifan Zhu, Jeff Zhuo, Pablo Arbelaez, Gedas Bertasius, David Crandall, Dima Damen, Jakob Engel, Giovanni Maria Farinella, Antonino Furnari, Bernard Ghanem, Judy Hoffman, C. V. Jawahar, Richard Newcombe, Hyun Soo Park, James M. Rehg, Yoichi Sato, Manolis Savva, Jianbo Shi, Mike Zheng Shou, Michael Wray
We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge.
no code implementations • CVPR 2024 • Sagnik Majumder, Ziad Al-Halah, Kristen Grauman
We propose a self-supervised method for learning representations based on spatial audio-visual correspondences in egocentric videos.
no code implementations • 28 Jun 2023 • Santhosh Kumar Ramakrishnan, Ziad Al-Halah, Kristen Grauman
The goal in episodic memory (EM) is to search a long egocentric video to answer a natural language query (e.g., "where did I leave my purse?").
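As a toy illustration of the EM search setup (names assumed; this is not the paper's model), one can score every fixed-length window of precomputed clip embeddings against a query embedding and return the best-matching span:

```python
import numpy as np

def best_window(clip_feats, query_feat, win=5):
    """clip_feats: (T, D) per-clip embeddings; query_feat: (D,) query embedding."""
    clip_feats = clip_feats / np.linalg.norm(clip_feats, axis=1, keepdims=True)
    query_feat = query_feat / np.linalg.norm(query_feat)
    scores = clip_feats @ query_feat                        # (T,) per-clip similarity
    window_scores = np.convolve(scores, np.ones(win) / win, mode="valid")
    start = int(window_scores.argmax())
    return start, start + win, float(window_scores[start])

feats = np.random.randn(100, 256)                           # toy 100-clip video
query = np.random.randn(256)
print(best_window(feats, query))                            # (start, end, score)
```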
no code implementations • 3 Feb 2023 • Zihui Xue, Yale Song, Kristen Grauman, Lorenzo Torresani
With no modification to the baseline architectures, our proposed approach achieves competitive performance on two Ego4D challenges, ranking 1st in the Talking to Me challenge and 3rd in the PNR keyframe localization challenge.
no code implementations • CVPR 2023 • Changan Chen, Alexander Richard, Roman Shapovalov, Vamsi Krishna Ithapu, Natalia Neverova, Kristen Grauman, Andrea Vedaldi
We introduce the novel-view acoustic synthesis (NVAS) task: given the sight and sound observed at a source viewpoint, can we synthesize the sound of that scene from an unseen target viewpoint?
no code implementations • 18 Jan 2023 • Megan M. Baker, Alexander New, Mario Aguilar-Simon, Ziad Al-Halah, Sébastien M. R. Arnold, Ese Ben-Iwhiwhu, Andrew P. Brna, Ethan Brooks, Ryan C. Brown, Zachary Daniels, Anurag Daram, Fabien Delattre, Ryan Dellana, Eric Eaton, Haotian Fu, Kristen Grauman, Jesse Hostetler, Shariq Iqbal, Cassandra Kent, Nicholas Ketz, Soheil Kolouri, George Konidaris, Dhireesha Kudithipudi, Erik Learned-Miller, Seungwon Lee, Michael L. Littman, Sandeep Madireddy, Jorge A. Mendez, Eric Q. Nguyen, Christine D. Piatko, Praveen K. Pilly, Aswin Raghavan, Abrar Rahman, Santhosh Kumar Ramakrishnan, Neale Ratzlaff, Andrea Soltoggio, Peter Stone, Indranil Sur, Zhipeng Tang, Saket Tiwari, Kyle Vedder, Felix Wang, Zifan Xu, Angel Yanguas-Gil, Harel Yedidsion, Shangqun Yu, Gautam K. Vallabha
Despite the advancement of machine learning techniques in recent years, state-of-the-art systems lack robustness to "real world" events, where the input distributions and tasks encountered by the deployed systems will not be limited to the original training context, and systems will instead need to adapt to novel distributions and tasks while deployed.
no code implementations • 5 Jan 2023 • Kumar Ashutosh, Rohit Girdhar, Lorenzo Torresani, Kristen Grauman
Narrated "how-to" videos have emerged as a promising data source for a wide range of learning problems, from learning visual representations to training robot policies.
1 code implementation • CVPR 2023 • Kumar Ashutosh, Rohit Girdhar, Lorenzo Torresani, Kristen Grauman
Video-language embeddings are a promising avenue for injecting semantics into visual representations, but existing methods capture only short-term associations between seconds-long video clips and their accompanying text.
Ranked #3 on Long Term Action Anticipation on Ego4D
no code implementations • CVPR 2023 • Sagnik Majumder, Hao Jiang, Pierre Moulon, Ethan Henderson, Paul Calamia, Kristen Grauman, Vamsi Krishna Ithapu
Can conversational videos captured from multiple egocentric viewpoints reveal the map of a scene in a cost-efficient way?
1 code implementation • CVPR 2023 • Santhosh Kumar Ramakrishnan, Ziad Al-Halah, Kristen Grauman
Searching long egocentric videos with natural language queries (NLQ) has compelling applications in augmented reality and robotics, where a fluid index into everything that a person (agent) has seen before could augment human memory and surface relevant information on demand.
no code implementations • CVPR 2023 • Zihui Xue, Yale Song, Kristen Grauman, Lorenzo Torresani
Different video understanding tasks are typically treated in isolation, and even with distinct types of curated data (e.g., classifying sports in one dataset, tracking animals in another).
1 code implementation • 8 Dec 2022 • Hanwen Jiang, Zhenyu Jiang, Kristen Grauman, Yuke Zhu
The reconstruction results under predicted poses are comparable to the ones using ground-truth poses.
no code implementations • 13 Oct 2022 • Matt Deitke, Dhruv Batra, Yonatan Bisk, Tommaso Campari, Angel X. Chang, Devendra Singh Chaplot, Changan Chen, Claudia Pérez D'Arpino, Kiana Ehsani, Ali Farhadi, Li Fei-Fei, Anthony Francis, Chuang Gan, Kristen Grauman, David Hall, Winson Han, Unnat Jain, Aniruddha Kembhavi, Jacob Krantz, Stefan Lee, Chengshu Li, Sagnik Majumder, Oleksandr Maksymets, Roberto Martín-Martín, Roozbeh Mottaghi, Sonia Raychaudhuri, Mike Roberts, Silvio Savarese, Manolis Savva, Mohit Shridhar, Niko Sünderhauf, Andrew Szot, Ben Talbot, Joshua B. Tenenbaum, Jesse Thomason, Alexander Toshev, Joanne Truong, Luca Weihs, Jiajun Wu
We present a retrospective on the state of Embodied AI research.
2 code implementations • 16 Jun 2022 • Changan Chen, Carl Schissler, Sanchit Garg, Philip Kobernik, Alexander Clegg, Paul Calamia, Dhruv Batra, Philip W Robinson, Kristen Grauman
We introduce SoundSpaces 2.0, a platform for on-the-fly geometry-based audio rendering for 3D environments.
Automatic Speech Recognition (ASR), +2
no code implementations • 8 Jun 2022 • Sagnik Majumder, Changan Chen, Ziad Al-Halah, Kristen Grauman
Room impulse response (RIR) functions capture how the surrounding physical environment transforms the sounds heard by a listener, with implications for various applications in AR, VR, and robotics.
1 code implementation • CVPR 2022 • Changan Chen, Ruohan Gao, Paul Calamia, Kristen Grauman
We introduce the visual acoustic matching task, in which an audio clip is transformed to sound like it was recorded in a target environment.
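The paper learns this transformation from an image of the target space; the sketch below only shows the underlying signal model the task is built on, namely that recording audio in a room amounts to convolving it with that room's impulse response.

```python
import numpy as np
from scipy.signal import fftconvolve

def apply_room(audio, rir):
    """Convolve dry audio with a (measured or predicted) room impulse response."""
    wet = fftconvolve(audio, rir)[: len(audio)]
    return wet / (np.abs(wet).max() + 1e-8)      # simple peak normalization

sr = 16000
dry = np.random.randn(sr)                        # 1 s of toy 'dry' audio
rir = np.exp(-np.linspace(0, 8, sr // 4)) * np.random.randn(sr // 4)  # toy decaying RIR
print(apply_room(dry, rir).shape)                # (16000,)
```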
no code implementations • CVPR 2022 • Ziad Al-Halah, Santhosh K. Ramakrishnan, Kristen Grauman
In reinforcement learning for visual navigation, it is common to develop a model for each new task, and train that model from scratch with task-specific interactions in 3D environments.
1 code implementation • 2 Feb 2022 • Sagnik Majumder, Kristen Grauman
We explore active audio-visual separation for dynamic sound sources, where an embodied agent moves intelligently in a 3D environment to continuously isolate the time-varying audio stream being emitted by an object of interest.
no code implementations • 1 Feb 2022 • Priyanka Mandikal, Kristen Grauman
Dexterous multi-fingered robotic hands have a formidable action space, yet their morphological similarity to the human hand holds immense potential to accelerate robot learning.
Deep Reinforcement Learning, Human-Object Interaction Detection, +1
1 code implementation • CVPR 2022 • Santhosh Kumar Ramakrishnan, Devendra Singh Chaplot, Ziad Al-Halah, Jitendra Malik, Kristen Grauman
We propose Potential functions for ObjectGoal Navigation with Interaction-free learning (PONI), a modular approach that disentangles the skills of 'where to look?'
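A minimal sketch of the modular 'where to look?' decision (names hypothetical; the real potentials come from a network trained without environment interaction): score map frontiers with a mix of unexplored-area and object-likelihood potentials and hand the argmax frontier to a local planner.

```python
import numpy as np

def select_frontier(frontiers, area_potential, object_potential, alpha=0.5):
    """frontiers: (N, 2) map cells; potentials: (N,) scores from a learned net."""
    potential = alpha * area_potential + (1 - alpha) * object_potential
    return frontiers[int(np.argmax(potential))]

frontiers = np.array([[3, 7], [12, 4], [9, 15]])
goal = select_frontier(frontiers,
                       area_potential=np.array([0.2, 0.7, 0.4]),
                       object_potential=np.array([0.6, 0.1, 0.9]))
print(goal)   # frontier to navigate to with a non-learned local planner
```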
no code implementations • 21 Nov 2021 • Rishabh Garg, Ruohan Gao, Kristen Grauman
Binaural audio provides human listeners with an immersive spatial sound experience, but most existing videos lack binaural audio recordings.
no code implementations • NeurIPS 2021 • Tushar Nagarajan, Kristen Grauman
For a given object, an activity-context prior represents the set of other compatible objects that are required for activities to succeed (e.g., a knife and cutting board brought together with a tomato are conducive to cutting).
8 code implementations • CVPR 2022 • Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, Sean Crane, Tien Do, Morrie Doulaty, Akshay Erapalli, Christoph Feichtenhofer, Adriano Fragomeni, Qichen Fu, Abrham Gebreselasie, Cristina Gonzalez, James Hillis, Xuhua Huang, Yifei HUANG, Wenqi Jia, Weslie Khoo, Jachym Kolar, Satwik Kottur, Anurag Kumar, Federico Landini, Chao Li, Yanghao Li, Zhenqiang Li, Karttikeya Mangalam, Raghava Modhugu, Jonathan Munro, Tullie Murrell, Takumi Nishiyasu, Will Price, Paola Ruiz Puentes, Merey Ramazanova, Leda Sari, Kiran Somasundaram, Audrey Southerland, Yusuke Sugano, Ruijie Tao, Minh Vo, Yuchen Wang, Xindi Wu, Takuma Yagi, Ziwei Zhao, Yunyi Zhu, Pablo Arbelaez, David Crandall, Dima Damen, Giovanni Maria Farinella, Christian Fuegen, Bernard Ghanem, Vamsi Krishna Ithapu, C. V. Jawahar, Hanbyul Joo, Kris Kitani, Haizhou Li, Richard Newcombe, Aude Oliva, Hyun Soo Park, James M. Rehg, Yoichi Sato, Jianbo Shi, Mike Zheng Shou, Antonio Torralba, Lorenzo Torresani, Mingfei Yan, Jitendra Malik
We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite.
no code implementations • ICLR 2022 • Santhosh Kumar Ramakrishnan, Tushar Nagarajan, Ziad Al-Halah, Kristen Grauman
We introduce environment predictive coding, a self-supervised approach to learn environment-level representations for embodied agents.
1 code implementation • 6 Jul 2021 • Sukjin Han, Eric H. Schulman, Kristen Grauman, Santhosh Ramakrishnan
We then study the causal effects of a merger on the merging firm's creative decisions using the constructed measures in a synthetic control method.
1 code implementation • 14 Jun 2021 • Changan Chen, Wei Sun, David Harwath, Kristen Grauman
We introduce Visually-Informed Dereverberation of Audio (VIDA), an end-to-end approach that learns to remove reverberation based on both the observed monaural sound and visual scene.
Automatic Speech Recognition
Automatic Speech Recognition (ASR)
+3
1 code implementation • ICCV 2021 • Rohit Girdhar, Kristen Grauman
We propose Anticipative Video Transformer (AVT), an end-to-end attention-based video modeling architecture that attends to the previously observed video in order to anticipate future actions.
Ranked #2 on Action Anticipation on EPIC-KITCHENS-100 (test), using extra training data
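A minimal sketch of the anticipation-by-causal-attention idea (a toy stand-in, not the released AVT code): a transformer encoder with a causal mask attends only to already-observed frames, and each position predicts the action that follows.

```python
import torch
import torch.nn as nn

class TinyAnticipator(nn.Module):
    def __init__(self, dim=256, n_actions=100):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, n_actions)

    def forward(self, frame_feats):              # (B, T, dim) per-frame features
        T = frame_feats.shape[1]
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.encoder(frame_feats, mask=causal)   # each step sees only the past
        return self.head(h)                      # (B, T, n_actions) next-action logits

model = TinyAnticipator()
print(model(torch.randn(2, 8, 256)).shape)       # torch.Size([2, 8, 100])
```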
no code implementations • 20 May 2021 • Miao Liu, Lingni Ma, Kiran Somasundaram, Yin Li, Kristen Grauman, James M. Rehg, Chao Li
Given a video captured from a first person perspective and the environment context of where the video is recorded, can we recognize what the person is doing and identify where the action occurs in the 3D space?
no code implementations • ICCV 2021 • Sagnik Majumder, Ziad Al-Halah, Kristen Grauman
We introduce the active audio-visual source separation problem, where an agent must move intelligently in order to better isolate the sounds coming from an object of interest in its environment.
1 code implementation • CVPR 2021 • Yanghao Li, Tushar Nagarajan, Bo Xiong, Kristen Grauman
We introduce an approach for pre-training egocentric video models using large-scale third-person video datasets.
no code implementations • ICCV 2021 • Bo Xiong, Haoqi Fan, Kristen Grauman, Christoph Feichtenhofer
We present a multiview pseudo-labeling approach to video learning, a novel framework that uses complementary views in the form of appearance and motion information for semi-supervised learning in video.
no code implementations • 3 Feb 2021 • Santhosh K. Ramakrishnan, Tushar Nagarajan, Ziad Al-Halah, Kristen Grauman
We introduce environment predictive coding, a self-supervised approach to learn environment-level representations for embodied agents.
no code implementations • ICCV 2021 • Wei-Lin Hsiao, Kristen Grauman
Fashion is intertwined with external cultural factors, but identifying these links remains a manual process limited to only the most salient phenomena.
1 code implementation • CVPR 2021 • Ruohan Gao, Kristen Grauman
Given a video, the goal is to extract the speech associated with a face in spite of simultaneous background sounds and/or other human speakers.
1 code implementation • ICCV 2021 • Senthil Purushwalkam, Sebastian Vicenc Amengual Gari, Vamsi Krishna Ithapu, Carl Schissler, Philip Robinson, Abhinav Gupta, Kristen Grauman
Given only a few glimpses of an environment, how much can we infer about its entire floorplan?
no code implementations • CVPR 2021 • Changan Chen, Ziad Al-Halah, Kristen Grauman
We propose a transformer-based model to tackle this new semantic AudioGoal task, incorporating an inferred goal descriptor that captures both spatial and semantic properties of the target.
no code implementations • 4 Dec 2020 • Utkarsh Mall, Kavita Bala, Tamara Berg, Kristen Grauman
The fashion sense (meaning the clothing styles people wear) in a geographical region can reveal information about that region.
no code implementations • 17 Nov 2020 • Ziad Al-Halah, Kristen Grauman
The discovered influence relationships reveal how both cities and brands exert and receive fashion influence for an array of visual styles inferred from the images.
1 code implementation • 3 Sep 2020 • Priyanka Mandikal, Kristen Grauman
Our key idea is to embed an object-centric visual affordance model within a deep reinforcement learning loop to learn grasping policies that favor the same object regions favored by people.
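One hedged reading of that loop in code (illustrative only; 'affordance_map' stands in for the learned human-grasp-affordance model): shape the RL reward so fingertip contacts that land on human-preferred object regions score higher.

```python
import numpy as np

def shaped_reward(grasp_success, contact_pixels, affordance_map, weight=0.5):
    """contact_pixels: (K, 2) image coordinates of fingertip contacts."""
    if len(contact_pixels) == 0:
        return float(grasp_success)
    rows, cols = contact_pixels[:, 0], contact_pixels[:, 1]
    affordance_bonus = affordance_map[rows, cols].mean()   # in [0, 1]
    return float(grasp_success) + weight * float(affordance_bonus)

aff = np.zeros((64, 64)); aff[20:40, 20:40] = 1.0          # toy 'graspable' region
contacts = np.array([[25, 30], [35, 22], [50, 50]])
print(shaped_reward(grasp_success=True, contact_pixels=contacts, affordance_map=aff))
```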
1 code implementation • ECCV 2020 • Santhosh K. Ramakrishnan, Ziad Al-Halah, Kristen Grauman
State-of-the-art navigation methods leverage a spatial memory to generalize to new environments, but their occupancy maps are limited to capturing the geometric structures directly observed by the agent.
Ranked #3 on Robot Navigation on Habitat 2020 Point Nav test-std
1 code implementation • ICLR 2021 • Changan Chen, Sagnik Majumder, Ziad Al-Halah, Ruohan Gao, Santhosh Kumar Ramakrishnan, Kristen Grauman
In audio-visual navigation, an agent intelligently travels through a complex, unmapped 3D environment using both sights and sounds to find a sound source (e.g., a phone ringing in another room).
1 code implementation • NeurIPS 2020 • Tushar Nagarajan, Kristen Grauman
We introduce a reinforcement learning approach for exploration for interaction, whereby an embodied agent autonomously discovers the affordance landscape of a new unmapped 3D environment (such as an unfamiliar kitchen).
no code implementations • 29 Jun 2020 • Nicole D. Payntar, Wei-Lin Hsiao, R. Alan Covey, Kristen Grauman
The popularity of media sharing platforms in recent decades has provided an abundance of open source data that remains underutilized by heritage scholars.
no code implementations • ECCV 2020 • Ruohan Gao, Changan Chen, Ziad Al-Halah, Carl Schissler, Kristen Grauman
Several animal species (e.g., bats, dolphins, and whales) and even visually impaired humans have the remarkable ability to perform echolocation: a biological sonar used to perceive spatial layout and locate objects in the world.
1 code implementation • CVPR 2020 • Ziad Al-Halah, Kristen Grauman
The evolution of clothing styles and their migration across the world is intriguing, yet difficult to describe quantitatively.
3 code implementations • 23 Jan 2020 • Fanyi Xiao, Yong Jae Lee, Kristen Grauman, Jitendra Malik, Christoph Feichtenhofer
We present Audiovisual SlowFast Networks, an architecture for integrated audiovisual perception.
1 code implementation • CVPR 2020 • Tushar Nagarajan, Yanghao Li, Christoph Feichtenhofer, Kristen Grauman
We introduce a model for environment affordances that is learned directly from egocentric video.
1 code implementation • CVPR 2020 • Krishna Kumar Singh, Dhruv Mahajan, Kristen Grauman, Yong Jae Lee, Matt Feiszli, Deepti Ghadiyaram
Our key idea is to decorrelate feature representations of a category from its co-occurring context.
1 code implementation • 7 Jan 2020 • Santhosh K. Ramakrishnan, Dinesh Jayaraman, Kristen Grauman
Embodied computer vision considers perception for robots in novel, unstructured environments.
2 code implementations • ECCV 2020 • Changan Chen, Unnat Jain, Carl Schissler, Sebastia Vicenc Amengual Gari, Ziad Al-Halah, Vamsi Krishna Ithapu, Philip Robinson, Kristen Grauman
Moving around in the world is naturally a multisensory experience, but today's embodied agents are deaf, restricted solely to their visual perception of the environment.
no code implementations • CVPR 2020 • Wei-Lin Hsiao, Kristen Grauman
Body shape plays an important role in determining what garments will best suit a given person, yet today's clothing recommendation methods take a "one shape fits all" approach.
1 code implementation • CVPR 2020 • Ruohan Gao, Tae-Hyun Oh, Kristen Grauman, Lorenzo Torresani
In the face of the video data deluge, today's expensive clip-level classifiers are increasingly impractical.
Ranked #8 on Action Recognition on ActivityNet
1 code implementation • Science Robotics 2019 • Santhosh K. Ramakrishnan, Dinesh Jayaraman, Kristen Grauman
Standard computer vision systems assume access to intelligently captured inputs (e.g., photos from a human photographer), yet autonomously capturing good observations is a major challenge in itself.
no code implementations • 3 Jun 2019 • Tushar Nagarajan, Christoph Feichtenhofer, Kristen Grauman
Learning how to interact with objects is an important step towards embodied visual intelligence, but existing techniques suffer from heavy supervision or sensing requirements.
3 code implementations • CVPR 2021 • Hui Wu, Yupeng Gao, Xiaoxiao Guo, Ziad Al-Halah, Steven Rennie, Kristen Grauman, Rogerio Feris
We provide a detailed analysis of the characteristics of the Fashion IQ data, and present a transformer-based user simulator and interactive image retriever that can seamlessly integrate visual attributes with image features, user feedback, and dialog history, leading to improved performance over the state of the art in dialog-based image retrieval.
no code implementations • 30 Apr 2019 • Danna Gurari, Yinan Zhao, Suyog Dutt Jain, Margrit Betke, Kristen Grauman
We propose a resource allocation framework for predicting how best to allocate a fixed budget of human annotation effort in order to collect higher quality segmentations for a given batch of images and automated methods.
1 code implementation • CVPR 2020 • Evonne Ng, Donglai Xiang, Hanbyul Joo, Kristen Grauman
The body pose of a person wearing a camera is of great interest for applications in augmented reality, healthcare, and robotics, yet much of the person's body is out of view for a typical wearable camera.
no code implementations • ICCV 2019 • Wei-Lin Hsiao, Isay Katsman, Chao-yuan Wu, Devi Parikh, Kristen Grauman
We introduce Fashion++, an approach that proposes minimal adjustments to a full-body clothing outfit that will have maximal impact on its fashionability.
3 code implementations • ICCV 2019 • Ruohan Gao, Kristen Grauman
Learning how objects sound from video is challenging, since they often heavily overlap in a single audio channel.
Ranked #1 on Audio Denoising on AV-Bench - Wooden Horse
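A sketch of the mask-based recipe this line of work builds on (module names illustrative, not the paper's): predict a ratio mask over the mixture spectrogram, conditioned on a visual feature of the sounding object, and apply it to isolate that object's audio.

```python
import torch
import torch.nn as nn

class ObjectConditionedSeparator(nn.Module):
    def __init__(self, vis_dim=512):
        super().__init__()
        self.audio = nn.Conv2d(1, 32, 3, padding=1)
        self.film = nn.Linear(vis_dim, 32)          # visual feature modulates channels
        self.out = nn.Conv2d(32, 1, 3, padding=1)

    def forward(self, mix_spec, vis_feat):          # (B, 1, F, T), (B, vis_dim)
        h = torch.relu(self.audio(mix_spec))
        h = h * self.film(vis_feat)[:, :, None, None]
        mask = torch.sigmoid(self.out(h))           # ratio mask in [0, 1]
        return mask * mix_spec                      # estimated object spectrogram

sep = ObjectConditionedSeparator()
est = sep(torch.rand(2, 1, 256, 64), torch.randn(2, 512))
print(est.shape)                                    # torch.Size([2, 1, 256, 64])
```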
no code implementations • 10 Apr 2019 • Antonino Furnari, Sebastiano Battiato, Kristen Grauman, Giovanni Maria Farinella
Although First Person Vision systems can sense the environment from the user's perspective, they are generally unable to predict the user's intentions and goals.
no code implementations • CVPR 2019 • Bo Xiong, Yannis Kalantidis, Deepti Ghadiyaram, Kristen Grauman
Highlight detection has the potential to significantly ease video browsing, but existing methods often suffer from expensive supervision requirements, where human viewers must manually identify highlights in training videos.
no code implementations • CVPR 2019 • Aron Yu, Kristen Grauman
Current wisdom suggests more labeled image data is always better, and obtaining labels is the bottleneck.
1 code implementation • CVPR 2019 • Zhenpei Yang, Jeffrey Z. Pan, Linjie Luo, Xiaowei Zhou, Kristen Grauman, Qi-Xing Huang
In particular, instead of only performing scene completion from each individual scan, our approach alternates between relative pose estimation and scene completion.
2 code implementations • CVPR 2019 • Ruohan Gao, Kristen Grauman
We devise a deep convolutional neural network that learns to decode the monaural (single-channel) soundtrack into its binaural counterpart by injecting visual information about object and scene configurations.
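A compact sketch of that decoding idea (toy module, not the released code): conditioned on visual features, predict the left-minus-right difference spectrogram, then recover each ear as (mono ± difference) / 2.

```python
import torch
import torch.nn as nn

class Mono2Binaural(nn.Module):
    def __init__(self, vis_dim=512):
        super().__init__()
        self.enc = nn.Conv2d(1, 32, 3, padding=1)
        self.vis = nn.Linear(vis_dim, 32)
        self.dec = nn.Conv2d(32, 1, 3, padding=1)

    def forward(self, mono_spec, vis_feat):         # (B, 1, F, T), (B, vis_dim)
        h = torch.relu(self.enc(mono_spec)) + self.vis(vis_feat)[:, :, None, None]
        diff = self.dec(h)                          # predicted L - R spectrogram
        left = (mono_spec + diff) / 2
        right = (mono_spec - diff) / 2
        return left, right

net = Mono2Binaural()
l, r = net(torch.randn(1, 1, 257, 64), torch.randn(1, 512))
print(l.shape, r.shape)
```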
1 code implementation • ICCV 2019 • Tushar Nagarajan, Christoph Feichtenhofer, Kristen Grauman
Learning how to interact with objects is an important step towards embodied visual intelligence, but existing techniques suffer from heavy supervision or sensing requirements.
Ranked #3 on Video-to-image Affordance Grounding on EPIC-Hotspot
no code implementations • CVPR 2019 • Yu-Chuan Su, Kristen Grauman
KTNs efficiently transfer convolution kernels from perspective images to the equirectangular projection of 360° images.
3 code implementations • CVPR 2019 • Yunhui Guo, Honghui Shi, Abhishek Kumar, Kristen Grauman, Tajana Rosing, Rogerio Feris
Transfer learning, which allows a source task to affect the inductive bias of the target task, is widely used in computer vision.
no code implementations • ECCV 2018 • Ke Zhang, Kristen Grauman, Fei Sha
The key idea is to complement the discriminative losses with another loss which measures if the predicted summary preserves the same information as in the original video.
no code implementations • ECCV 2018 • Bo Xiong, Kristen Grauman
360° panoramas are a rich medium, yet notoriously difficult to visualize in the 2D image plane.
no code implementations • 11 Aug 2018 • Bo Xiong, Suyog Dutt Jain, Kristen Grauman
We propose an end-to-end learning framework for segmenting generic objects in both images and videos.
no code implementations • ECCV 2018 • Santhosh K. Ramakrishnan, Kristen Grauman
We consider an active visual exploration scenario, where an agent must intelligently select its camera motions to efficiently reconstruct the full environment from only a limited set of narrow field-of-view glimpses.
no code implementations • CVPR 2018 • Yu-Chuan Su, Kristen Grauman
Standard video encoders developed for conventional narrow field-of-view video are widely applied to 360° video as well, with reasonable results.
2 code implementations • ECCV 2018 • Ruohan Gao, Rogerio Feris, Kristen Grauman
Our work is the first to learn audio source separation from large-scale "in the wild" videos containing multiple audio sources per video.
no code implementations • CVPR 2018 • Steven Chen, Kristen Grauman
We collect instance-level annotations of most noticeable differences, and build a model trained on relative attribute features that predicts prominent differences for unseen pairs.
no code implementations • 31 Mar 2018 • Bo Xiong, Kristen Grauman
360° panoramas are a rich medium, yet notoriously difficult to visualize in the 2D image plane.
1 code implementation • ECCV 2018 • Tushar Nagarajan, Kristen Grauman
In addition, we show that not only can our model recognize unseen compositions robustly in an open-world setting, it can also generalize to compositions where objects themselves were unseen during training.
Ranked #5 on Image Retrieval with Multi-Modal Query on MIT-States
1 code implementation • CVPR 2018 • Danna Gurari, Qing Li, Abigale J. Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, Jeffrey P. Bigham
The study of algorithms to automatically answer visual questions currently is motivated by visual question answering (VQA) datasets constructed in artificial VQA settings.
no code implementations • 12 Dec 2017 • Yu-Chuan Su, Kristen Grauman
Standard video encoders developed for conventional narrow field-of-view video are widely applied to 360° video as well, with reasonable results.
4 code implementations • CVPR 2018 • Ruohan Gao, Bo Xiong, Kristen Grauman
Second, we show the power of hallucinated flow for recognition, successfully transferring the learned motion into a standard two-stream network for activity recognition.
no code implementations • CVPR 2018 • Wei-Lin Hsiao, Kristen Grauman
To permit efficient subset selection over the space of all outfit combinations, we develop submodular objective functions capturing the key ingredients of visual compatibility, versatility, and user-specific preference.
1 code implementation • CVPR 2018 • Zuxuan Wu, Tushar Nagarajan, Abhishek Kumar, Steven Rennie, Larry S. Davis, Kristen Grauman, Rogerio Feris
Very deep convolutional neural networks offer excellent recognition results, yet their computational expense limits their impact for many real-world applications.
no code implementations • ECCV 2018 • Dinesh Jayaraman, Ruohan Gao, Kristen Grauman
We introduce an unsupervised feature learning approach that embeds 3D shape information into a single-view image representation.
2 code implementations • CVPR 2018 • Dinesh Jayaraman, Kristen Grauman
It is common to implicitly assume access to intelligently captured inputs (e.g., photos from a human photographer), yet autonomously capturing good observations is itself a major challenge.
no code implementations • NeurIPS 2017 • Yu-Chuan Su, Kristen Grauman
While 360° cameras offer tremendous new possibilities in vision, graphics, and augmented reality, the spherical images they produce make core feature extraction non-trivial.
1 code implementation • ICCV 2017 • Wei-Lin Hsiao, Kristen Grauman
Given a collection of unlabeled fashion images, our approach mines for the latent styles, then summarizes outfits by how they mix those styles.
no code implementations • CVPR 2017 • Yu-Chuan Su, Kristen Grauman
360° video requires human viewers to actively control "where" to look while watching the video.
no code implementations • CVPR 2017 • Suyog Dutt Jain, Bo Xiong, Kristen Grauman
Our method learns to combine appearance and motion information to produce pixel level segmentation masks for all prominent objects in videos.
no code implementations • ICCV 2017 • Ziad Al-Halah, Rainer Stiefelhagen, Kristen Grauman
What is the future of fashion?
no code implementations • 30 Apr 2017 • Danna Gurari, Kun He, Bo Xiong, Jianming Zhang, Mehrnoosh Sameki, Suyog Dutt Jain, Stan Sclaroff, Margrit Betke, Kristen Grauman
We propose the ambiguity problem for the foreground object segmentation task and motivate the importance of estimating and accounting for this ambiguity when designing vision systems.
no code implementations • 1 Mar 2017 • Yu-Chuan Su, Kristen Grauman
360° video requires human viewers to actively control "where" to look while watching the video.
no code implementations • 19 Jan 2017 • Suyog Dutt Jain, Bo Xiong, Kristen Grauman
We propose an end-to-end learning framework for generating foreground object segmentations.
no code implementations • ICCV 2017 • Aron Yu, Kristen Grauman
Distinguishing subtle differences in attributes is valuable, yet learning to make visual comparisons remains non-trivial.
no code implementations • 7 Dec 2016 • Yu-Chuan Su, Dinesh Jayaraman, Kristen Grauman
AutoCam leverages NFOV web video to discriminatively identify space-time "glimpses" of interest at each time instant, and then uses dynamic programming to select optimal human-like camera trajectories.
1 code implementation • ICCV 2017 • Ruohan Gao, Kristen Grauman
While machine learning approaches to image restoration offer great promise, current methods risk training models fixated on performing well only for image corruption of a particular level of difficulty, such as a certain level of noise or blur.
no code implementations • 1 Dec 2016 • Ruohan Gao, Dinesh Jayaraman, Kristen Grauman
Compared to existing temporal coherence methods, our idea has the advantage of lightweight preprocessing of the unlabeled video (no tracking required) while still being able to extract object-level regions from which to learn invariances.
no code implementations • 7 Nov 2016 • Adriana Kovashka, Olga Russakovsky, Li Fei-Fei, Kristen Grauman
Computer vision systems require large amounts of manually annotated data to properly learn challenging visual concepts.
no code implementations • 29 Aug 2016 • Danna Gurari, Kristen Grauman
Visual question answering (VQA) systems are emerging from a desire to empower users to ask any natural language question about visual content and receive a valid answer in response.
no code implementations • 11 Jul 2016 • Chao-Yeh Chen, Kristen Grauman
We show that this detection strategy permits an efficient branch-and-cut solution for the best-scoring (and possibly non-cubically shaped) portion of the video for a given activity classifier.
no code implementations • 5 Jul 2016 • Suyog Dutt Jain, Kristen Grauman
We present a novel form of interactive video object segmentation where a few clicks by the user helps the system produce a full spatio-temporal segmentation of the object of interest.
no code implementations • CVPR 2016 • Danna Gurari, Suyog Jain, Margrit Betke, Kristen Grauman
We propose a resource allocation framework for predicting how best to allocate a fixed budget of human annotation effort in order to collect higher quality segmentations for a given batch of images and automated methods.
no code implementations • CVPR 2016 • Suyog Dutt Jain, Kristen Grauman
We propose a semi-automatic method to obtain foreground object masks for a large set of related images.
1 code implementation • 26 May 2016 • Ke Zhang, Wei-Lun Chao, Fei Sha, Kristen Grauman
We propose a novel supervised learning technique for summarizing videos by automatically selecting keyframes or key subshots.
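A minimal sketch of supervised keyframe scoring (a toy scorer, not the paper's exact model): a bidirectional LSTM reads per-frame features and emits an importance score per frame; the top-scoring frames form the summary, and training would fit the scores to binary keyframe labels with, e.g., BCEWithLogitsLoss.

```python
import torch
import torch.nn as nn

class FrameScorer(nn.Module):
    def __init__(self, dim=1024, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        self.score = nn.Linear(2 * hidden, 1)

    def forward(self, feats):                       # (B, T, dim) per-frame features
        h, _ = self.lstm(feats)
        return self.score(h).squeeze(-1)            # (B, T) importance logits

scorer = FrameScorer()
logits = scorer(torch.randn(1, 120, 1024))          # toy 120-frame video
summary = logits[0].topk(k=12).indices.sort().values
print(summary)                                      # indices of selected keyframes
```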
no code implementations • 30 Apr 2016 • Dinesh Jayaraman, Kristen Grauman
To verify this hypothesis, we attempt to induce this capacity in our active recognition pipeline by simultaneously learning to forecast the effects of the agent's motions on its internal representation of the environment, conditional on all past views.
no code implementations • 17 Apr 2016 • Chao-Yeh Chen, Kristen Grauman
We propose to predict the "interactee" in novel images, that is, to localize the object of a person's action.
no code implementations • CVPR 2017 • Hao Jiang, Kristen Grauman
In addition, we demonstrate its impact on a proxemics recognition task, which demands a precise representation of "whose body part is where" in crowded images.
no code implementations • 4 Apr 2016 • Yu-Chuan Su, Kristen Grauman
In a wearable camera video, we see what the camera wearer sees.
no code implementations • 1 Apr 2016 • Yu-Chuan Su, Kristen Grauman
Current approaches for activity recognition often ignore constraints on computational resources: 1) they rely on extensive feature computation to obtain rich descriptors on all frames, and 2) they assume batch-mode access to the entire test video at once.
no code implementations • CVPR 2017 • Hao Jiang, Kristen Grauman
We propose to infer the "invisible pose" of a person behind the egocentric camera.
no code implementations • CVPR 2016 • Ke Zhang, Wei-Lun Chao, Fei Sha, Kristen Grauman
Video summarization has unprecedented importance to help us digest, browse, and search today's ever-growing video collections.
no code implementations • ICCV 2015 • Aron Yu, Kristen Grauman
We develop a Bayesian local learning strategy to infer when images are indistinguishable for a given attribute.
no code implementations • CVPR 2016 • Dinesh Jayaraman, Kristen Grauman
While this standard approach captures the fact that high-level visual signals change slowly over time, it fails to capture *how* the visual content changes.
no code implementations • 18 May 2015 • Yong Jae Lee, Kristen Grauman
Our results on two egocentric video datasets show the method's promise relative to existing techniques for saliency and summarization.
no code implementations • 15 May 2015 • Adriana Kovashka, Kristen Grauman
We propose to discover shades of attribute meaning.
no code implementations • 15 May 2015 • Adriana Kovashka, Devi Parikh, Kristen Grauman
We propose a novel mode of feedback for image search, where a user describes which properties of exemplar images should be adjusted in order to more closely match his/her mental model of the image sought.
1 code implementation • ICCV 2015 • Dinesh Jayaraman, Kristen Grauman
Understanding how images of objects and scenes behave in response to specific ego-motions is a crucial aspect of proper visual development, yet existing visual learning methods are conspicuously disconnected from the physical source of their images.
no code implementations • NeurIPS 2014 • Boqing Gong, Wei-Lun Chao, Kristen Grauman, Fei Sha
Video summarization is a challenging problem with great application potential.
no code implementations • NeurIPS 2014 • Aron Yu, Kristen Grauman
Lazy local learning methods train a classifier "on the fly" at test time, using only a subset of the training instances that are most relevant to the novel test example.
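The generic recipe the paper refines looks roughly like this (a sketch with a plain k-NN selector; the paper instead learns which instances to retrieve):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def lazy_local_predict(X_train, y_train, x_test, k=50):
    """Fit a classifier at test time on the k training points nearest the query."""
    nn_index = NearestNeighbors(n_neighbors=k).fit(X_train)
    _, idx = nn_index.kneighbors(x_test.reshape(1, -1))
    labels = y_train[idx[0]]
    if labels.min() == labels.max():                # all neighbors share one label
        return labels[0]
    local = LogisticRegression().fit(X_train[idx[0]], labels)
    return local.predict(x_test.reshape(1, -1))[0]

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16)); y = (X[:, 0] > 0).astype(int)
print(lazy_local_predict(X, y, rng.normal(size=16)))
```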
no code implementations • NeurIPS 2014 • Dinesh Jayaraman, Kristen Grauman
In principle, zero-shot learning makes it possible to train an object recognition model simply by specifying the category's attributes.
no code implementations • 6 Nov 2014 • Boqing Gong, Wei-Lun Chao, Kristen Grauman, Fei Sha
Extensive empirical studies validate our contributions, including applications on challenging document and video summarization, where flexibility in modeling the kernel matrix and balancing different errors is indispensable.
no code implementations • 15 Sep 2014 • Dinesh Jayaraman, Kristen Grauman
In principle, zero-shot learning makes it possible to train a recognition model simply by specifying the category's attributes.
no code implementations • CVPR 2014 • Lucy Liang, Kristen Grauman
It is useful to automatically compare images based on their visual properties, to predict which image is brighter, more feminine, more blurry, etc.
no code implementations • CVPR 2014 • Chao-Yeh Chen, Kristen Grauman
We pose unseen view synthesis as a probabilistic tensor completion problem.
no code implementations • CVPR 2014 • Dinesh Jayaraman, Fei Sha, Kristen Grauman
Existing methods to learn visual attributes are prone to learning the wrong thing, namely properties that are correlated with the attribute of interest among training samples.
no code implementations • CVPR 2014 • Aron Yu, Kristen Grauman
Given two images, we want to predict which exhibits a particular visual attribute more than the other, even when the two images are quite similar.
no code implementations • CVPR 2014 • Chao-Yeh Chen, Kristen Grauman
The appearance of an attribute can vary considerably from class to class (e.g., a "fluffy" dog vs. a "fluffy" towel), making standard class-independent attribute models break down.
no code implementations • NeurIPS 2013 • Boqing Gong, Kristen Grauman, Fei Sha
By maximum distinctiveness, we require the underlying distributions of the identified domains to be different from each other; by maximum learnability, we ensure that a strong discriminative model can be learned from the domain.
no code implementations • CVPR 2013 • Jaechul Kim, Ce Liu, Fei Sha, Kristen Grauman
We introduce a fast deformable spatial pyramid (DSP) matching algorithm for computing dense pixel correspondences.
no code implementations • CVPR 2013 • Zheng Lu, Kristen Grauman
We present a video summarization approach that discovers the story of an egocentric video.
no code implementations • CVPR 2013 • Chao-Yeh Chen, Kristen Grauman
We propose an approach to learn action categories from static images that leverages prior observations of generic human motion to augment its training process.
no code implementations • NeurIPS 2012 • Sung Ju Hwang, Kristen Grauman, Fei Sha
When learning features for complex visual recognition problems, labeled image exemplars alone can be insufficient.
no code implementations • NeurIPS 2011 • Kristen Grauman, Fei Sha, Sung Ju Hwang
Given a hierarchical taxonomy that captures semantic similarity between the objects, we learn a corresponding tree of metrics (ToM).
no code implementations • NeurIPS 2010 • Prateek Jain, Sudheendra Vijayanarasimhan, Kristen Grauman
Our first approach maps the data to two-bit binary keys that are locality-sensitive for the angle between the hyperplane normal and a database point.
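A sketch of the two-bit key idea (sign conventions simplified; consult the paper for the exact locality-sensitive family): database points hash by the signs of two random projections, while a hyperplane query hashes its normal with one sign flipped, so points nearly perpendicular to the normal, i.e., near the hyperplane, collide most often.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
u, v = rng.normal(size=d), rng.normal(size=d)   # shared random projections

def key_point(x):                               # two-bit key for a database point
    return (np.sign(u @ x), np.sign(v @ x))

def key_hyperplane(w):                          # two-bit key for a hyperplane normal
    return (np.sign(u @ w), -np.sign(v @ w))    # second bit flipped

w = rng.normal(size=d)                          # query hyperplane normal
x = rng.normal(size=d)                          # candidate database point
print(key_point(x) == key_hyperplane(w))        # same bucket => candidate near the hyperplane
```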
no code implementations • NeurIPS 2008 • Prateek Jain, Brian Kulis, Inderjit S. Dhillon, Kristen Grauman
Metric learning algorithms can provide useful distance functions for a variety of domains, and recent work has shown good accuracy for problems where the learner can access all distance constraints at once.
no code implementations • NeurIPS 2008 • Sudheendra Vijayanarasimhan, Kristen Grauman
We introduce a framework for actively learning visual categories from a mixture of weakly and strongly labeled image examples.