no code implementations • CVPR 2023 • Chun-Hsiao Yeh, Bryan Russell, Josef Sivic, Fabian Caba Heilbron, Simon Jenni
Large-scale vision-language models (VLMs) have shown impressive results for language-guided search applications.
no code implementations • CVPR 2023 • Daniel McKee, Justin Salamon, Josef Sivic, Bryan Russell
A key challenge of this problem setting is that existing music video datasets provide the needed (video, music) training pairs, but lack text descriptions of the music.
1 code implementation • CVPR 2023 • Yuexi Du, Ziyang Chen, Justin Salamon, Bryan Russell, Andrew Owens
Second, we propose a model for generating a soundtrack for a silent input video, given a user-supplied example that specifies what the video should "sound like".
no code implementations • CVPR 2023 • Reuben Tan, Arijit Ray, Andrea Burns, Bryan A. Plummer, Justin Salamon, Oriol Nieto, Bryan Russell, Kate Saenko
We propose a self-supervised approach for learning to perform audio source separation in videos based on natural language queries, using only unlabeled video and audio pairs as training data.
1 code implementation • 24 Oct 2022 • Hang Gao, RuiLong Li, Shubham Tulsiani, Bryan Russell, Angjoo Kanazawa
We study the recent progress on dynamic view synthesis (DVS) from monocular video.
no code implementations • CVPR 2022 • Didac Suris, Carl Vondrick, Bryan Russell, Justin Salamon
In order to capture the high-level concepts that are required to solve the task, we propose modeling the long-term temporal context of both the video and the music signals, using Transformer networks for each modality.
no code implementations • CVPR 2022 • Zhongzheng Ren, Aseem Agarwala, Bryan Russell, Alexander G. Schwing, Oliver Wang
We introduce an approach for selecting objects in neural volumetric 3D representations, such as multi-plane images (MPI) and neural radiance fields (NeRF).
1 code implementation • CVPR 2022 • Georgy Ponimatkin, Yann Labbé, Bryan Russell, Mathieu Aubry, Josef Sivic
We introduce FocalPose, a neural render-and-compare method for jointly estimating the camera-object 6D pose and camera focal length given a single RGB input image depicting a known object.
no code implementations • NeurIPS 2021 • Reuben Tan, Bryan Plummer, Kate Saenko, Hailin Jin, Bryan Russell
Key to our approach is the ability to learn to spatially localize interactions with self-supervision on a large corpus of videos with accompanying transcribed narrations.
1 code implementation • 12 Nov 2021 • Alex Andonian, Taesung Park, Bryan Russell, Phillip Isola, Jun-Yan Zhu, Richard Zhang
Training supervised image synthesis models requires a critic to compare two images: the ground truth and the result.
no code implementations • 20 Oct 2021 • Reuben Tan, Bryan A. Plummer, Kate Saenko, Hailin Jin, Bryan Russell
Key to our approach is the ability to learn to spatially localize interactions with self-supervision on a large corpus of videos with accompanying transcribed narrations.
1 code implementation • ICCV 2021 • Shuang Li, Yilun Du, Antonio Torralba, Josef Sivic, Bryan Russell
Our task poses unique challenges as a system does not know what types of human-object interactions are present in a video or the actual spatiotemporal location of the human and the object.
Human-Object Interaction Detection
Weakly-supervised Learning
1 code implementation • ICCV 2021 • Steven Liu, Xiuming Zhang, Zhoutong Zhang, Richard Zhang, Jun-Yan Zhu, Bryan Russell
In this paper, we explore enabling user editing of a category-level NeRF, also known as a conditional radiance field, trained on a shape category.
Ranked #1 on Novel View Synthesis on PhotoShape
1 code implementation • ECCV 2020 • Davis Rempe, Leonidas J. Guibas, Aaron Hertzmann, Bryan Russell, Ruben Villegas, Jimei Yang
Existing deep models predict 2D and 3D kinematic poses from video that are approximately accurate, but contain visible errors that violate physical constraints, such as feet penetrating the ground and bodies leaning at extreme angles.
no code implementations • CVPR 2020 • Karren Yang, Bryan Russell, Justin Salamon
Self-supervised audio-visual learning aims to capture useful representations of video by leveraging correspondences between visual and audio inputs.
no code implementations • ICCV 2019 • Christian Zimmermann, Duygu Ceylan, Jimei Yang, Bryan Russell, Max Argus, Thomas Brox
We show that methods trained on our dataset consistently perform well when tested on other datasets.
Ranked #8 on 3D Hand Pose Estimation on FreiHAND
no code implementations • ICCV 2019 • Carlo Innamorati, Bryan Russell, Danny M. Kaufman, Niloy J. Mitra
We introduce a method to generate videos of dynamic virtual objects plausibly interacting via collisions with a still image's environment.
2 code implementations • 30 Jul 2019 • Victor Escorcia, Mattia Soldan, Josef Sivic, Bernard Ghanem, Bryan Russell
We evaluate our approach on two recently proposed datasets for temporal localization of moments in video with natural language (DiDeMo and Charades-STA) extended to our video corpus moment retrieval setting.
no code implementations • ICLR 2019 • Senthil Purushwalkam, Abhinav Gupta, Danny M. Kaufman, Bryan Russell
To achieve our results, we introduce the Bounce Dataset comprising 5K RGB-D videos of bouncing trajectories of a foam ball to probe surfaces of varying shapes and materials in everyday scenes including homes and offices.
no code implementations • 28 Feb 2019 • Bernd Huber, Hijung Valentina Shin, Bryan Russell, Oliver Wang, Gautham J. Mysore
In video production, inserting B-roll is a widely used technique to enrich the story and make a video more engaging.
1 code implementation • EMNLP 2018 • Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, Bryan Russell
To benchmark whether our model, and other recent video localization models, can effectively reason about temporal language, we collect the novel TEMPOral reasoning in video and language (TEMPO) dataset.
2 code implementations • ECCV 2018 • Gül Varol, Duygu Ceylan, Bryan Russell, Jimei Yang, Ersin Yumer, Ivan Laptev, Cordelia Schmid
Human shape estimation is an important task for video editing, animation, and the fashion industry.
Ranked #3 on 3D Human Pose Estimation on Surreal (using extra training data)
1 code implementation • 8 Aug 2017 • Zoya Bylinskii, Nam Wook Kim, Peter O'Donovan, Sami Alsheikh, Spandan Madan, Hanspeter Pfister, Fredo Durand, Bryan Russell, Aaron Hertzmann
Our models are neural networks trained on human clicks and importance annotations on hundreds of designs.
2 code implementations • ICCV 2017 • Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, Bryan Russell
A key obstacle to training our MCN model is that current video datasets do not include pairs of localized video segments and referring expressions, or text descriptions which uniquely identify a corresponding moment.
no code implementations • CVPR 2017 • Rohit Girdhar, Deva Ramanan, Abhinav Gupta, Josef Sivic, Bryan Russell
In this work, we introduce a new video representation for action classification that aggregates local convolutional features across the entire spatio-temporal extent of the video.
1 code implementation • 21 Feb 2017 • Aayush Bansal, Xinlei Chen, Bryan Russell, Abhinav Gupta, Deva Ramanan
We explore design principles for general pixel-level prediction problems, from low-level edge detection to mid-level surface normal estimation to high-level semantic segmentation.
no code implementations • NeurIPS 2016 • Peng Wang, Xiaohui Shen, Bryan Russell, Scott Cohen, Brian Price, Alan L. Yuille
This paper introduces an approach to regularize 2.5D surface normal and depth predictions at each pixel given a single input image.
no code implementations • 21 Sep 2016 • Aayush Bansal, Xinlei Chen, Bryan Russell, Abhinav Gupta, Deva Ramanan
We explore architectures for general pixel-level prediction problems, from low-level edge detection to mid-level surface normal estimation to high-level semantic segmentation.
no code implementations • CVPR 2016 • Aayush Bansal, Bryan Russell, Abhinav Gupta
We introduce an approach that leverages surface normal predictions, along with appearance cues, to retrieve 3D models for objects depicted in 2D still images from a large CAD object library.
no code implementations • CVPR 2016 • Francisco Massa, Bryan Russell, Mathieu Aubry
This paper presents an end-to-end convolutional neural network (CNN) for 2D-3D exemplar detection.
no code implementations • ICCV 2015 • Mathieu Aubry, Bryan Russell
The rendered images are presented to a trained CNN and responses for different layers are studied with respect to the input scene factors.
no code implementations • NeurIPS 2012 • Jianxiong Xiao, Bryan Russell, Antonio Torralba
In this paper we seek to detect rectangular cuboids and localize their corners in uncalibrated single-view images depicting everyday scenes.
no code implementations • NeurIPS 2009 • Bryan Russell, Alyosha Efros, Josef Sivic, Bill Freeman, Andrew Zisserman
In contrast to recent work in semantic alignment of scenes, we allow an input image to be explained by partial matches of similar scenes.