no code implementations • 26 Nov 2024 • Ziyang Chen, Prem Seetharaman, Bryan Russell, Oriol Nieto, David Bourgin, Andrew Owens, Justin Salamon
MultiFoley also allows users to choose reference audio from sound effects (SFX) libraries or partial videos for conditioning.
no code implementations • 19 Nov 2024 • Alejandro Pardo, Jui-Hsien Wang, Bernard Ghanem, Josef Sivic, Bryan Russell, Fabian Caba Heilbron
The objective of this work is to manipulate visual timelines (e.g., a video) through natural language instructions, making complex timeline editing tasks accessible to non-expert or potentially even disabled users.
no code implementations • 6 May 2024 • Jiacheng Cheng, Hijung Valentina Shin, Nuno Vasconcelos, Bryan Russell, Fabian Caba Heilbron
In this work, we consider the task of paraphrased text-to-image retrieval where a model aims to return similar results given a pair of paraphrased queries.
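A minimal sketch of how paraphrase robustness could be probed in a retrieval setting: run two paraphrased queries through a (stand-in) dual-encoder embedding space and measure how much their top-ranked image lists overlap. The encoder, dimensions, and overlap metric below are illustrative assumptions, not the paper's model.

```python
import torch
import torch.nn.functional as F

def retrieve(query_emb, image_embs, k=10):
    """Return indices of the top-k images by cosine similarity."""
    sims = F.cosine_similarity(query_emb.unsqueeze(0), image_embs, dim=-1)
    return sims.topk(k).indices

# Toy embeddings standing in for a text/image dual encoder (hypothetical dims).
image_embs = F.normalize(torch.randn(1000, 512), dim=-1)
query_a = F.normalize(torch.randn(512), dim=-1)                      # e.g. "a dog on the beach"
query_b = F.normalize(query_a + 0.1 * torch.randn(512), dim=-1)      # a paraphrase of the same query

top_a = set(retrieve(query_a, image_embs).tolist())
top_b = set(retrieve(query_b, image_embs).tolist())
# Overlap of the two result lists is one simple proxy for paraphrase consistency.
print("top-10 overlap:", len(top_a & top_b) / 10)
```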
no code implementations • CVPR 2024 • Reuben Tan, Ximeng Sun, Ping Hu, Jui-Hsien Wang, Hanieh Deilamsalehy, Bryan A. Plummer, Bryan Russell, Kate Saenko
Long video question answering is a challenging task that involves recognizing short-term activities and reasoning about their fine-grained relationships.
no code implementations • 7 Dec 2023 • Joanna Materzynska, Josef Sivic, Eli Shechtman, Antonio Torralba, Richard Zhang, Bryan Russell
To avoid overfitting to the new custom motion, we introduce an approach for regularization over videos.
1 code implementation • 15 Nov 2023 • Martin Cífka, Georgy Ponimatkin, Yann Labbé, Bryan Russell, Mathieu Aubry, Vladimir Petrik, Josef Sivic
We introduce FocalPose++, a neural render-and-compare method for jointly estimating the camera-object 6D pose and camera focal length given a single RGB input image depicting a known object.
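A heavily simplified sketch of the render-and-compare idea behind this line of work: iteratively refine the object translation and the focal length so that projected model points better match the observed 2D evidence. The pinhole projection, gradient-based update, and log-focal parameterization here are illustrative assumptions, not the FocalPose++ network.

```python
import torch

def project(points_3d, t, f):
    """Pinhole projection of object points translated by t, with focal length f."""
    cam = points_3d + t                       # camera-frame coordinates
    return f * cam[:, :2] / cam[:, 2:3]       # perspective divide

# Known object points and synthetic "observed" 2D projections.
pts = torch.randn(100, 3) * 0.1 + torch.tensor([0.0, 0.0, 2.0])
obs = project(pts, torch.tensor([0.05, -0.02, 0.3]), torch.tensor(600.0))

# "Compare" step: refine initial estimates by minimizing 2D reprojection error.
t = torch.zeros(3, requires_grad=True)
log_f = torch.tensor(500.0).log().requires_grad_()
opt = torch.optim.Adam([t, log_f], lr=1e-2)
for _ in range(500):
    opt.zero_grad()
    loss = (project(pts, t, log_f.exp()) - obs).pow(2).mean()
    loss.backward()
    opt.step()
print("estimated focal:", log_f.exp().item(), "translation:", t.tolist())
```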
1 code implementation • CVPR 2023 • Chun-Hsiao Yeh, Bryan Russell, Josef Sivic, Fabian Caba Heilbron, Simon Jenni
Large-scale vision-language models (VLM) have shown impressive results for language-guided search applications.
no code implementations • CVPR 2023 • Daniel McKee, Justin Salamon, Josef Sivic, Bryan Russell
A key challenge of this problem setting is that existing music video datasets provide the needed (video, music) training pairs, but lack text descriptions of the music.
1 code implementation • CVPR 2023 • Yuexi Du, Ziyang Chen, Justin Salamon, Bryan Russell, Andrew Owens
Second, we propose a model for generating a soundtrack for a silent input video, given a user-supplied example that specifies what the video should "sound like".
no code implementations • CVPR 2023 • Reuben Tan, Arijit Ray, Andrea Burns, Bryan A. Plummer, Justin Salamon, Oriol Nieto, Bryan Russell, Kate Saenko
We propose a self-supervised approach for learning to perform audio source separation in videos based on natural language queries, using only unlabeled video and audio pairs as training data.
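One common way to set up language-conditioned separation, sketched below under strong simplifications: predict a mask over the mixture spectrogram conditioned on a query embedding, with a mix-and-separate style training signal built from unlabeled pairs. The network, conditioning scheme, and loss are stand-ins, not the paper's architecture.

```python
import torch
import torch.nn as nn

class QueryMaskNet(nn.Module):
    """Predict a separation mask over a mixture spectrogram, conditioned on a
    query embedding (a stand-in for the paper's model)."""
    def __init__(self, freq_bins=257, text_dim=256):
        super().__init__()
        self.audio = nn.Conv1d(freq_bins, 256, kernel_size=3, padding=1)
        self.film = nn.Linear(text_dim, 256)           # query conditioning
        self.head = nn.Conv1d(256, freq_bins, kernel_size=3, padding=1)

    def forward(self, mix_spec, query_emb):            # (B, F, T), (B, text_dim)
        h = torch.relu(self.audio(mix_spec))
        h = h * torch.sigmoid(self.film(query_emb)).unsqueeze(-1)
        return torch.sigmoid(self.head(h))              # mask in [0, 1]

# Mix-and-separate style signal on unlabeled pairs: mix two clips, then ask the
# model to recover one of them given its query embedding.
spec_a, spec_b = torch.rand(4, 257, 100), torch.rand(4, 257, 100)
query_a = torch.randn(4, 256)
mask = QueryMaskNet()(spec_a + spec_b, query_a)
loss = ((mask * (spec_a + spec_b)) - spec_a).pow(2).mean()
```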
1 code implementation • 24 Oct 2022 • Hang Gao, RuiLong Li, Shubham Tulsiani, Bryan Russell, Angjoo Kanazawa
We study the recent progress on dynamic view synthesis (DVS) from monocular video.
no code implementations • CVPR 2022 • Didac Suris, Carl Vondrick, Bryan Russell, Justin Salamon
In order to capture the high-level concepts that are required to solve the task, we propose modeling the long-term temporal context of both the video and the music signals, using Transformer networks for each modality.
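A rough sketch of the two-tower idea described here: a separate Transformer encoder per modality, pooled into clip-level embeddings that are aligned with a contrastive matching loss. Dimensions, pooling, and the temperature are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityEncoder(nn.Module):
    """Transformer over a sequence of per-frame (or per-audio-chunk) features."""
    def __init__(self, dim=256, layers=4, heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)

    def forward(self, x):                      # x: (batch, time, dim)
        return self.encoder(x).mean(dim=1)     # temporal average pooling

video_enc, music_enc = ModalityEncoder(), ModalityEncoder()
video_feats = torch.randn(8, 64, 256)          # 64 video frames per clip
music_feats = torch.randn(8, 128, 256)         # 128 audio chunks per clip

v = F.normalize(video_enc(video_feats), dim=-1)
m = F.normalize(music_enc(music_feats), dim=-1)
logits = v @ m.t() / 0.07                      # similarity of every (video, music) pair
loss = F.cross_entropy(logits, torch.arange(8))  # InfoNCE-style matching loss
```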
no code implementations • CVPR 2022 • Zhongzheng Ren, Aseem Agarwala, Bryan Russell, Alexander G. Schwing, Oliver Wang
We introduce an approach for selecting objects in neural volumetric 3D representations, such as multi-plane images (MPI) and neural radiance fields (NeRF).
2 code implementations • CVPR 2022 • Georgy Ponimatkin, Yann Labbé, Bryan Russell, Mathieu Aubry, Josef Sivic
We introduce FocalPose, a neural render-and-compare method for jointly estimating the camera-object 6D pose and camera focal length given a single RGB input image depicting a known object.
no code implementations • NeurIPS 2021 • Reuben Tan, Bryan Plummer, Kate Saenko, Hailin Jin, Bryan Russell
Key to our approach is the ability to learn to spatially localize interactions with self-supervision on a large corpus of videos with accompanying transcribed narrations.
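A compact sketch of how narration-supervised spatial grounding can work in principle: attend over frame regions with the narration embedding, and train the attention-pooled clip embedding to match its own narration rather than others in the batch. Encoders, dimensions, and the temperature below are placeholders, not the paper's model.

```python
import torch
import torch.nn.functional as F

# Hypothetical features: region descriptors per clip and one narration embedding per clip.
region_feats = torch.randn(8, 49, 256)         # batch of 8 clips, 7x7 regions each
narration_emb = F.normalize(torch.randn(8, 256), dim=-1)

attn = torch.einsum('brd,bd->br', F.normalize(region_feats, dim=-1), narration_emb)
weights = F.softmax(attn / 0.07, dim=-1)       # where in the frame the narration "looks"
clip_emb = F.normalize((weights.unsqueeze(-1) * region_feats).sum(1), dim=-1)

# Self-supervision: each clip embedding should match its own narration,
# not the narrations of other clips in the batch (InfoNCE over the batch).
logits = clip_emb @ narration_emb.t() / 0.07
loss = F.cross_entropy(logits, torch.arange(8))
# At test time, `weights` reshaped to 7x7 gives a spatial localization map.
```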
1 code implementation • 12 Nov 2021 • Alex Andonian, Taesung Park, Bryan Russell, Phillip Isola, Jun-Yan Zhu, Richard Zhang
Training supervised image synthesis models requires a critic to compare two images: the ground truth to the result.
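The kind of critic discussed here can be sketched as a patch-wise contrastive comparison: each spatial feature of the synthesized image should match the co-located feature of the ground truth rather than features from other locations. The toy feature extractor and loss below are assumptions for illustration, not the paper's critic.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in feature extractor; a learned or pretrained encoder would be used in practice.
feat = nn.Sequential(nn.Conv2d(3, 64, 3, 2, 1), nn.ReLU(), nn.Conv2d(64, 128, 3, 2, 1))

def patch_contrastive_loss(pred, target, tau=0.07):
    """Each location of pred should match the co-located target feature (positive)
    rather than features at other locations (negatives)."""
    fp = F.normalize(feat(pred).flatten(2), dim=1)       # (B, C, HW)
    ft = F.normalize(feat(target).flatten(2), dim=1)
    logits = torch.einsum('bci,bcj->bij', fp, ft) / tau  # all location pairs
    labels = torch.arange(logits.size(-1)).expand(logits.size(0), -1)
    return F.cross_entropy(logits.flatten(0, 1), labels.flatten())

pred, target = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
loss = patch_contrastive_loss(pred, target)
```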
no code implementations • 20 Oct 2021 • Reuben Tan, Bryan A. Plummer, Kate Saenko, Hailin Jin, Bryan Russell
Key to our approach is the ability to learn to spatially localize interactions with self-supervision on a large corpus of videos with accompanying transcribed narrations.
1 code implementation • ICCV 2021 • Shuang Li, Yilun Du, Antonio Torralba, Josef Sivic, Bryan Russell
Our task poses unique challenges as a system does not know what types of human-object interactions are present in a video or the actual spatiotemporal location of the human and the object.
1 code implementation • ICCV 2021 • Steven Liu, Xiuming Zhang, Zhoutong Zhang, Richard Zhang, Jun-Yan Zhu, Bryan Russell
In this paper, we explore enabling user editing of a category-level NeRF - also known as a conditional radiance field - trained on a shape category.
Ranked #1 on Novel View Synthesis on PhotoShape
1 code implementation • ECCV 2020 • Davis Rempe, Leonidas J. Guibas, Aaron Hertzmann, Bryan Russell, Ruben Villegas, Jimei Yang
Existing deep models predict 2D and 3D kinematic poses from video that are approximately accurate, but contain visible errors that violate physical constraints, such as feet penetrating the ground and bodies leaning at extreme angles.
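The constraint violations mentioned here are easy to detect in isolation; a toy check for ground penetration and extreme torso lean is sketched below. The thresholds and joint layout are hypothetical, and this is only a diagnostic, not the paper's physics-based refinement.

```python
import numpy as np

# Hypothetical joint indices for this sketch (y axis points up).
PELVIS, HEAD = 0, 1

def plausibility_flags(joints, ground_y=0.0, max_lean_deg=45.0):
    """joints: (J, 3) array of 3D joint positions. Flags two simple violations:
    feet below the ground plane and a torso leaning past a threshold angle."""
    foot_penetration = joints[:, 1].min() < ground_y - 0.02   # 2 cm tolerance
    torso = joints[HEAD] - joints[PELVIS]
    lean = np.degrees(np.arccos(torso[1] / (np.linalg.norm(torso) + 1e-8)))
    return {"feet_below_ground": bool(foot_penetration),
            "extreme_lean": bool(lean > max_lean_deg)}

pose = np.array([[0.0, 0.9, 0.0],     # pelvis
                 [0.3, 1.5, 0.0],     # head (leaning forward)
                 [0.1, -0.05, 0.0]])  # one foot slightly below the ground
print(plausibility_flags(pose))
```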
no code implementations • CVPR 2020 • Karren Yang, Bryan Russell, Justin Salamon
Self-supervised audio-visual learning aims to capture useful representations of video by leveraging correspondences between visual and audio inputs.
no code implementations • ICCV 2019 • Christian Zimmermann, Duygu Ceylan, Jimei Yang, Bryan Russell, Max Argus, Thomas Brox
We show that methods trained on our dataset consistently perform well when tested on other datasets.
Ranked #24 on 3D Hand Pose Estimation on FreiHAND (PA-F@5mm metric)
no code implementations • ICCV 2019 • Carlo Innamorati, Bryan Russell, Danny M. Kaufman, Niloy J. Mitra
We introduce a method to generate videos of dynamic virtual objects plausibly interacting via collisions with a still image's environment.
2 code implementations • 30 Jul 2019 • Victor Escorcia, Mattia Soldan, Josef Sivic, Bernard Ghanem, Bryan Russell
We evaluate our approach on two recently proposed datasets for temporal localization of moments in video with natural language (DiDeMo and Charades-STA) extended to our video corpus moment retrieval setting.
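A minimal sketch of the corpus-level retrieval setting: score every candidate temporal segment across all videos against a language-query embedding and return the top-ranked moments. The embeddings and candidate enumeration below are placeholders, not the paper's model.

```python
import torch
import torch.nn.functional as F

def rank_moments(query_emb, moment_embs, moment_ids, k=5):
    """Rank candidate moments from the whole corpus by similarity to the query.
    moment_embs: (N, D) embeddings of temporal segments across all videos;
    moment_ids: list of (video_id, start_s, end_s) for each row."""
    sims = F.cosine_similarity(query_emb.unsqueeze(0), moment_embs, dim=-1)
    best = sims.topk(k).indices.tolist()
    return [(moment_ids[i], round(sims[i].item(), 3)) for i in best]

# Toy corpus: 3 videos x 4 candidate segments each (embeddings are random stand-ins).
ids = [(v, s, s + 5) for v in range(3) for s in range(0, 20, 5)]
embs = F.normalize(torch.randn(len(ids), 256), dim=-1)
query = F.normalize(torch.randn(256), dim=-1)    # e.g. "the dog jumps over the fence"
print(rank_moments(query, embs, ids))
```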
no code implementations • ICLR 2019 • Senthil Purushwalkam, Abhinav Gupta, Danny M. Kaufman, Bryan Russell
To achieve our results, we introduce the Bounce Dataset comprising 5K RGB-D videos of bouncing trajectories of a foam ball to probe surfaces of varying shapes and materials in everyday scenes including homes and offices.
no code implementations • 28 Feb 2019 • Bernd Huber, Hijung Valentina Shin, Bryan Russell, Oliver Wang, Gautham J. Mysore
In video production, inserting B-roll is a widely used technique to enrich the story and make a video more engaging.
1 code implementation • EMNLP 2018 • Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, Bryan Russell
To benchmark whether our model, and other recent video localization models, can effectively reason about temporal language, we collect the novel TEMPOral reasoning in video and language (TEMPO) dataset.
2 code implementations • ECCV 2018 • Gül Varol, Duygu Ceylan, Bryan Russell, Jimei Yang, Ersin Yumer, Ivan Laptev, Cordelia Schmid
Human shape estimation is an important task for video editing, animation, and the fashion industry.
Ranked #3 on 3D Human Pose Estimation on Surreal (using extra training data)
1 code implementation • 8 Aug 2017 • Zoya Bylinskii, Nam Wook Kim, Peter O'Donovan, Sami Alsheikh, Spandan Madan, Hanspeter Pfister, Fredo Durand, Bryan Russell, Aaron Hertzmann
Our models are neural networks trained on human clicks and importance annotations on hundreds of designs.
2 code implementations • ICCV 2017 • Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, Bryan Russell
A key obstacle to training our MCN model is that current video datasets do not include pairs of localized video segments and referring expressions, or text descriptions which uniquely identify a corresponding moment.
no code implementations • CVPR 2017 • Rohit Girdhar, Deva Ramanan, Abhinav Gupta, Josef Sivic, Bryan Russell
In this work, we introduce a new video representation for action classification that aggregates local convolutional features across the entire spatio-temporal extent of the video.
Ranked #8 on Long-video Activity Recognition on Breakfast
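A compact sketch of VLAD-style aggregation over the full spatiotemporal extent of a video, in the spirit of the representation described above: local convolutional features are softly assigned to a small set of learned centers and their residuals are accumulated into one video-level descriptor. Cluster count and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VLADPool(nn.Module):
    """Softly assign local features to K learned centers and sum residuals."""
    def __init__(self, dim=512, k=32):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(k, dim) * 0.01)
        self.assign = nn.Linear(dim, k)

    def forward(self, feats):                      # feats: (N, dim) local descriptors
        a = F.softmax(self.assign(feats), dim=-1)  # (N, K) soft assignments
        resid = feats.unsqueeze(1) - self.centers  # (N, K, dim) residuals
        vlad = (a.unsqueeze(-1) * resid).sum(0)    # (K, dim) aggregated
        return F.normalize(vlad.flatten(), dim=0)  # single video-level vector

# Local conv features pooled over all frames and spatial positions of a video.
local_feats = torch.randn(30 * 7 * 7, 512)         # 30 frames, 7x7 feature map
video_descriptor = VLADPool()(local_feats)         # shape: (32 * 512,)
```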
1 code implementation • 21 Feb 2017 • Aayush Bansal, Xinlei Chen, Bryan Russell, Abhinav Gupta, Deva Ramanan
We explore design principles for general pixel-level prediction problems, from low-level edge detection to mid-level surface normal estimation to high-level semantic segmentation.
no code implementations • NeurIPS 2016 • Peng Wang, Xiaohui Shen, Bryan Russell, Scott Cohen, Brian Price, Alan L. Yuille
This paper introduces an approach to regularize 2.5D surface normal and depth predictions at each pixel given a single input image.
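One way to see the coupling between the two predictions: surface normals can be approximated from spatial derivatives of depth, so a consistency term can penalize disagreement between the predicted normal map and the normals implied by the predicted depth. The finite-difference construction and intrinsics below are simplifying assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def normals_from_depth(depth, fx=500.0, fy=500.0):
    """Approximate surface normals from a depth map via finite differences.
    depth: (B, 1, H, W); fx, fy are illustrative camera intrinsics."""
    dzdx = depth[..., :, 1:] - depth[..., :, :-1]      # horizontal depth gradient
    dzdy = depth[..., 1:, :] - depth[..., :-1, :]      # vertical depth gradient
    dzdx = F.pad(dzdx, (0, 1))                         # pad back to H x W
    dzdy = F.pad(dzdy, (0, 0, 0, 1))
    n = torch.cat([-dzdx * fx, -dzdy * fy, torch.ones_like(depth)], dim=1)
    return F.normalize(n, dim=1)

def consistency_loss(pred_normals, pred_depth):
    """Penalize disagreement between predicted normals and depth-implied normals."""
    implied = normals_from_depth(pred_depth)
    return (1 - F.cosine_similarity(pred_normals, implied, dim=1)).mean()

depth = torch.rand(2, 1, 64, 64) + 1.0
normals = F.normalize(torch.randn(2, 3, 64, 64), dim=1)
loss = consistency_loss(normals, depth)
```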
no code implementations • 21 Sep 2016 • Aayush Bansal, Xinlei Chen, Bryan Russell, Abhinav Gupta, Deva Ramanan
We explore architectures for general pixel-level prediction problems, from low-level edge detection to mid-level surface normal estimation to high-level semantic segmentation.
no code implementations • CVPR 2016 • Aayush Bansal, Bryan Russell, Abhinav Gupta
We introduce an approach that leverages surface normal predictions, along with appearance cues, to retrieve 3D models for objects depicted in 2D still images from a large CAD object library.
no code implementations • CVPR 2016 • Francisco Massa, Bryan Russell, Mathieu Aubry
This paper presents an end-to-end convolutional neural network (CNN) for 2D-3D exemplar detection.
no code implementations • ICCV 2015 • Mathieu Aubry, Bryan Russell
The rendered images are presented to a trained CNN and responses for different layers are studied with respect to the input scene factors.
no code implementations • NeurIPS 2012 • Jianxiong Xiao, Bryan Russell, Antonio Torralba
In this paper we seek to detect rectangular cuboids and localize their corners in uncalibrated single-view images depicting everyday scenes.
no code implementations • NeurIPS 2009 • Bryan Russell, Alyosha Efros, Josef Sivic, Bill Freeman, Andrew Zisserman
In contrast to recent work in semantic alignment of scenes, we allow an input image to be explained by partial matches of similar scenes.