Search Results for author: Ruohan Gao

Found 25 papers, 10 papers with code

The Audio-Visual Conversational Graph: From an Egocentric-Exocentric Perspective

no code implementations 20 Dec 2023 Wenqi Jia, Miao Liu, Hao Jiang, Ishwarya Ananthabhotla, James M. Rehg, Vamsi Krishna Ithapu, Ruohan Gao

We propose a unified multi-modal framework -- Audio-Visual Conversational Attention (AV-CONV), for the joint prediction of conversation behaviors -- speaking and listening -- for both the camera wearer as well as all other social partners present in the egocentric video.

SoundCam: A Dataset for Finding Humans Using Room Acoustics

no code implementations NeurIPS 2023 Mason Wang, Samuel Clarke, Jui-Hsien Wang, Ruohan Gao, Jiajun Wu

A room's acoustic properties are a product of the room's geometry, the objects within the room, and their specific positions.

NOIR: Neural Signal Operated Intelligent Robots for Everyday Activities

no code implementations 2 Nov 2023 Ruohan Zhang, Sharon Lee, Minjune Hwang, Ayano Hiranaka, Chen Wang, Wensi Ai, Jin Jie Ryan Tan, Shreya Gupta, Yilun Hao, Gabrael Levine, Ruohan Gao, Anthony Norcia, Li Fei-Fei, Jiajun Wu

We present Neural Signal Operated Intelligent Robots (NOIR), a general-purpose, intelligent brain-robot interface system that enables humans to command robots to perform everyday activities through brain signals.

Sonicverse: A Multisensory Simulation Platform for Embodied Household Agents that See and Hear

1 code implementation 1 Jun 2023 Ruohan Gao, Hao Li, Gokul Dharan, Zhuzhu Wang, Chengshu Li, Fei Xia, Silvio Savarese, Li Fei-Fei, Jiajun Wu

We introduce Sonicverse, a multisensory simulation platform with integrated audio-visual simulation for training household agents that can both see and hear.

Multi-Task Learning, Visual Navigation

The ObjectFolder Benchmark: Multisensory Learning with Neural and Real Objects

no code implementations CVPR 2023 Ruohan Gao, Yiming Dou, Hao Li, Tanmay Agarwal, Jeannette Bohg, Yunzhu Li, Li Fei-Fei, Jiajun Wu

We introduce the ObjectFolder Benchmark, a benchmark suite of 10 tasks for multisensory object-centric learning, centered around object recognition, reconstruction, and manipulation with sight, sound, and touch.

Benchmarking, Object +1

An Extensible Multimodal Multi-task Object Dataset with Materials

no code implementations 29 Apr 2023 Trevor Standley, Ruohan Gao, Dawn Chen, Jiajun Wu, Silvio Savarese

For example, we can train a model to predict the object category from the listing text, or the mass and price from the product listing image.

Attribute, Multi-Task Learning +1

Differentiable Physics Simulation of Dynamics-Augmented Neural Objects

no code implementations 17 Oct 2022 Simon Le Cleac'h, Hong-Xing Yu, Michelle Guo, Taylor A. Howell, Ruohan Gao, Jiajun Wu, Zachary Manchester, Mac Schwager

A robot can use this simulation to optimize grasps and manipulation trajectories of neural objects, or to improve the neural object models through gradient-based real-to-simulation transfer.

Friction, Object

ObjectFolder 2.0: A Multisensory Object Dataset for Sim2Real Transfer

1 code implementation CVPR 2022 Ruohan Gao, Zilin Si, Yen-Yu Chang, Samuel Clarke, Jeannette Bohg, Li Fei-Fei, Wenzhen Yuan, Jiajun Wu

We present ObjectFolder 2.0, a large-scale, multisensory dataset of common household objects in the form of implicit neural representations that significantly enhances ObjectFolder 1.0 in three aspects.

Visual Acoustic Matching

no code implementations CVPR 2022 Changan Chen, Ruohan Gao, Paul Calamia, Kristen Grauman

We introduce the visual acoustic matching task, in which an audio clip is transformed to sound like it was recorded in a target environment.
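A standard building block behind this kind of task is convolving dry audio with a room impulse response (RIR), so the clip sounds as if it were recorded in the RIR's room. A minimal pure-Python sketch of that operation (the signal and RIR values below are illustrative, not from the paper):

```python
def convolve(signal, rir):
    """Naive time-domain convolution: applies a room impulse
    response (RIR) to a dry signal, simulating how the signal
    would sound when played back in the RIR's room."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(rir):
            out[i + j] += s * h
    return out

# A unit-impulse RIR (direct path only) leaves the signal unchanged.
dry = [1.0, 0.5, -0.25]
assert convolve(dry, [1.0]) == dry

# An RIR with a delayed tap adds a quieter copy two samples later.
echoed = convolve(dry, [1.0, 0.0, 0.3])
```

Real systems do this with FFT-based convolution for speed; visual acoustic matching, as described above, effectively learns the target room's transformation from an image rather than measuring an RIR directly.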

Geometry-Aware Multi-Task Learning for Binaural Audio Generation from Video

no code implementations 21 Nov 2021 Rishabh Garg, Ruohan Gao, Kristen Grauman

Binaural audio provides human listeners with an immersive spatial sound experience, but most existing videos lack binaural audio recordings.

Multi-Task Learning, Room Impulse Response (RIR)

VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency

1 code implementation CVPR 2021 Ruohan Gao, Kristen Grauman

Given a video, the goal is to extract the speech associated with a face in spite of simultaneous background sounds and/or other human speakers.

Speech Separation

Learning to Set Waypoints for Audio-Visual Navigation

1 code implementation ICLR 2021 Changan Chen, Sagnik Majumder, Ziad Al-Halah, Ruohan Gao, Santhosh Kumar Ramakrishnan, Kristen Grauman

In audio-visual navigation, an agent intelligently travels through a complex, unmapped 3D environment using both sights and sounds to find a sound source (e.g., a phone ringing in another room).

Visual Navigation

VisualEchoes: Spatial Image Representation Learning through Echolocation

no code implementations ECCV 2020 Ruohan Gao, Changan Chen, Ziad Al-Halah, Carl Schissler, Kristen Grauman

Several animal species (e.g., bats, dolphins, and whales) and even visually impaired humans have the remarkable ability to perform echolocation: a biological sonar used to perceive spatial layout and locate objects in the world.

Monocular Depth Estimation, Representation Learning +2

Co-Separating Sounds of Visual Objects

3 code implementations ICCV 2019 Ruohan Gao, Kristen Grauman

Learning how objects sound from video is challenging, since they often heavily overlap in a single audio channel.

Audio Denoising, Audio Source Separation +1

2.5D Visual Sound

2 code implementations CVPR 2019 Ruohan Gao, Kristen Grauman

We devise a deep convolutional neural network that learns to decode the monaural (single-channel) soundtrack into its binaural counterpart by injecting visual information about object and scene configurations.
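One common formulation of mono-to-binaural upmixing has the network predict the difference of the two channels, with the mono track taken as their sum; the left and right channels are then recovered algebraically. A sketch of that reconstruction step, with a hand-supplied difference signal standing in for the network output (this convention is an assumption for illustration, not a description of the paper's exact architecture):

```python
def binauralize(mono, diff):
    """Recover left/right channels from a mono mix and a predicted
    channel-difference signal, assuming mono = L + R and diff = L - R."""
    left = [(m + d) / 2 for m, d in zip(mono, diff)]
    right = [(m - d) / 2 for m, d in zip(mono, diff)]
    return left, right

# Round trip: mixing a known binaural pair and re-splitting it with
# the true difference recovers the original channels exactly.
L = [0.25, -0.5, 0.75]
R = [0.125, 0.25, -0.5]
mono = [l + r for l, r in zip(L, R)]
diff = [l - r for l, r in zip(L, R)]
assert binauralize(mono, diff) == (L, R)
```

Predicting the difference rather than both channels directly exploits the fact that the mono input already constrains their sum, so the network only has to learn the spatial part.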

Learning to Separate Object Sounds by Watching Unlabeled Video

2 code implementations ECCV 2018 Ruohan Gao, Rogerio Feris, Kristen Grauman

Our work is the first to learn audio source separation from large-scale "in the wild" videos containing multiple audio sources per video.

Audio Denoising, Audio Source Separation +2

Im2Flow: Motion Hallucination from Static Images for Action Recognition

4 code implementations CVPR 2018 Ruohan Gao, Bo Xiong, Kristen Grauman

Second, we show the power of hallucinated flow for recognition, successfully transferring the learned motion into a standard two-stream network for activity recognition.

Action Recognition, Hallucination +2

ShapeCodes: Self-Supervised Feature Learning by Lifting Views to Viewgrids

no code implementations ECCV 2018 Dinesh Jayaraman, Ruohan Gao, Kristen Grauman

We introduce an unsupervised feature learning approach that embeds 3D shape information into a single-view image representation.

Object, Object Recognition

On-Demand Learning for Deep Image Restoration

1 code implementation ICCV 2017 Ruohan Gao, Kristen Grauman

While machine learning approaches to image restoration offer great promise, current methods risk training models fixated on performing well only for image corruption of a particular level of difficulty, such as a certain level of noise or blur.
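The on-demand idea can be illustrated with a toy scheduler that allocates each training batch across corruption-difficulty levels in inverse proportion to the model's current per-level validation score, so the weakest levels get the most data. The function name and the PSNR-like numbers below are hypothetical, not from the paper:

```python
def on_demand_allocation(val_scores, batch_size):
    """Split a training batch across difficulty levels in proportion
    to how poorly the model currently handles each level: a lower
    validation score (e.g., PSNR) yields a larger share of samples."""
    # Inverse-performance weights; scores are assumed positive.
    weights = [1.0 / s for s in val_scores]
    total = sum(weights)
    # Rounding can leave the total slightly off batch_size; a real
    # scheduler would redistribute the remainder.
    return [round(batch_size * w / total) for w in weights]

# The model is worst on the last (hardest) level, so that level
# receives the largest share of the batch.
alloc = on_demand_allocation([30.0, 24.0, 15.0], batch_size=64)
```

This keeps training from fixating on one corruption level: as the model improves on easy inputs, their scores rise and their share of the batch shrinks automatically.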

Deblurring, Image Deblurring +3

Object-Centric Representation Learning from Unlabeled Videos

no code implementations 1 Dec 2016 Ruohan Gao, Dinesh Jayaraman, Kristen Grauman

Compared to existing temporal coherence methods, our idea has the advantage of lightweight preprocessing of the unlabeled video (no tracking required) while still being able to extract object-level regions from which to learn invariances.

Image Classification, Object +2
