Search Results for author: Ruohan Gao

Found 36 papers, 16 papers with code

Learning to Highlight Audio by Watching Movies

no code implementations CVPR 2025 Chao Huang, Ruohan Gao, J. M. F. Tsang, Jan Kurcius, Cagdas Bilen, Chenliang Xu, Anurag Kumar, Sanjeel Parekh

To train our model, we also introduce a new dataset -- the muddy mix dataset, leveraging the meticulous audio and video crafting found in movies, which provides a form of free supervision.

Differentiable Room Acoustic Rendering with Multi-View Vision Priors

no code implementations 30 Apr 2025 Derong Jin, Ruohan Gao

An immersive acoustic experience enabled by spatial audio is just as crucial as the visual aspect in creating realistic virtual environments.

Aurelia: Test-time Reasoning Distillation in Audio-Visual LLMs

1 code implementation 29 Mar 2025 Sanjoy Chowdhury, Hanan Gani, Nishit Anand, Sayan Nag, Ruohan Gao, Mohamed Elhoseiny, Salman Khan, Dinesh Manocha

This performance gain highlights the potential of reasoning-enhanced data generation for advancing AVLLMs in real-world applications.

AVTrustBench: Assessing and Enhancing Reliability and Robustness in Audio-Visual LLMs

no code implementations 3 Jan 2025 Sanjoy Chowdhury, Sayan Nag, Subhrajyoti Dasgupta, Yaoting Wang, Mohamed Elhoseiny, Ruohan Gao, Dinesh Manocha

With the rapid advancement of Multi-modal Large Language Models (MLLMs), several diagnostic benchmarks have recently been developed to assess these models' multi-modal reasoning proficiency.

Adversarial Attack Diagnostic

Novel View Acoustic Parameter Estimation

no code implementations 31 Oct 2024 Ricardo Falcon-Perez, Ruohan Gao, Gregor Mueckl, Sebastia V. Amengual Gari, Ishwarya Ananthabhotla

The task of Novel View Acoustic Synthesis (NVAS) - generating Room Impulse Responses (RIRs) for unseen source and receiver positions in a scene - has recently gained traction, especially given its relevance to Augmented Reality (AR) and Virtual Reality (VR) development.

3D geometry Image-to-Image Translation +2
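
RT60, the time for sound energy in a room to decay by 60 dB, is a representative example of the acoustic parameters this task concerns. As an illustration of how such a parameter is read off an RIR, here is a standard Schroeder backward-integration estimate in plain Python (the textbook definition, not this paper's learned method; the toy RIR and sample rate are made up):

```python
import math

def rt60_from_rir(rir, sr):
    """Estimate RT60 via Schroeder backward integration: the time
    for the energy decay curve (EDC) to fall by 60 dB."""
    energy = [s * s for s in rir]
    edc, total = [], 0.0
    for e in reversed(energy):       # backward cumulative energy
        total += e
        edc.append(total)
    edc.reverse()
    for i, e in enumerate(edc):
        if 10.0 * math.log10(e / edc[0]) <= -60.0:
            return i / sr
    return len(rir) / sr

# Toy RIR whose level decays at 120 dB/s, so RT60 should be ~0.5 s.
sr = 1000
rir = [10.0 ** (-120.0 * (n / sr) / 20.0) for n in range(sr)]
print(rt60_from_rir(rir, sr))  # ~0.5
```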

Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time

1 code implementation 1 Jul 2024 Sanjoy Chowdhury, Sayan Nag, Subhrajyoti Dasgupta, Jun Chen, Mohamed Elhoseiny, Ruohan Gao, Dinesh Manocha

Leveraging Large Language Models' remarkable proficiency in text-based tasks, recent works on Multi-modal LLMs (MLLMs) extend them to other modalities like vision and audio.

Fact Checking Language Modeling +3

Hearing Anything Anywhere

1 code implementation CVPR 2024 Mason Wang, Ryosuke Sawata, Samuel Clarke, Ruohan Gao, Shangzhe Wu, Jiajun Wu

Recent years have seen immense progress in 3D computer vision and computer graphics, with emerging tools that can virtualize real-world 3D environments for numerous Mixed Reality (XR) applications.

Mixed Reality Room Impulse Response (RIR)

FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning

1 code implementation 29 Feb 2024 Gabriele Oliaro, Xupeng Miao, Xinhao Cheng, Vineeth Kada, Ruohan Gao, Yingyi Huang, Remi Delacourt, April Yang, Yingcheng Wang, Mengdi Wu, Colin Unger, Zhihao Jia

Finetuning large language models (LLMs) is essential for task adaptation, yet serving stacks today isolate inference and finetuning on separate GPU clusters -- wasting resources and under-utilizing hardware.

Language Modeling Language Modelling +1

The Audio-Visual Conversational Graph: From an Egocentric-Exocentric Perspective

no code implementations CVPR 2024 Wenqi Jia, Miao Liu, Hao Jiang, Ishwarya Ananthabhotla, James M. Rehg, Vamsi Krishna Ithapu, Ruohan Gao

We propose a unified multi-modal framework -- Audio-Visual Conversational Attention (AV-CONV), for the joint prediction of conversation behaviors -- speaking and listening -- for both the camera wearer as well as all other social partners present in the egocentric video.

SoundCam: A Dataset for Finding Humans Using Room Acoustics

no code implementations NeurIPS 2023 Mason Wang, Samuel Clarke, Jui-Hsien Wang, Ruohan Gao, Jiajun Wu

A room's acoustic properties are a product of the room's geometry, the objects within the room, and their specific positions.

NOIR: Neural Signal Operated Intelligent Robots for Everyday Activities

no code implementations 2 Nov 2023 Ruohan Zhang, Sharon Lee, Minjune Hwang, Ayano Hiranaka, Chen Wang, Wensi Ai, Jin Jie Ryan Tan, Shreya Gupta, Yilun Hao, Gabrael Levine, Ruohan Gao, Anthony Norcia, Li Fei-Fei, Jiajun Wu

We present Neural Signal Operated Intelligent Robots (NOIR), a general-purpose, intelligent brain-robot interface system that enables humans to command robots to perform everyday activities through brain signals.

EEG

Sonicverse: A Multisensory Simulation Platform for Embodied Household Agents that See and Hear

1 code implementation 1 Jun 2023 Ruohan Gao, Hao Li, Gokul Dharan, Zhuzhu Wang, Chengshu Li, Fei Xia, Silvio Savarese, Li Fei-Fei, Jiajun Wu

We introduce Sonicverse, a multisensory simulation platform with integrated audio-visual simulation for training household agents that can both see and hear.

Multi-Task Learning Visual Navigation

The ObjectFolder Benchmark: Multisensory Learning with Neural and Real Objects

no code implementations CVPR 2023 Ruohan Gao, Yiming Dou, Hao Li, Tanmay Agarwal, Jeannette Bohg, Yunzhu Li, Li Fei-Fei, Jiajun Wu

We introduce the ObjectFolder Benchmark, a benchmark suite of 10 tasks for multisensory object-centric learning, centered around object recognition, reconstruction, and manipulation with sight, sound, and touch.

Benchmarking Object +1

An Extensible Multimodal Multi-task Object Dataset with Materials

no code implementations 29 Apr 2023 Trevor Standley, Ruohan Gao, Dawn Chen, Jiajun Wu, Silvio Savarese

For example, we can train a model to predict the object category from the listing text, or the mass and price from the product listing image.

Attribute Multi-Task Learning +1

Differentiable Physics Simulation of Dynamics-Augmented Neural Objects

1 code implementation 17 Oct 2022 Simon Le Cleac'h, Hong-Xing Yu, Michelle Guo, Taylor A. Howell, Ruohan Gao, Jiajun Wu, Zachary Manchester, Mac Schwager

A robot can use this simulation to optimize grasps and manipulation trajectories of neural objects, or to improve the neural object models through gradient-based real-to-simulation transfer.

Friction Object

ObjectFolder 2.0: A Multisensory Object Dataset for Sim2Real Transfer

1 code implementation CVPR 2022 Ruohan Gao, Zilin Si, Yen-Yu Chang, Samuel Clarke, Jeannette Bohg, Li Fei-Fei, Wenzhen Yuan, Jiajun Wu

We present ObjectFolder 2.0, a large-scale, multisensory dataset of common household objects in the form of implicit neural representations that significantly enhances ObjectFolder 1.0 in three aspects.

Object

Visual Acoustic Matching

1 code implementation CVPR 2022 Changan Chen, Ruohan Gao, Paul Calamia, Kristen Grauman

We introduce the visual acoustic matching task, in which an audio clip is transformed to sound like it was recorded in a target environment.
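
Acoustically, "sounding like it was recorded in a target environment" amounts to convolving the source audio with that room's impulse response; the paper learns this transformation from an image of the space rather than from a measured RIR. A minimal sketch of the underlying convolution, with toy signals (plain Python, purely illustrative):

```python
def apply_rir(dry, rir):
    """Convolve a dry signal with a room impulse response (RIR),
    producing the 'wet' signal as heard in that room."""
    wet = [0.0] * (len(dry) + len(rir) - 1)
    for i, s in enumerate(dry):
        for j, h in enumerate(rir):
            wet[i + j] += s * h
    return wet

dry = [1.0, 0.0, 0.0, 0.0]           # a unit impulse as the dry signal
rir = [1.0, 0.0, 0.5, 0.0, 0.25]     # toy RIR: direct path plus two echoes
print(apply_rir(dry, rir))  # convolving an impulse reproduces the RIR
```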

Geometry-Aware Multi-Task Learning for Binaural Audio Generation from Video

no code implementations 21 Nov 2021 Rishabh Garg, Ruohan Gao, Kristen Grauman

Binaural audio provides human listeners with an immersive spatial sound experience, but most existing videos lack binaural audio recordings.

Multi-Task Learning Room Impulse Response (RIR)

VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency

1 code implementation CVPR 2021 Ruohan Gao, Kristen Grauman

Given a video, the goal is to extract the speech associated with a face in spite of simultaneous background sounds and/or other human speakers.

Speech Separation

Learning to Set Waypoints for Audio-Visual Navigation

1 code implementation ICLR 2021 Changan Chen, Sagnik Majumder, Ziad Al-Halah, Ruohan Gao, Santhosh Kumar Ramakrishnan, Kristen Grauman

In audio-visual navigation, an agent intelligently travels through a complex, unmapped 3D environment using both sights and sounds to find a sound source (e.g., a phone ringing in another room).

Visual Navigation

VisualEchoes: Spatial Image Representation Learning through Echolocation

no code implementations ECCV 2020 Ruohan Gao, Changan Chen, Ziad Al-Halah, Carl Schissler, Kristen Grauman

Several animal species (e.g., bats, dolphins, and whales) and even visually impaired humans have the remarkable ability to perform echolocation: a biological sonar used to perceive spatial layout and locate objects in the world.

Monocular Depth Estimation Representation Learning +3

Co-Separating Sounds of Visual Objects

3 code implementations ICCV 2019 Ruohan Gao, Kristen Grauman

Learning how objects sound from video is challenging, since they often heavily overlap in a single audio channel.

Audio Denoising Audio Source Separation +1

2.5D Visual Sound

2 code implementations CVPR 2019 Ruohan Gao, Kristen Grauman

We devise a deep convolutional neural network that learns to decode the monaural (single-channel) soundtrack into its binaural counterpart by injecting visual information about object and scene configurations.
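
The recombination step behind this decoding is simple arithmetic: if the network predicts the difference of the two channels from the mono mix and the visual features, the binaural channels follow directly. A toy round trip in plain Python (illustrating the mono-plus-difference decomposition, not the paper's network, which operates on spectrograms):

```python
def mono_to_binaural(mono, predicted_diff):
    """Recover left/right channels from a mono mix (left + right)
    and a predicted difference signal (left - right)."""
    left = [(m + d) / 2.0 for m, d in zip(mono, predicted_diff)]
    right = [(m - d) / 2.0 for m, d in zip(mono, predicted_diff)]
    return left, right

# Feeding the true difference signal recovers the original channels.
left_true = [0.0, 1.0, 0.5, -0.5]
right_true = [0.0, 0.5, 0.25, -1.0]
mono = [l + r for l, r in zip(left_true, right_true)]
diff = [l - r for l, r in zip(left_true, right_true)]
left, right = mono_to_binaural(mono, diff)
print(left == left_true, right == right_true)  # True True
```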

Learning to Separate Object Sounds by Watching Unlabeled Video

2 code implementations ECCV 2018 Ruohan Gao, Rogerio Feris, Kristen Grauman

Our work is the first to learn audio source separation from large-scale "in the wild" videos containing multiple audio sources per video.

Audio Denoising Audio Source Separation +2

Im2Flow: Motion Hallucination from Static Images for Action Recognition

4 code implementations CVPR 2018 Ruohan Gao, Bo Xiong, Kristen Grauman

Second, we show the power of hallucinated flow for recognition, successfully transferring the learned motion into a standard two-stream network for activity recognition.

Action Recognition Decoder +3

ShapeCodes: Self-Supervised Feature Learning by Lifting Views to Viewgrids

no code implementations ECCV 2018 Dinesh Jayaraman, Ruohan Gao, Kristen Grauman

We introduce an unsupervised feature learning approach that embeds 3D shape information into a single-view image representation.

Decoder Object +1

On-Demand Learning for Deep Image Restoration

1 code implementation ICCV 2017 Ruohan Gao, Kristen Grauman

While machine learning approaches to image restoration offer great promise, current methods risk training models fixated on performing well only for image corruption of a particular level of difficulty---such as a certain level of noise or blur.

Deblurring Image Deblurring +3
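
The core idea, allocating training samples across corruption difficulty levels according to how the model currently performs on each, can be sketched as follows (the noise levels and per-level error numbers here are hypothetical; the paper's actual scheduling rule may differ):

```python
import random

def corrupt(pixels, sigma, rng):
    """Add Gaussian noise of a chosen severity (the 'difficulty level')."""
    return [p + rng.gauss(0.0, sigma) for p in pixels]

# Sample harder corruption levels more often where the model
# currently does worse (error numbers are made up for illustration).
levels = [0.05, 0.10, 0.20, 0.40]      # noise sigmas, easy -> hard
errors = [0.01, 0.03, 0.08, 0.20]      # pretend per-level validation errors
weights = [e / sum(errors) for e in errors]

rng = random.Random(0)
sigma = rng.choices(levels, weights=weights, k=1)[0]
noisy = corrupt([0.5] * 8, sigma, rng)
print(sigma in levels, len(noisy))  # True 8
```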

Object-Centric Representation Learning from Unlabeled Videos

no code implementations 1 Dec 2016 Ruohan Gao, Dinesh Jayaraman, Kristen Grauman

Compared to existing temporal coherence methods, our idea has the advantage of lightweight preprocessing of the unlabeled video (no tracking required) while still being able to extract object-level regions from which to learn invariances.

image-classification Image Classification +4
