no code implementations • 8 Apr 2025 • Mingfei Chen, Israel D. Gebru, Ishwarya Ananthabhotla, Christian Richardt, Dejan Markovic, Jake Sandakly, Steven Krenn, Todd Keebler, Eli Shlizerman, Alexander Richard
Given a pre-acquired recording of the scene from sparsely distributed microphones, SoundVista can synthesize the sound of that scene from an unseen target viewpoint.
no code implementations • 7 Apr 2025 • Jihyun Lee, Weipeng Xu, Alexander Richard, Shih-En Wei, Shunsuke Saito, Shaojie Bai, Te-Li Wang, Minhyuk Sung, Tae-Kyun Kim, Jason Saragih
To enable real-time inference, we introduce (1) cascaded body-hand denoising diffusion, which effectively models the correlation between egocentric body and hand motions in a fast, feed-forward manner, and (2) diffusion distillation, which enables high-quality motion estimation with a single denoising step.
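To make the two ideas concrete, below is a minimal sketch of cascaded body-then-hand denoising with a single distilled step. The module shapes, conditioning features, and one-step inference are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch of cascaded body-hand denoising with a distilled single step.
# All dimensions and modules are illustrative assumptions.
import torch
import torch.nn as nn

class Denoiser(nn.Module):
    def __init__(self, in_dim, cond_dim, out_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim + cond_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x_t, cond, t):
        # Predict the clean signal from a noisy sample, conditioning, and timestep.
        t = t.expand(x_t.shape[0], 1)
        return self.net(torch.cat([x_t, cond, t], dim=-1))

body_denoiser = Denoiser(in_dim=63, cond_dim=32, out_dim=63)        # body pose
hand_denoiser = Denoiser(in_dim=90, cond_dim=32 + 63, out_dim=90)   # hands, conditioned on body

def estimate_motion(headset_feat):
    """Single-step ('distilled') cascaded inference: body first, then hands."""
    B = headset_feat.shape[0]
    t = torch.ones(1, 1)  # a distilled model runs one denoising step from pure noise
    body = body_denoiser(torch.randn(B, 63), headset_feat, t)
    hands = hand_denoiser(torch.randn(B, 90), torch.cat([headset_feat, body], -1), t)
    return body, hands

body, hands = estimate_motion(torch.randn(4, 32))
```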
1 code implementation • 3 Mar 2025 • Simon Welker, Matthew Le, Ricky T. Q. Chen, Wei-Ning Hsu, Timo Gerkmann, Alexander Richard, Yi-Chiao Wu
We propose FlowDec, a neural full-band audio codec for general audio sampled at 48 kHz that combines non-adversarial codec training with a stochastic postfilter based on a novel conditional flow matching method.
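A rough sketch of the postfilter idea, under assumed frame sizes, step counts, and noise scale (not FlowDec's actual configuration): starting near the deterministic codec output, integrate a learned conditional velocity field toward clean audio with a few Euler steps.

```python
# Sketch of a stochastic postfilter via conditional flow matching.
# Network, frame size, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

velocity = nn.Sequential(nn.Linear(2 * 480 + 1, 512), nn.SiLU(), nn.Linear(512, 480))

def postfilter(coded_frame, n_steps=8, sigma=0.1):
    # Initialize near the codec output rather than from pure noise.
    x = coded_frame + sigma * torch.randn_like(coded_frame)
    for i in range(n_steps):
        t = torch.full((x.shape[0], 1), i / n_steps)
        v = velocity(torch.cat([x, coded_frame, t], dim=-1))
        x = x + v / n_steps  # Euler step along the learned flow
    return x

enhanced = postfilter(torch.randn(2, 480))
```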
no code implementations • 18 Feb 2025 • Aggelina Chatziagapi, Louis-Philippe Morency, Hongyu Gong, Michael Zollhoefer, Dimitris Samaras, Alexander Richard
Through extensive experiments, we show that our method outperforms prior work, synthesizing natural-looking 4D talking avatars.
no code implementations • 4 Feb 2025 • Yi-Chiao Wu, Dejan Marković, Steven Krenn, Israel D. Gebru, Alexander Richard
Neural audio codecs have been widely adopted in audio-generative tasks because their compact and discrete representations are suitable for both large-language-model-style and regression-based generative models.
no code implementations • 18 Jul 2024 • Chao Huang, Dejan Markovic, Chenliang Xu, Alexander Richard
While rendering and animation of photorealistic 3D human body models have matured and reached impressive quality in recent years, modeling the spatial audio associated with such full-body models has been largely ignored so far.
2 code implementations • 10 Jun 2024 • Julius Richter, Yi-Chiao Wu, Steven Krenn, Simon Welker, Bunlong Lay, Shinji Watanabe, Alexander Richard, Timo Gerkmann
We release the EARS (Expressive Anechoic Recordings of Speech) dataset, a high-quality speech dataset comprising 107 speakers from diverse backgrounds, totaling 100 hours of clean, anechoic speech data.
no code implementations • CVPR 2024 • Ziyang Chen, Israel D. Gebru, Christian Richardt, Anurag Kumar, William Laney, Andrew Owens, Alexander Richard
The dataset includes high-quality and densely captured room impulse response data paired with multi-view images, and precise 6DoF pose tracking data for sound emitters and listeners in the rooms.
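As a small illustration of what paired impulse responses and poses enable (with synthetic stand-in signals, not the dataset itself): the sound at a listener pose is the source signal convolved with the room impulse response measured for that emitter/listener pair.

```python
# Rendering sound at a listener pose from a measured room impulse response.
# Both signals below are toy stand-ins for illustration.
import numpy as np

source = np.random.randn(16000)                                  # 1 s of source audio at 16 kHz
rir = np.exp(-np.linspace(0, 8, 4000)) * np.random.randn(4000)   # toy decaying RIR
at_listener = np.convolve(source, rir)                           # rendered signal at the listener
```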
no code implementations • 22 Jan 2024 • Yi-Chiao Wu, Dejan Marković, Steven Krenn, Israel D. Gebru, Alexander Richard
Although recent mainstream waveform-domain end-to-end (E2E) neural audio codecs achieve impressive coded audio quality with a very low bitrate, the quality gap between the coded and natural audio is still significant.
1 code implementation • CVPR 2024 • Evonne Ng, Javier Romero, Timur Bagautdinov, Shaojie Bai, Trevor Darrell, Angjoo Kanazawa, Alexander Richard
We present a framework for generating full-bodied photorealistic avatars that gesture according to the conversational dynamics of a dyadic interaction.
1 code implementation • NeurIPS 2023 • Xudong Xu, Dejan Markovic, Jacob Sandakly, Todd Keebler, Steven Krenn, Alexander Richard
While 3D human body modeling has received much attention in computer vision, its acoustic counterpart, i.e., modeling the 3D spatial audio produced by body motion and speech, has received far less attention in the community.
2 code implementations • 26 May 2023 • Yi-Chiao Wu, Israel D. Gebru, Dejan Marković, Alexander Richard
A good audio codec for live applications such as telecommunication is characterized by three key properties: (1) compression, i.e., the bitrate required to transmit the signal should be as low as possible; (2) latency, i.e., encoding and decoding the signal must be fast enough to enable communication with no, or only minimal, noticeable delay; and (3) reconstruction quality of the signal.
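As a back-of-the-envelope illustration of the compression property, the bitrate of a discrete codec follows directly from its frame rate and codebook configuration; the numbers below are hypothetical, not the paper's.

```python
# Bitrate arithmetic for a discrete neural codec with hypothetical numbers:
# 75 frames/s, 4 codebooks, 1024 entries each -> 75 * 4 * log2(1024) = 3000 bits/s.
import math

frames_per_second = 75
num_codebooks = 4
codebook_size = 1024
bitrate = frames_per_second * num_codebooks * math.log2(codebook_size)
print(f"{bitrate / 1000:.1f} kbps")  # 3.0 kbps
```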
no code implementations • CVPR 2023 • Changan Chen, Alexander Richard, Roman Shapovalov, Vamsi Krishna Ithapu, Natalia Neverova, Kristen Grauman, Andrea Vedaldi
We introduce the novel-view acoustic synthesis (NVAS) task: given the sight and sound observed at a source viewpoint, can we synthesize the sound of that scene from an unseen target viewpoint?
1 code implementation • 22 Jul 2022 • Cheng-hsin Wuu, Ningyuan Zheng, Scott Ardisson, Rohan Bali, Danielle Belko, Eric Brockmeyer, Lucas Evans, Timothy Godisart, Hyowon Ha, Xuhua Huang, Alexander Hypes, Taylor Koska, Steven Krenn, Stephen Lombardi, Xiaomin Luo, Kevyn McPhail, Laura Millerschoen, Michal Perdoch, Mark Pitts, Alexander Richard, Jason Saragih, Junko Saragih, Takaaki Shiratori, Tomas Simon, Matt Stewart, Autumn Trimble, Xinshuo Weng, David Whitewolf, Chenglei Wu, Shoou-I Yu, Yaser Sheikh
Along with the release of the dataset, we conduct ablation studies on how different model architectures affect the model's ability to interpolate to novel viewpoints and expressions.
no code implementations • 8 Jul 2022 • Wen Chin Huang, Dejan Markovic, Alexander Richard, Israel Dejene Gebru, Anjali Menon
In this work, we present an end-to-end binaural speech synthesis system that combines a low-bitrate audio codec with a powerful binaural decoder that is capable of accurate speech binauralization while faithfully reconstructing environmental factors like ambient noise or reverb.
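A schematic of that two-stage pipeline, with placeholder modules and shapes: the output of a low-bitrate codec is binauralized by a decoder conditioned on pose features. Everything here is a stand-in for illustration, not the paper's model.

```python
# Schematic two-stage pipeline: low-bitrate mono codec -> binaural decoder.
# Module internals, frame size, and pose features are placeholder assumptions.
import torch
import torch.nn as nn

class BinauralDecoder(nn.Module):
    def __init__(self, frame=240, pose_dim=6, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(frame + pose_dim, hidden), nn.SiLU(),
                                 nn.Linear(hidden, 2 * frame))

    def forward(self, mono_frame, pose):
        # pose: relative source/listener position and orientation features
        out = self.net(torch.cat([mono_frame, pose], dim=-1))
        return out.view(-1, 2, mono_frame.shape[-1])  # (batch, ears, samples)

decoder = BinauralDecoder()
decoded_mono = torch.randn(8, 240)   # stands in for the low-bitrate codec output
pose = torch.randn(8, 6)
binaural = decoder(decoded_mono, pose)
```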
no code implementations • 30 Jun 2022 • Dejan Markovic, Alexandre Defossez, Alexander Richard
We present a single-stage causal waveform-to-waveform multichannel model that can separate moving sound sources based on their broad spatial locations in a dynamic acoustic scene.
1 code implementation • CVPR 2022 • Karren Yang, Dejan Markovic, Steven Krenn, Vasu Agrawal, Alexander Richard
Since facial actions such as lip movements contain significant information about speech content, it is not surprising that audio-visual speech enhancement methods are more accurate than their audio-only counterparts.
no code implementations • 15 Mar 2022 • Emre Aksan, Shugao Ma, Akin Caliskan, Stanislav Pidhorskyi, Alexander Richard, Shih-En Wei, Jason Saragih, Otmar Hilliges
To mitigate this asymmetry, we introduce a prior model that is conditioned on the runtime inputs and tie this prior space to the 3D face model via a normalizing flow in the latent space.
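A toy illustration of the flow-based coupling described above, compressing a full normalizing flow into a single conditional affine layer: samples from the runtime-input-conditioned prior are mapped into the face model's latent space. Everything here is schematic rather than the paper's model.

```python
# Toy conditional affine flow tying a conditioned prior to a fixed latent space.
# A single affine layer stands in for a full normalizing flow.
import torch
import torch.nn as nn

class ConditionalAffineFlow(nn.Module):
    def __init__(self, latent_dim, cond_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(cond_dim, hidden), nn.SiLU(),
                                 nn.Linear(hidden, 2 * latent_dim))

    def forward(self, z, cond):
        # z ~ N(0, I) from the prior; map it into the face model's latent space.
        log_scale, shift = self.net(cond).chunk(2, dim=-1)
        return z * log_scale.exp() + shift

flow = ConditionalAffineFlow(latent_dim=64, cond_dim=16)
headset_inputs = torch.randn(4, 16)   # hypothetical runtime conditioning
face_latent = flow(torch.randn(4, 64), headset_inputs)
```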
2 code implementations • 10 Feb 2022 • Yen-Ju Lu, Zhong-Qiu Wang, Shinji Watanabe, Alexander Richard, Cheng Yu, Yu Tsao
Speech enhancement is a critical component of many user-oriented audio applications, yet current systems still suffer from distorted and unnatural outputs.
Ranked #6 on Speech Enhancement on EARS-WHAM
no code implementations • 7 Feb 2022 • Alexander Richard, Peter Dodds, Vamsi Krishna Ithapu
Impulse response estimation in high noise and in-the-wild settings, with minimal control of the underlying data distributions, is a challenging problem.
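For context, a classical baseline for this problem is regularized frequency-domain deconvolution against a known excitation signal; the sketch below shows that textbook method, not the approach proposed in the paper.

```python
# Classical impulse response estimation via regularized deconvolution.
# Signals below are synthetic; this is a textbook baseline, not the paper's method.
import numpy as np

def estimate_ir(excitation, recording, eps=1e-3):
    n = len(excitation) + len(recording) - 1
    X = np.fft.rfft(excitation, n)
    Y = np.fft.rfft(recording, n)
    # Wiener-style regularization keeps the division stable in noisy bands.
    H = Y * np.conj(X) / (np.abs(X) ** 2 + eps)
    return np.fft.irfft(H, n)

x = np.random.randn(1000)                          # known test signal
h_true = np.zeros(200); h_true[[0, 50, 120]] = [1.0, 0.5, 0.25]
y = np.convolve(x, h_true)                         # simulated room recording
h_est = estimate_ir(x, y)[:200]
```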
2 code implementations • ICCV 2021 • Alexander Richard, Michael Zollhoefer, Yandong Wen, Fernando de la Torre, Yaser Sheikh
To improve upon existing models, we propose a generic audio-driven facial animation approach that achieves highly realistic motion synthesis results for the entire face.
Ranked #2 on 3D Face Animation on VOCASET
no code implementations • ICLR 2021 • Alexander Richard, Dejan Markovic, Israel D. Gebru, Steven Krenn, Gladstone Alexander Butler, Fernando de la Torre, Yaser Sheikh
We present a neural rendering approach for binaural sound synthesis that can produce realistic and spatially accurate binaural sound in real time.
no code implementations • 11 Aug 2020 • Alexander Richard, Colin Lea, Shugao Ma, Juergen Gall, Fernando de la Torre, Yaser Sheikh
Codec Avatars are a recent class of learned, photorealistic face models that accurately represent the geometry and texture of a person in 3D (i.e., for virtual reality), and are almost indistinguishable from video.
no code implementations • 19 May 2020 • Yaser Souri, Alexander Richard, Luca Minciullo, Juergen Gall
Action segmentation is the task of temporally segmenting every frame of an untrimmed video.
1 code implementation • 3 Jun 2019 • Hilde Kuehne, Ahsan Iqbal, Alexander Richard, Juergen Gall
Action recognition has so far mainly focused on classifying hand-selected, pre-clipped actions, reaching impressive results in this field.
no code implementations • 3 Jun 2019 • Hilde Kuehne, Alexander Richard, Juergen Gall
Action recognition has become a rapidly developing research field within the last decade.
no code implementations • CVPR 2018 • Alexander Richard, Hilde Kuehne, Ahsan Iqbal, Juergen Gall
Video learning is an important task in computer vision and has attracted increasing interest in recent years.
1 code implementation • CVPR 2018 • Yazan Abu Farha, Alexander Richard, Juergen Gall
Analyzing human actions in videos has gained increased attention recently.
no code implementations • 27 Jun 2017 • Ahsan Iqbal, Alexander Richard, Hilde Kuehne, Juergen Gall
In this work, we propose a novel recurrent ConvNet architecture called recurrent residual networks to address the task of action recognition.
1 code implementation • CVPR 2018 • Alexander Richard, Hilde Kuehne, Juergen Gall
Action detection and temporal segmentation of actions in videos are topics of increasing interest.
1 code implementation • 23 Mar 2017 • Alexander Richard, Juergen Gall
In this work, we propose a recurrent neural network that is equivalent to the traditional bag-of-words approach but enables the application of discriminative training.
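The core equivalence can be sketched in a few lines: a recurrent unit that soft-assigns each frame to a learned codebook and accumulates the assignments reproduces a (soft) bag-of-words histogram, with every part differentiable and trainable. Dimensions below are illustrative assumptions.

```python
# Sketch of a bag-of-words-equivalent recurrent network: recurrent accumulation
# of soft codebook assignments yields a differentiable histogram.
import torch
import torch.nn as nn

class BoWRNN(nn.Module):
    def __init__(self, feat_dim=64, vocab=128, n_classes=10):
        super().__init__()
        self.assign = nn.Linear(feat_dim, vocab)      # codebook as a linear layer
        self.classify = nn.Linear(vocab, n_classes)   # classifier on the histogram

    def forward(self, frames):                        # frames: (T, feat_dim)
        hist = torch.zeros(self.assign.out_features)
        for f in frames:                              # recurrent accumulation
            hist = hist + torch.softmax(self.assign(f), dim=-1)
        return self.classify(hist / len(frames))      # normalized histogram

logits = BoWRNN()(torch.randn(50, 64))
```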
1 code implementation • CVPR 2017 • Alexander Richard, Hilde Kuehne, Juergen Gall
We present an approach for weakly supervised learning of human actions.
no code implementations • 7 Oct 2016 • Hilde Kuehne, Alexander Richard, Juergen Gall
Our system is based on the idea that, given a sequence of input data and a transcript, i.e., a list of the actions in the order they occur in the video, it is possible to infer the actions within the video stream, and thus learn the related action models without the need for any frame-based annotation.
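To make the inference idea concrete, a Viterbi-style dynamic program can align frames to the ordered transcript given framewise action scores; this is a generic alignment sketch under assumed scores, not the paper's exact model.

```python
# Monotone alignment of frames to an ordered transcript via dynamic programming.
# Scores are random stand-ins; the DP itself is the standard alignment recursion.
import numpy as np

def align(scores, transcript):
    """scores: (T, n_actions) log-probabilities; transcript: ordered action indices."""
    T, K = len(scores), len(transcript)
    dp = np.full((T, K), -np.inf)
    dp[0, 0] = scores[0, transcript[0]]
    for t in range(1, T):
        for k in range(K):
            stay = dp[t - 1, k]
            advance = dp[t - 1, k - 1] if k > 0 else -np.inf
            dp[t, k] = scores[t, transcript[k]] + max(stay, advance)
    # Backtrace: recover which transcript entry each frame belongs to.
    labels, k = [], K - 1
    for t in range(T - 1, 0, -1):
        labels.append(transcript[k])
        if k > 0 and dp[t - 1, k - 1] > dp[t - 1, k]:
            k -= 1
    labels.append(transcript[0])
    return labels[::-1]

labels = align(np.log(np.random.dirichlet(np.ones(5), size=40)), [0, 2, 1])
```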
1 code implementation • CVPR 2016 • Alexander Richard, Juergen Gall
While current approaches to action recognition on pre-segmented video clips already achieve high accuracies, temporal action detection still falls far short of comparable results.