no code implementations • 26 Nov 2024 • Ziyang Chen, Prem Seetharaman, Bryan Russell, Oriol Nieto, David Bourgin, Andrew Owens, Justin Salamon
MultiFoley also allows users to choose reference audio from sound effects (SFX) libraries or partial videos for conditioning.
no code implementations • 24 Oct 2024 • S Sakshi, Utkarsh Tyagi, Sonal Kumar, Ashish Seth, Ramaneswaran Selvakumar, Oriol Nieto, Ramani Duraiswami, Sreyan Ghosh, Dinesh Manocha
The ability to comprehend audio, which includes speech, non-speech sounds, and music, is crucial for AI agents to interact effectively with the world.
no code implementations • 17 Sep 2024 • Ilaria Manco, Justin Salamon, Oriol Nieto
Audio-text contrastive models have become a powerful approach in music representation learning.
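As background for this entry, audio-text contrastive models are typically trained with a symmetric InfoNCE-style objective that aligns paired audio and caption embeddings in a shared space. The sketch below is a generic illustration of that objective, not the specific model from the paper; the embedding dimensions and temperature are illustrative assumptions.

```python
import numpy as np

def info_nce_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired audio/text embeddings.

    audio_emb, text_emb: (batch, dim) arrays; row i of each is a matched pair.
    Temperature 0.07 is a common default, assumed here for illustration.
    """
    # L2-normalize so dot products are cosine similarities
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature        # (batch, batch) similarity matrix
    labels = np.arange(len(a))            # matched pairs lie on the diagonal

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # cross-entropy in both directions (audio->text and text->audio)
    return 0.5 * (xent(logits) + xent(logits.T))
```

Matched pairs should yield a lower loss than mismatched ones, which is what drives the representations of audio and its caption together during training.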
1 code implementation • 13 Sep 2024 • Sreyan Ghosh, Sonal Kumar, Chandra Kiran Reddy Evuru, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha
To achieve this, we first propose ReCLAP, a CLAP model trained with rewritten audio captions for improved understanding of sounds in the wild.
1 code implementation • 17 Jun 2024 • Sreyan Ghosh, Sonal Kumar, Ashish Seth, Chandra Kiran Reddy Evuru, Utkarsh Tyagi, S Sakshi, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha
Next, we propose CompA-R (Instruction-Tuning for Complex Audio Reasoning), a synthetically generated instruction-tuning (IT) dataset with instructions that require the model to perform complex reasoning on the input audio.
1 code implementation • 24 May 2024 • Sreyan Ghosh, Chandra Kiran Reddy Evuru, Sonal Kumar, Utkarsh Tyagi, Oriol Nieto, Zeyu Jin, Dinesh Manocha
To address this gap, we introduce Visual Description Grounded Decoding (VDGD), a simple, robust, and training-free method designed to enhance visual perception and improve reasoning capabilities in LVLMs.
no code implementations • 12 Oct 2023 • Sreyan Ghosh, Ashish Seth, Sonal Kumar, Utkarsh Tyagi, Chandra Kiran Evuru, S. Ramaneswaran, S. Sakshi, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha
In this paper, we propose CompA, a collection of two expert-annotated benchmarks with a majority of real-world audio samples, to evaluate compositional reasoning in ALMs.
no code implementations • 17 Aug 2023 • Julia Wilkins, Justin Salamon, Magdalena Fuentes, Juan Pablo Bello, Oriol Nieto
We show that our system, trained using our automatic data curation pipeline, significantly outperforms baselines trained on in-the-wild data on the task of high-quality sound effects (HQ SFX) retrieval for video.
no code implementations • 2 Jun 2023 • Oriol Nieto, Zeyu Jin, Franck Dernoncourt, Justin Salamon
Spoken language recognition (SLR) is the task of automatically identifying the language present in a speech signal.
no code implementations • CVPR 2023 • Reuben Tan, Arijit Ray, Andrea Burns, Bryan A. Plummer, Justin Salamon, Oriol Nieto, Bryan Russell, Kate Saenko
We propose a self-supervised approach for learning to perform audio source separation in videos based on natural language queries, using only unlabeled video and audio pairs as training data.
no code implementations • 28 Apr 2022 • Nikhil Kandpal, Oriol Nieto, Zeyu Jin
Consumer-grade music recordings such as those captured by mobile devices typically contain distortions in the form of background noise, reverb, and microphone-induced EQ.
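The three degradation types named above (background noise, reverb, and microphone-induced EQ) can each be simulated with simple signal processing, which is how such enhancement systems commonly synthesize training pairs. The following is a minimal toy sketch of that idea, not the pipeline from the paper; the filter coefficients, impulse-response length, and SNR are illustrative assumptions.

```python
import numpy as np

def degrade(clean, sr=16000, snr_db=20.0, rng=None):
    """Apply a toy reverb, a crude EQ tilt, and background noise to a signal."""
    if rng is None:
        rng = np.random.default_rng(0)
    # 1) Reverb: convolve with an exponentially decaying noise impulse response
    ir = rng.normal(size=sr // 8) * np.exp(-np.linspace(0.0, 8.0, sr // 8))
    ir[0] = 1.0  # keep the direct path
    x = np.convolve(clean, ir)[: len(clean)]
    # 2) EQ: one-pole low-pass, mimicking a band-limited consumer microphone
    y = np.empty_like(x)
    acc = 0.0
    for i, v in enumerate(x):
        acc = 0.7 * acc + 0.3 * v
        y[i] = acc
    # 3) Additive background noise scaled to the requested SNR
    noise = rng.normal(size=len(y))
    sig_pow, noise_pow = np.mean(y ** 2), np.mean(noise ** 2)
    noise *= np.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10)))
    return y + noise
```

Pairing the degraded output with the original clean signal gives supervised training data for a restoration model.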
1 code implementation • 30 Oct 2020 • Minz Won, Sergio Oramas, Oriol Nieto, Fabien Gouyon, Xavier Serra
In this paper, we investigate three ideas to successfully introduce multimodal metric learning for tag-based music retrieval: elaborate triplet sampling, acoustic and cultural music information, and domain-specific word embeddings.
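Triplet sampling, the first of the three ideas above, trains a metric space with a hinge loss that pulls an anchor toward a positive example and pushes it away from a negative one. The sketch below is a generic triplet loss for illustration, not the paper's elaborate sampling scheme; the margin value and the tag/track framing in the comments are assumptions.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on L2-normalized embeddings.

    Pulls each anchor (e.g. a tag embedding, hypothetically) toward its
    positive (e.g. a matching track) and pushes it away from its negative
    (a non-matching track) by at least `margin` in squared distance.
    All inputs: (batch, dim) arrays.
    """
    def norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    a, p, n = norm(anchor), norm(positive), norm(negative)
    d_pos = np.sum((a - p) ** 2, axis=-1)  # squared distance to positive
    d_neg = np.sum((a - n) ** 2, axis=-1)  # squared distance to negative
    return np.maximum(0.0, d_pos - d_neg + margin).mean()
```

The loss is zero once every negative is farther than its positive by the margin, so training effort concentrates on hard or mis-ordered triplets, which is why the choice of sampling strategy matters.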
1 code implementation • 22 Oct 2020 • Filip Korzeniowski, Oriol Nieto, Matthew McCallum, Minz Won, Sergio Oramas, Erik Schmidt
The mood of a song is a highly relevant feature for exploration and recommendation in large collections of music.
no code implementations • 9 Feb 2018 • Samaneh Ebrahimi, Hossein Vahabi, Matthew Prockup, Oriol Nieto
On these platforms, which tend to host tens of thousands of unique audio advertisements (ads), providing high-quality ads ensures a better user experience and results in longer user engagement.
4 code implementations • 7 Nov 2017 • Jordi Pons, Oriol Nieto, Matthew Prockup, Erik Schmidt, Andreas Ehmann, Xavier Serra
The lack of data tends to limit the outcomes of deep learning research, particularly when dealing with end-to-end learning stacks processing raw data such as waveforms.
Sound • Audio and Speech Processing
1 code implementation • 29 Jun 2017 • Sergio Oramas, Oriol Nieto, Mohamed Sordo, Xavier Serra
Second, track embeddings are learned from the audio signal and available feedback data.