Search Results for author: Oriol Nieto

Found 16 papers, 7 papers with code

Video-Guided Foley Sound Generation with Multimodal Controls

no code implementations • 26 Nov 2024 • Ziyang Chen, Prem Seetharaman, Bryan Russell, Oriol Nieto, David Bourgin, Andrew Owens, Justin Salamon

MultiFoley also allows users to choose reference audio from sound effects (SFX) libraries or partial videos for conditioning.

Audio Generation

MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark

no code implementations • 24 Oct 2024 • S Sakshi, Utkarsh Tyagi, Sonal Kumar, Ashish Seth, Ramaneswaran Selvakumar, Oriol Nieto, Ramani Duraiswami, Sreyan Ghosh, Dinesh Manocha

The ability to comprehend audio, which includes speech, non-speech sounds, and music, is crucial for AI agents to interact effectively with the world.

ReCLAP: Improving Zero Shot Audio Classification by Describing Sounds

1 code implementation • 13 Sep 2024 • Sreyan Ghosh, Sonal Kumar, Chandra Kiran Reddy Evuru, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha

To achieve this, we first propose ReCLAP, a CLAP model trained with rewritten audio captions for improved understanding of sounds in the wild.

Audio Classification, Descriptive +2
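
The ReCLAP recipe above pairs a contrastive audio-text model with descriptive class prompts. A minimal sketch of zero-shot audio classification in that spirit, assuming hypothetical `embed_audio`/`embed_text` encoder functions that return comparable vectors (the prompts below are invented examples, not the paper's actual prompt set):

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def zero_shot_classify(audio_emb, class_prompts, embed_text):
    # Score each class by similarity between the audio embedding and
    # the text embedding of its descriptive prompt; highest score wins.
    scores = {label: cosine(audio_emb, embed_text(prompt))
              for label, prompt in class_prompts.items()}
    return max(scores, key=scores.get), scores

# Descriptive prompts in the spirit of ReCLAP's rewritten captions
# (illustrative only):
class_prompts = {
    "dog bark": "a dog barking sharply in short, repeated bursts",
    "rain": "steady rain pattering against a hard surface",
}
```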

GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities

1 code implementation • 17 Jun 2024 • Sreyan Ghosh, Sonal Kumar, Ashish Seth, Chandra Kiran Reddy Evuru, Utkarsh Tyagi, S Sakshi, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha

Next, we propose CompA-R (Instruction-Tuning for Complex Audio Reasoning), a synthetically generated instruction-tuning (IT) dataset with instructions that require the model to perform complex reasoning on the input audio.

Audio Question Answering, Instruction Following +3
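
For a sense of what an instruction-tuning pair for complex audio reasoning looks like, here is a single hypothetical example; the field names and content are invented for illustration and are not CompA-R's actual schema:

```python
# A single hypothetical instruction-tuning example (invented fields,
# not the released CompA-R schema).
example = {
    "audio": "clip_00042.wav",
    "instruction": "A dog barks, then a car door slams and the barking "
                   "stops. What most likely made the dog stop?",
    "response": "The barking ends right after the door slam, suggesting "
                "the dog reacted to the person getting into the car.",
}
```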

Visual Description Grounding Reduces Hallucinations and Boosts Reasoning in LVLMs

1 code implementation • 24 May 2024 • Sreyan Ghosh, Chandra Kiran Reddy Evuru, Sonal Kumar, Utkarsh Tyagi, Oriol Nieto, Zeyu Jin, Dinesh Manocha

To address this gap, we introduce Visual Description Grounded Decoding (VDGD), a simple, robust, and training-free method designed to enhance visual perception and improve reasoning capabilities in LVLMs.

Hallucination, Response Generation +1

CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models

no code implementations • 12 Oct 2023 • Sreyan Ghosh, Ashish Seth, Sonal Kumar, Utkarsh Tyagi, Chandra Kiran Evuru, S. Ramaneswaran, S. Sakshi, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha

In this paper, we propose CompA, a collection of two expert-annotated benchmarks with a majority of real-world audio samples, to evaluate compositional reasoning in ALMs.

Attribute, Audio Classification +1

Bridging High-Quality Audio and Video via Language for Sound Effects Retrieval from Visual Queries

no code implementations • 17 Aug 2023 • Julia Wilkins, Justin Salamon, Magdalena Fuentes, Juan Pablo Bello, Oriol Nieto

We show that our system, trained using our automatic data curation pipeline, significantly outperforms baselines trained on in-the-wild data on the task of HQ SFX retrieval for video.

Contrastive Learning, Retrieval

Efficient Spoken Language Recognition via Multilabel Classification

no code implementations • 2 Jun 2023 • Oriol Nieto, Zeyu Jin, Franck Dernoncourt, Justin Salamon

Spoken language recognition (SLR) is the task of automatically identifying the language present in a speech signal.

Classification
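
Framing language ID as multilabel classification means one sigmoid output per language rather than a single softmax pick, so mixed or code-switched speech can activate several labels at once. A minimal sketch, assuming a generic 512-dimensional speech embedding from some upstream encoder (all sizes are illustrative):

```python
import torch
import torch.nn as nn

class MultilabelLanguageHead(nn.Module):
    """Minimal multilabel head: one logit per language, trained with a
    per-label sigmoid so several languages can be active at once."""
    def __init__(self, embed_dim: int, num_languages: int):
        super().__init__()
        self.fc = nn.Linear(embed_dim, num_languages)

    def forward(self, speech_embedding):
        return self.fc(speech_embedding)  # raw logits, one per language

head = MultilabelLanguageHead(embed_dim=512, num_languages=40)
logits = head(torch.randn(8, 512))                     # batch of 8 utterances
loss = nn.BCEWithLogitsLoss()(logits, torch.zeros(8, 40))  # multilabel targets
```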

Language-Guided Audio-Visual Source Separation via Trimodal Consistency

no code implementations • CVPR 2023 • Reuben Tan, Arijit Ray, Andrea Burns, Bryan A. Plummer, Justin Salamon, Oriol Nieto, Bryan Russell, Kate Saenko

We propose a self-supervised approach for learning to perform audio source separation in videos based on natural language queries, using only unlabeled video and audio pairs as training data.

Audio Source Separation, Natural Language Queries

Music Enhancement via Image Translation and Vocoding

no code implementations • 28 Apr 2022 • Nikhil Kandpal, Oriol Nieto, Zeyu Jin

Consumer-grade music recordings such as those captured by mobile devices typically contain distortions in the form of background noise, reverb, and microphone-induced EQ.

Image-to-Image Translation, Translation
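
The title describes a two-stage pipeline: treat the mel spectrogram as an image, translate the degraded version to a clean one, then vocode back to a waveform. A sketch of that data flow, where `to_mel`, `translate`, and `vocode` are hypothetical callables standing in for an STFT front-end, an image-to-image model, and a neural vocoder:

```python
def enhance(noisy_audio, to_mel, translate, vocode):
    # Stage 1: waveform -> mel-spectrogram "image".
    noisy_mel = to_mel(noisy_audio)
    # Stage 2: image translation strips noise, reverb, and EQ coloring.
    clean_mel = translate(noisy_mel)
    # Stage 3: a vocoder synthesizes the enhanced waveform from the mel.
    return vocode(clean_mel)
```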

Multimodal Metric Learning for Tag-based Music Retrieval

1 code implementation • 30 Oct 2020 • Minz Won, Sergio Oramas, Oriol Nieto, Fabien Gouyon, Xavier Serra

In this paper, we investigate three ideas to successfully introduce multimodal metric learning for tag-based music retrieval: elaborate triplet sampling, acoustic and cultural music information, and domain-specific word embeddings.

Cross-Modal Retrieval, Metric Learning +5
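
The snippet's first idea, elaborate triplet sampling, plugs into a standard triplet-margin objective: pull a tag embedding toward the audio embeddings of matching tracks and push it away from non-matching ones. A minimal PyTorch setup with illustrative dimensions and margin; the paper's contribution is how triplets are sampled, not the loss itself:

```python
import torch
import torch.nn as nn

# Standard triplet margin loss; margin and embedding size are illustrative.
triplet = nn.TripletMarginLoss(margin=0.4, p=2)

anchor   = torch.randn(16, 128)  # e.g., tag/word embeddings
positive = torch.randn(16, 128)  # audio embeddings of matching tracks
negative = torch.randn(16, 128)  # audio embeddings of non-matching tracks
loss = triplet(anchor, positive, negative)
```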

Mood Classification Using Listening Data

1 code implementation • 22 Oct 2020 • Filip Korzeniowski, Oriol Nieto, Matthew McCallum, Minz Won, Sergio Oramas, Erik Schmidt

The mood of a song is a highly relevant feature for exploration and recommendation in large collections of music.

Classification, General Classification

Predicting Audio Advertisement Quality

no code implementations • 9 Feb 2018 • Samaneh Ebrahimi, Hossein Vahabi, Matthew Prockup, Oriol Nieto

On these platforms, which tend to host tens of thousands of unique audio advertisements (ads), providing high quality ads ensures a better user experience and results in longer user engagement.

Rhythm

End-to-end learning for music audio tagging at scale

4 code implementations • 7 Nov 2017 • Jordi Pons, Oriol Nieto, Matthew Prockup, Erik Schmidt, Andreas Ehmann, Xavier Serra

The lack of data tends to limit the outcomes of deep learning research, particularly when dealing with end-to-end learning stacks processing raw data such as waveforms.

Sound, Audio and Speech Processing
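
An end-to-end stack here means the model ingests raw waveforms rather than hand-crafted spectrogram features. A toy 1D-convolutional tagger to illustrate the idea, far shallower than the waveform and spectrogram architectures the paper actually compares:

```python
import torch
import torch.nn as nn

class TinyWaveformTagger(nn.Module):
    """Toy waveform front-end for multi-label music tagging."""
    def __init__(self, num_tags: int = 50):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=9, stride=4), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=9, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(64, num_tags),
        )

    def forward(self, wav):      # wav: (batch, 1, samples)
        return self.net(wav)     # one logit per tag (multi-label)

tagger = TinyWaveformTagger()
logits = tagger(torch.randn(2, 1, 16000))  # two 1-second clips at 16 kHz
```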

A Deep Multimodal Approach for Cold-start Music Recommendation

1 code implementation • 29 Jun 2017 • Sergio Oramas, Oriol Nieto, Mohamed Sordo, Xavier Serra

Second, track embeddings are learned from the audio signal and available feedback data.

Music Recommendation
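
A common way to realize that second step, and one plausible reading of the snippet (a sketch, not necessarily the paper's exact model): regress from audio-derived features into the collaborative-filtering embedding space learned from feedback, so brand-new tracks with no plays can still be placed near similar music:

```python
import torch
import torch.nn as nn

# Map audio features into a collaborative-filtering (CF) embedding space
# learned from feedback data; all dimensions below are illustrative.
audio_to_cf = nn.Sequential(
    nn.Linear(1024, 256), nn.ReLU(),
    nn.Linear(256, 64),            # 64-d CF space
)

audio_feats = torch.randn(32, 1024)   # audio features for 32 tracks
cf_targets  = torch.randn(32, 64)     # CF embeddings from feedback data
loss = nn.MSELoss()(audio_to_cf(audio_feats), cf_targets)
```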
