Search Results for author: Justin Salamon

Found 23 papers, 9 papers with code

Video-Guided Foley Sound Generation with Multimodal Controls

no code implementations26 Nov 2024 Ziyang Chen, Prem Seetharaman, Bryan Russell, Oriol Nieto, David Bourgin, Andrew Owens, Justin Salamon

MultiFoley also allows users to choose reference audio from sound effects (SFX) libraries or partial videos for conditioning.

Audio Generation

Bridging High-Quality Audio and Video via Language for Sound Effects Retrieval from Visual Queries

no code implementations17 Aug 2023 Julia Wilkins, Justin Salamon, Magdalena Fuentes, Juan Pablo Bello, Oriol Nieto

We show that our system, trained using our automatic data curation pipeline, significantly outperforms baselines trained on in-the-wild data on the task of HQ SFX retrieval for video.

Contrastive Learning Retrieval

Language-Guided Music Recommendation for Video via Prompt Analogies

no code implementations CVPR 2023 Daniel McKee, Justin Salamon, Josef Sivic, Bryan Russell

A key challenge of this problem setting is that existing music video datasets provide the needed (video, music) training pairs, but lack text descriptions of the music.

4k Language Modelling +2

Efficient Spoken Language Recognition via Multilabel Classification

no code implementations2 Jun 2023 Oriol Nieto, Zeyu Jin, Franck Dernoncourt, Justin Salamon

Spoken language recognition (SLR) is the task of automatically identifying the language present in a speech signal.

Classification
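
The multilabel framing in the title treats each candidate language as an independent binary decision rather than a single softmax choice, so a clip containing more than one language can activate several outputs at once. A minimal sketch of that framing; the layer sizes and pooled utterance features are illustrative placeholders, not the paper's architecture:

import torch
import torch.nn as nn

NUM_LANGUAGES = 10   # hypothetical set of candidate languages
EMBED_DIM = 40       # hypothetical pooled utterance feature size

model = nn.Sequential(
    nn.Linear(EMBED_DIM, 128),
    nn.ReLU(),
    nn.Linear(128, NUM_LANGUAGES),  # one logit per language, no softmax
)

# Multilabel loss: each language is an independent binary decision.
criterion = nn.BCEWithLogitsLoss()

features = torch.randn(8, EMBED_DIM)   # batch of pooled utterance features
targets = torch.zeros(8, NUM_LANGUAGES)
targets[:, 3] = 1.0                    # e.g., language index 3 is present

loss = criterion(model(features), targets)
loss.backward()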

Conditional Generation of Audio from Video via Foley Analogies

1 code implementation CVPR 2023 Yuexi Du, Ziyang Chen, Justin Salamon, Bryan Russell, Andrew Owens

Second, we propose a model for generating a soundtrack for a silent input video, given a user-supplied example that specifies what the video should "sound like".

Language-Guided Audio-Visual Source Separation via Trimodal Consistency

no code implementations CVPR 2023 Reuben Tan, Arijit Ray, Andrea Burns, Bryan A. Plummer, Justin Salamon, Oriol Nieto, Bryan Russell, Kate Saenko

We propose a self-supervised approach for learning to perform audio source separation in videos based on natural language queries, using only unlabeled video and audio pairs as training data.

Audio Source Separation Natural Language Queries

It's Time for Artistic Correspondence in Music and Video

no code implementations CVPR 2022 Didac Suris, Carl Vondrick, Bryan Russell, Justin Salamon

In order to capture the high-level concepts that are required to solve the task, we propose modeling the long-term temporal context of both the video and the music signals, using Transformer networks for each modality.

Retrieval
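
As a rough illustration of the described setup, the sketch below uses one Transformer encoder per modality over long feature sequences, pools each into a shared embedding space, and trains with a symmetric contrastive loss for retrieval. Dimensions, depths, and the loss choice are assumptions for illustration, not the paper's configuration:

import torch
import torch.nn as nn
import torch.nn.functional as F

D = 256  # placeholder embedding size

def make_encoder():
    layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=2)

video_enc, music_enc = make_encoder(), make_encoder()

video_frames = torch.randn(8, 120, D)  # (batch, time, feature) video features
music_frames = torch.randn(8, 300, D)  # (batch, time, feature) music features

# Encode long-term temporal context per modality, then pool over time.
v = F.normalize(video_enc(video_frames).mean(dim=1), dim=-1)
m = F.normalize(music_enc(music_frames).mean(dim=1), dim=-1)

# Symmetric InfoNCE: matching (video, music) pairs in the batch should
# score higher than mismatched pairs.
logits = v @ m.t() / 0.07
labels = torch.arange(8)
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2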

Filler Word Detection and Classification: A Dataset and Benchmark

1 code implementation28 Mar 2022 Ge Zhu, Juan-Pablo Caceres, Justin Salamon

In this work, we present a novel speech dataset, PodcastFillers, with 35K annotated filler words and 50K annotations of other sounds that commonly occur in podcasts such as breaths, laughter, and word repetitions.

Classification Keyword Spotting +1

Emotion Embedding Spaces for Matching Music to Stories

1 code implementation26 Nov 2021 Minz Won, Justin Salamon, Nicholas J. Bryan, Gautham J. Mysore, Xavier Serra

Content creators often use music to enhance their stories, as it can be a powerful tool to convey emotion.

Cross-Modal Retrieval Metric Learning +1

Soundata: A Python library for reproducible use of audio datasets

no code implementations26 Sep 2021 Magdalena Fuentes, Justin Salamon, Pablo Zinemanas, Martín Rocamora, Genís Paja, Irán R. Román, Marius Miron, Xavier Serra, Juan Pablo Bello

Soundata is a Python library for loading and working with audio datasets in a standardized way, removing the need for writing custom loaders in every project, and improving reproducibility by providing tools to validate data against a canonical version.
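
A usage sketch following the library's documented pattern; 'urbansound8k' is one of the supported dataset loaders:

import soundata

# Initialize a loader by dataset name, fetch the data, and validate the
# local copy against the canonical index.
dataset = soundata.initialize('urbansound8k')
dataset.download()   # fetch audio and annotations (skips existing files)
dataset.validate()   # checksum local files against the canonical version

clip = dataset.choice_clip()  # grab a random clip for inspection
audio, sr = clip.audio        # waveform and sample rate
print(clip.clip_id, sr, clip.tags)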

What's All the FUSS About Free Universal Sound Separation Data?

no code implementations2 Nov 2020 Scott Wisdom, Hakan Erdogan, Daniel Ellis, Romain Serizel, Nicolas Turpault, Eduardo Fonseca, Justin Salamon, Prem Seetharaman, John Hershey

We introduce the Free Universal Sound Separation (FUSS) dataset, a new corpus for experiments in separating mixtures of an unknown number of sounds from an open domain of sound types.

Data Augmentation

SONYC-UST-V2: An Urban Sound Tagging Dataset with Spatiotemporal Context

no code implementations11 Sep 2020 Mark Cartwright, Jason Cramer, Ana Elisa Mendez Mendez, Yu Wang, Ho-Hsiang Wu, Vincent Lostanlen, Magdalena Fuentes, Graham Dove, Charlie Mydlarz, Justin Salamon, Oded Nov, Juan Pablo Bello

In this article, we describe our data collection procedure and propose evaluation metrics for multilabel classification of urban sound tags.
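
The paper proposes its own evaluation protocol; as an illustration of multilabel tagging metrics in this spirit (the exact metrics in the paper may differ), macro- and micro-averaged AUPRC can be computed with scikit-learn:

import numpy as np
from sklearn.metrics import average_precision_score

# One binary ground-truth column and one score column per sound tag.
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])      # (clips, tags)
y_score = np.array([[0.9, 0.2, 0.7], [0.1, 0.8, 0.3], [0.6, 0.7, 0.2]])

macro_auprc = average_precision_score(y_true, y_score, average='macro')
micro_auprc = average_precision_score(y_true, y_score, average='micro')
print(f"macro AUPRC={macro_auprc:.3f}, micro AUPRC={micro_auprc:.3f}")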

Disentangled Multidimensional Metric Learning for Music Similarity

no code implementations9 Aug 2020 Jongpil Lee, Nicholas J. Bryan, Justin Salamon, Zeyu Jin, Juhan Nam

For this task, it is typically necessary to define a similarity metric to compare one recording to another.

Metric Learning Specificity +1
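
A common way to "define a similarity metric" in this setting is to learn an embedding space with a triplet loss, pulling similar recordings together and pushing dissimilar ones apart. The following is a generic sketch of that idea, not the paper's disentangled multidimensional formulation; the embedding size and margin are illustrative:

import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Anchor should sit closer to the positive than to the negative
    # by at least the margin.
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()

emb = lambda n: F.normalize(torch.randn(n, 128), dim=-1)  # stand-in embeddings
loss = triplet_loss(emb(16), emb(16), emb(16))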

Metric Learning vs Classification for Disentangled Music Representation Learning

no code implementations9 Aug 2020 Jongpil Lee, Nicholas J. Bryan, Justin Salamon, Zeyu Jin, Juhan Nam

For this, we (1) outline past work on the relationship between metric learning and classification, (2) extend this relationship to multi-label data by exploring three different learning approaches and their disentangled versions, and (3) evaluate all models on four tasks (training time, similarity retrieval, auto-tagging, and triplet prediction).

Classification Disentanglement +7

Controllable Neural Prosody Synthesis

no code implementations7 Aug 2020 Max Morrison, Zeyu Jin, Justin Salamon, Nicholas J. Bryan, Gautham J. Mysore

Speech synthesis has recently seen significant improvements in fidelity, driven by the advent of neural vocoders and neural prosody generators.

Speech Synthesis

Telling Left from Right: Learning Spatial Correspondence of Sight and Sound

no code implementations CVPR 2020 Karren Yang, Bryan Russell, Justin Salamon

Self-supervised audio-visual learning aims to capture useful representations of video by leveraging correspondences between visual and audio inputs.

audio-visual learning
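
One plausible reading of the title's pretext task: randomly swap the stereo channels and train a classifier to detect whether sight and sound still spatially agree. The sketch below shows only the example construction and is an assumption about the setup, not the paper's pipeline:

import torch

def make_example(stereo_audio, video_frames):
    """stereo_audio: (2, samples); video_frames: (T, C, H, W)."""
    flipped = torch.rand(()) < 0.5
    if flipped:
        stereo_audio = stereo_audio.flip(0)  # swap left/right channels
    label = flipped.long()                   # 1 = channels swapped
    return stereo_audio, video_frames, label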

Sound event detection in domestic environments with weakly labeled data and soundscape synthesis

1 code implementation26 Oct 2019 Nicolas Turpault, Romain Serizel, Ankit Shah, Justin Salamon

This paper presents Task 4 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 challenge and provides a first analysis of the challenge results.

Event Detection Sound Event Detection

Robust sound event detection in bioacoustic sensor networks

1 code implementation20 May 2019 Vincent Lostanlen, Justin Salamon, Andrew Farnsworth, Steve Kelling, Juan Pablo Bello

As a case study, we consider the problem of detecting avian flight calls from a ten-hour recording of nocturnal bird migration, recorded by a network of six ARUs in the presence of heterogeneous background noise.

Data Augmentation Event Detection +1

Adaptive pooling operators for weakly labeled sound event detection

2 code implementations26 Apr 2018 Brian McFee, Justin Salamon, Juan Pablo Bello

In this work, we treat SED as a multiple instance learning (MIL) problem, where training labels are static over a short excerpt, indicating the presence or absence of sound sources but not their temporal locality.

Event Detection Multiple Instance Learning +2
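
Under the MIL framing, frame-level predictions must be aggregated into a clip-level prediction. A sketch of an auto-pool-style operator in the spirit of this paper: a softmax-weighted average whose learnable sharpness interpolates between mean pooling (alpha = 0) and max pooling (alpha -> infinity):

import torch
import torch.nn as nn

class AutoPool(nn.Module):
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(1))  # start at mean pooling

    def forward(self, frame_probs):
        # frame_probs: (batch, time) per-frame event probabilities.
        # Higher-probability frames get more weight as alpha grows.
        weights = torch.softmax(self.alpha * frame_probs, dim=1)
        return (weights * frame_probs).sum(dim=1)  # (batch,) clip-level prob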

CREPE: A Convolutional Representation for Pitch Estimation

1 code implementation17 Feb 2018 Jong Wook Kim, Justin Salamon, Peter Li, Juan Pablo Bello

To date, the best performing techniques, such as the pYIN algorithm, are based on a combination of DSP pipelines and heuristics.

Information Retrieval Music Information Retrieval +1
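
CREPE is released as an open-source package (pip install crepe); a usage sketch following its documented API, where 'vocal.wav' is a placeholder path:

import crepe
from scipy.io import wavfile

sr, audio = wavfile.read('vocal.wav')
# Returns frame times, f0 estimates in Hz, per-frame confidence, and the
# raw network activation; viterbi=True smooths the pitch trajectory.
time, frequency, confidence, activation = crepe.predict(audio, sr, viterbi=True)

for t, f0, c in zip(time[:5], frequency[:5], confidence[:5]):
    print(f"{t:.2f}s  {f0:.1f} Hz  (confidence {c:.2f})")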

Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification

5 code implementations IEEE Signal Processing Letters 2017 Justin Salamon, Juan Pablo Bello

We show that the improved performance stems from the combination of a deep, high-capacity model and an augmented training set: this combination outperforms both the proposed CNN without augmentation and a "shallow" dictionary learning model with augmentation.

Data Augmentation Dictionary Learning +3
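
The augmentations in this line of work include time stretching, pitch shifting, dynamic range compression, and added background noise (the paper applies them via the authors' muda library). A rough librosa-based sketch of a few of them, with 'siren.wav' as a placeholder path:

import librosa
import numpy as np

y, sr = librosa.load('siren.wav', sr=44100)

stretched = librosa.effects.time_stretch(y, rate=1.2)       # faster playback
shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)  # up 2 semitones
noisy = y + 0.005 * np.random.randn(len(y))                 # additive noise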
