1 code implementation • 1 Feb 2024 • Soham Deshmukh, Dareen Alharthi, Benjamin Elizalde, Hannes Gamper, Mahmoud Al Ismail, Rita Singh, Bhiksha Raj, Huaming Wang
Here, we exploit this capability of audio-language models and introduce PAM, a no-reference metric for assessing audio quality across different audio processing tasks.
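A minimal sketch of how such a prompt-based, no-reference score could be computed, assuming audio and prompt embeddings from a CLAP-style audio-language model are already in hand; the prompt pair, function names, and temperature are illustrative assumptions, not PAM's actual implementation.

```python
import numpy as np

def quality_score(audio_emb: np.ndarray,
                  good_prompt_emb: np.ndarray,
                  bad_prompt_emb: np.ndarray,
                  temperature: float = 0.01) -> float:
    """Softmax over cosine similarities to a 'clean' vs. 'noisy' prompt pair."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    sims = np.array([cos(audio_emb, good_prompt_emb),
                     cos(audio_emb, bad_prompt_emb)]) / temperature
    probs = np.exp(sims - sims.max())
    probs /= probs.sum()
    return float(probs[0])  # probability mass on the "good quality" prompt

# Random vectors stand in for real model embeddings:
rng = np.random.default_rng(0)
a, g, b = rng.normal(size=(3, 512))
print(quality_score(a, g, b))
```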
no code implementations • 3 Oct 2023 • Hira Dhamyal, Benjamin Elizalde, Soham Deshmukh, Huaming Wang, Bhiksha Raj, Rita Singh
In this work, we address the challenge of automatically generating these prompts and training a model to better learn emotion representations from audio and prompt pairs.
1 code implementation • 14 Sep 2023 • Soham Deshmukh, Benjamin Elizalde, Dimitra Emmanouilidou, Bhiksha Raj, Rita Singh, Huaming Wang
During inference, the text encoder is replaced with the pretrained CLAP audio encoder.
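A hedged sketch of that swap, assuming a CLAP-style model whose text and audio encoders map into a shared embedding space, so a caption decoder trained on text embeddings can accept audio embeddings at inference; the class and the linear stand-ins below are hypothetical, not the paper's code.

```python
import torch
import torch.nn as nn

class CaptionPrefixModel(nn.Module):
    def __init__(self, text_encoder: nn.Module, audio_encoder: nn.Module,
                 decoder: nn.Module):
        super().__init__()
        self.text_encoder = text_encoder    # used only during training
        self.audio_encoder = audio_encoder  # pretrained CLAP audio tower
        self.decoder = decoder              # decoder conditioned on the embedding

    def forward(self, text=None, audio=None, train: bool = True):
        if train:
            cond = self.text_encoder(text)
        else:
            # The shared space lets the audio embedding stand in for the
            # text embedding the decoder was trained against.
            cond = self.audio_encoder(audio)
        return self.decoder(cond)

# Toy instantiation with linear stand-ins for the real encoders/decoder:
m = CaptionPrefixModel(nn.Linear(16, 8), nn.Linear(32, 8), nn.Linear(8, 4))
print(m(text=torch.randn(2, 16), train=True).shape)    # torch.Size([2, 4])
print(m(audio=torch.randn(2, 32), train=False).shape)  # torch.Size([2, 4])
```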
1 code implementation • NeurIPS 2023 • Soham Deshmukh, Benjamin Elizalde, Rita Singh, Huaming Wang
We introduce Pengi, a novel Audio Language Model that leverages Transfer Learning by framing all audio tasks as text-generation tasks.
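A sketch of the audio-as-prefix idea behind framing audio tasks as text generation: an audio embedding is mapped to a few "prefix" embeddings prepended to the language model's token embeddings. The MLP mapper and the dimensions are assumptions, not Pengi's exact architecture.

```python
import torch
import torch.nn as nn

class AudioPrefixMapper(nn.Module):
    def __init__(self, audio_dim: int, lm_dim: int, prefix_len: int):
        super().__init__()
        self.prefix_len = prefix_len
        self.net = nn.Sequential(nn.Linear(audio_dim, lm_dim * prefix_len),
                                 nn.Tanh())

    def forward(self, audio_emb: torch.Tensor) -> torch.Tensor:
        # (batch, audio_dim) -> (batch, prefix_len, lm_dim)
        return self.net(audio_emb).view(audio_emb.size(0), self.prefix_len, -1)

mapper = AudioPrefixMapper(audio_dim=512, lm_dim=768, prefix_len=8)
prefix = mapper(torch.randn(4, 512))               # from a frozen audio encoder
token_embs = torch.randn(4, 20, 768)               # embedded text-prompt tokens
lm_input = torch.cat([prefix, token_embs], dim=1)  # fed to a causal LM
print(lm_input.shape)  # torch.Size([4, 28, 768])
```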
no code implementations • 20 Feb 2023 • Laurie M. Heller, Benjamin Elizalde, Bhiksha Raj, Soham Deshmukh
Machine Listening, as usually formalized, attempts to perform a task that is, from our perspective, fundamentally human: one that humans can perform, and routinely do.
no code implementations • 14 Nov 2022 • Hira Dhamyal, Benjamin Elizalde, Soham Deshmukh, Huaming Wang, Bhiksha Raj, Rita Singh
We investigate how the model can learn to associate the audio with the descriptions, resulting in improved performance on Speech Emotion Recognition and Speech Audio Retrieval.
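A minimal sketch of the kind of contrastive objective that associates audio clips with textual descriptions (a CLAP-style symmetric InfoNCE loss); this is a generic recipe, not the paper's exact training code.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, text_emb, temperature: float = 0.07):
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature        # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0))     # matched pairs sit on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

print(contrastive_loss(torch.randn(8, 512), torch.randn(8, 512)).item())
```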
1 code implementation • 28 Sep 2022 • Soham Deshmukh, Benjamin Elizalde, Huaming Wang
In this work, we propose a new collection of web audio-text pairs and a new framework for retrieval.
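A sketch of text-to-audio retrieval by cosine ranking, assuming audio and query embeddings already live in a shared space learned from such audio-text pairs; the sizes and names are placeholders.

```python
import numpy as np

def retrieve(query_emb: np.ndarray, audio_embs: np.ndarray, top_k: int = 5):
    q = query_emb / np.linalg.norm(query_emb)
    a = audio_embs / np.linalg.norm(audio_embs, axis=1, keepdims=True)
    sims = a @ q                       # cosine similarity to every clip
    order = np.argsort(-sims)[:top_k]  # highest-similarity clips first
    return order, sims[order]

rng = np.random.default_rng(1)
idx, scores = retrieve(rng.normal(size=256), rng.normal(size=(1000, 256)))
print(idx, scores)
```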
no code implementations • 20 Feb 2020 • Jianyu Fan, Eric Nichols, Daniel Tompkins, Ana Elisa Mendez Mendez, Benjamin Elizalde, Philippe Pasquier
State-of-the-art sound event retrieval models have focused on single-label audio recordings, in which only one sound event occurs, rather than on multi-label recordings, in which multiple sound events occur.
no code implementations • NIPS Workshop on Machine Learning for Audio 2018 • Benjamin Elizalde, Rohan Badlani, Ankit Shah, Anurag Kumar, Bhiksha Raj
Sounds are essential to how humans perceive and interact with the world.
no code implementations • 2 Nov 2017 • Rohan Badlani, Ankit Shah, Benjamin Elizalde, Anurag Kumar, Bhiksha Raj
The framework crawls videos using search queries corresponding to 78 sound event labels drawn from three datasets.
no code implementations • 20 Sep 2016 • Benjamin Elizalde, Ankit Shah, Siddharth Dalmia, Min Hun Lee, Rohan Badlani, Anurag Kumar, Bhiksha Raj, Ian Lane
The audio event detectors are trained on the labeled audio and run on the unlabeled audio downloaded from YouTube.
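A sketch of that labeled-to-unlabeled step: train a detector on labeled features, score the unlabeled clips, and keep only confident detections. The synthetic features, classifier choice, and 0.9 threshold are assumptions, not the paper's pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_labeled = rng.normal(size=(200, 64))     # e.g. pooled spectrogram features
y_labeled = rng.integers(0, 2, size=200)   # event present / absent
X_unlabeled = rng.normal(size=(1000, 64))  # crawled, unlabeled clips

detector = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
probs = detector.predict_proba(X_unlabeled)[:, 1]
confident = probs > 0.9                    # keep high-confidence detections only
print(f"{confident.sum()} of {len(probs)} clips auto-labeled with the event")
```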
no code implementations • 13 Jul 2016 • Sebastian Sager, Benjamin Elizalde, Damian Borth, Christian Schulze, Bhiksha Raj, Ian Lane
One contribution is the previously unavailable documentation of the challenges and implications of collecting audio recordings with these types of labels.
no code implementations • 12 Jul 2016 • Benjamin Elizalde, Guan-Lin Chao, Ming Zeng, Ian Lane
In particular, we present a method for computing and using semantic acoustic features to perform city identification; the features themselves provide semantic evidence for the identification.
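A sketch of the two-stage idea, assuming per-clip sound-event posteriors act as the semantic feature vector for a city classifier; the data and class counts are synthetic, and reading off the classifier's largest weights stands in for the paper's semantic evidence.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_clips, n_events, n_cities = 500, 20, 10

# Stage 1 (assumed done elsewhere): a sound-event classifier yields per-clip
# event posteriors; random rows normalized to sum to 1 stand in for them.
raw = rng.random(size=(n_clips, n_events))
event_posteriors = raw / raw.sum(axis=1, keepdims=True)

# Stage 2: city identification on top of the semantic features.
cities = rng.integers(0, n_cities, size=n_clips)
clf = LogisticRegression(max_iter=1000).fit(event_posteriors, cities)

# Each city's most positively weighted event hints at why it was identified.
print("most indicative event index per city:", np.argmax(clf.coef_, axis=1))
```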
no code implementations • 13 Mar 2015 • Julia Bernd, Damian Borth, Benjamin Elizalde, Gerald Friedland, Heather Gallagher, Luke Gottlieb, Adam Janin, Sara Karabashlieva, Jocelyn Takahashi, Jennifer Won
The YLI Multimedia Event Detection corpus is a public-domain index of videos with annotations and computed features, specialized for research in multimedia event detection (MED), i.e., automatically identifying what's happening in a video by analyzing the audio and visual content.
2 code implementations • 5 Mar 2015 • Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, Li-Jia Li
We present the Yahoo Flickr Creative Commons 100 Million Dataset (YFCC100M), the largest public multimedia collection that has ever been released.