Search Results for author: Aren Jansen

Found 24 papers, 7 papers with code

Towards Learning a Universal Non-Semantic Representation of Speech

1 code implementation · 25 Feb 2020 · Joel Shor, Aren Jansen, Ronnie Maor, Oran Lang, Omry Tuval, Felix de Chaumont Quitry, Marco Tagliasacchi, Ira Shavitt, Dotan Emanuel, Yinnon Haviv

The ultimate goal of transfer learning is to reduce labeled data requirements by exploiting a pre-existing embedding model trained on different datasets or tasks.

Transfer Learning
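The recipe the abstract points to is simple in outline: compute embeddings with a frozen pretrained model, then fit a small classifier on the labeled downstream data. Below is a minimal Python sketch of that pattern; `pretrained_embed` and the toy data are placeholders, not the paper's actual model or benchmark.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def pretrained_embed(waveforms):
    """Placeholder for a frozen, pretrained speech embedding model.

    Stands in for a TRILL-style network; a fixed random projection
    keeps the sketch runnable end to end.
    """
    projection = np.random.default_rng(42).normal(size=(waveforms.shape[1], 128))
    return waveforms @ projection

# Toy labeled downstream task: 200 one-second clips, 2 classes.
waveforms = rng.normal(size=(200, 16000))
labels = rng.integers(0, 2, size=200)

# Transfer learning: embeddings come from the frozen model; only a small
# classifier is trained on the (limited) labeled data.
embeddings = pretrained_embed(waveforms)
clf = LogisticRegression(max_iter=1000).fit(embeddings[:150], labels[:150])
print("held-out accuracy:", clf.score(embeddings[150:], labels[150:]))
```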

MuLan: A Joint Embedding of Music Audio and Natural Language

1 code implementation · 26 Aug 2022 · Qingqing Huang, Aren Jansen, Joonseok Lee, Ravi Ganti, Judith Yue Li, Daniel P. W. Ellis

Music tagging and content-based retrieval systems have traditionally been constructed using pre-defined ontologies covering a rigid set of music attributes or text queries.

Cross-Modal Retrieval · Music Tagging +2
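MuLan trains its two encoders with a CLIP-style contrastive objective that aligns paired music and text embeddings. A hedged numpy sketch of such a symmetric InfoNCE loss is below; the encoders, batch construction, and temperature value are assumptions, not the paper's exact setup.

```python
import numpy as np

def infonce_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over a batch of paired
    audio/text embeddings, as in CLIP-style joint-embedding training."""
    # L2-normalize so the dot product is cosine similarity.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature          # (batch, batch) similarity matrix
    labels = np.arange(len(a))              # i-th audio clip matches i-th text

    def cross_entropy(logits, labels):
        logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -logp[np.arange(len(labels)), labels].mean()

    # Average the audio->text and text->audio directions.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

batch = np.random.default_rng(0).normal(size=(8, 128))
print(infonce_loss(batch + 0.1, batch))  # near-aligned pairs -> low loss
```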

Attention Bottlenecks for Multimodal Fusion

1 code implementation · NeurIPS 2021 · Arsha Nagrani, Shan Yang, Anurag Arnab, Aren Jansen, Cordelia Schmid, Chen Sun

Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio.

Action Classification · Action Recognition +2
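The paper's mechanism restricts cross-modal attention to a small set of shared "bottleneck" tokens. The sketch below shows that routing in plain numpy, omitting the projections, MLPs, residual connections, and multi-head structure of a real transformer layer.

```python
import numpy as np

def attention(q, k, v):
    """Single-head scaled dot-product attention."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def bottleneck_fusion_layer(video_tokens, audio_tokens, bottleneck):
    """One fusion layer: each modality attends only over its own tokens
    plus the shared bottleneck tokens, so all cross-modal information
    must flow through the small bottleneck."""
    new_streams, new_bottlenecks = [], []
    for tokens in (video_tokens, audio_tokens):
        ctx = np.concatenate([tokens, bottleneck], axis=0)
        out = attention(ctx, ctx, ctx)
        new_streams.append(out[: len(tokens)])
        new_bottlenecks.append(out[len(tokens):])
    # Each modality writes its own bottleneck update; average the two.
    return new_streams[0], new_streams[1], sum(new_bottlenecks) / 2

rng = np.random.default_rng(0)
video, audio = rng.normal(size=(16, 64)), rng.normal(size=(8, 64))
bottleneck = rng.normal(size=(4, 64))   # only 4 shared fusion tokens
video, audio, bottleneck = bottleneck_fusion_layer(video, audio, bottleneck)
print(video.shape, audio.shape, bottleneck.shape)
```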

Self-Supervised Learning from Automatically Separated Sound Scenes

1 code implementation · 5 May 2021 · Eduardo Fonseca, Aren Jansen, Daniel P. W. Ellis, Scott Wisdom, Marco Tagliasacchi, John R. Hershey, Manoj Plakal, Shawn Hershey, R. Channing Moore, Xavier Serra

Real-world sound scenes consist of time-varying collections of sound sources, each generating characteristic sound events that are mixed together in audio recordings.

Contrastive Learning · Self-Supervised Learning

A segmental framework for fully-unsupervised large-vocabulary speech recognition

5 code implementations · 22 Jun 2016 · Herman Kamper, Aren Jansen, Sharon Goldwater

We also show that the discovered clusters can be made less speaker- and gender-specific by using an unsupervised autoencoder-like feature extractor to learn better frame-level features (prior to embedding).

Language Modelling · Speech Recognition +1
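The "autoencoder-like feature extractor" mentioned above can be sketched as follows, assuming PyTorch and random stand-in frames; note that the paper's variant is a correspondence autoencoder that reconstructs aligned frames from discovered word pairs rather than the input frame itself.

```python
import torch
from torch import nn

# Sketch of an autoencoder-like frame feature extractor: encode each
# acoustic frame to a low-dimensional code and reconstruct it; the
# bottleneck codes then replace the raw frames prior to embedding.
frame_dim, code_dim = 39, 13   # e.g. MFCCs+deltas in, bottleneck out
encoder = nn.Sequential(nn.Linear(frame_dim, 100), nn.Tanh(), nn.Linear(100, code_dim))
decoder = nn.Sequential(nn.Linear(code_dim, 100), nn.Tanh(), nn.Linear(100, frame_dim))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

frames = torch.randn(1024, frame_dim)   # stand-in for real acoustic frames
for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.mse_loss(decoder(encoder(frames)), frames)
    loss.backward()
    opt.step()

features = encoder(frames)  # bottleneck codes used as frame-level features
print(features.shape)
```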

Unsupervised Learning of Semantic Audio Representations

no code implementations · 6 Nov 2017 · Aren Jansen, Manoj Plakal, Ratheet Pandya, Daniel P. W. Ellis, Shawn Hershey, Jiayang Liu, R. Channing Moore, Rif A. Saurous

Even in the absence of any explicit semantic annotation, vast collections of audio recordings provide valuable information for learning the categorical structure of sounds.

Audio Classification · General Classification +1

Scalable Out-of-Sample Extension of Graph Embeddings Using Deep Neural Networks

no code implementations · 18 Aug 2015 · Aren Jansen, Gregory Sell, Vince Lyzinski

Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space.

Dimensionality Reduction · Graph Embedding +2
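The proposed fix is to learn the embedding map with a regressor: run the expensive eigendecomposition once on the training data, then train a network to predict those coordinates from raw inputs so new points can be embedded with a single forward pass. A small scikit-learn sketch (toy data, illustrative layer sizes):

```python
import numpy as np
from sklearn.manifold import SpectralEmbedding
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 20))
X_new = rng.normal(size=(10, 20))      # out-of-sample points

# Expensive step: eigendecomposition-based embedding of the training set.
Y_train = SpectralEmbedding(n_components=2).fit_transform(X_train)

# Cheap out-of-sample extension: regress embedding coordinates with a DNN,
# then apply the learned map to new points with no eigendecomposition.
net = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000).fit(X_train, Y_train)
print(net.predict(X_new))   # embedding coordinates for unseen points
```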

Unsupervised word segmentation and lexicon discovery using acoustic word embeddings

no code implementations · 9 Mar 2016 · Herman Kamper, Aren Jansen, Sharon Goldwater

In settings where only unlabelled speech data is available, speech technology needs to be developed without transcriptions, pronunciation dictionaries, or language modelling text.

Language Acquisition · Language Modelling +1

Evaluating Low-Level Speech Features Against Human Perceptual Data

no code implementations · TACL 2017 · Caitlin Richter, Naomi H. Feldman, Harini Salgado, Aren Jansen

We introduce a method for measuring the correspondence between low-level speech features and human perception, using a cognitive model of speech perception implemented directly on speech recordings.

Automatic Speech Recognition (ASR) · Representation Learning +1
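Work in this area typically scores features with minimal-pair discrimination tests: a representation matches perception if a test token lies closer to a same-category token than to a different-category one. The sketch below shows that ABX logic with a DTW distance on toy features; it is illustrative, not the paper's specific cognitive model.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m] / (n + m)   # length-normalized alignment cost

def abx_correct(A, B, X):
    """ABX trial: X belongs to A's category; the features 'pass' the
    trial if X is closer to A than to the distractor B."""
    return dtw_distance(X, A) < dtw_distance(X, B)

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 13))            # token of one category
X = A + 0.1 * rng.normal(size=A.shape)   # another token of the same category
B = rng.normal(size=(22, 13))            # token of a different category
print(abx_correct(A, B, X))              # True if features separate the categories
```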

Bridging the gap between speech technology and natural language processing: an evaluation toolbox for term discovery systems

no code implementations · LREC 2014 · Bogdan Ludusan, Maarten Versteegh, Aren Jansen, Guillaume Gravier, Xuan-Nga Cao, Mark Johnson, Emmanuel Dupoux

The unsupervised discovery of linguistic terms, from either continuous phoneme transcriptions or raw speech, has seen increasing interest in recent years from both a theoretical and a practical standpoint.

Language Acquisition

Improving Universal Sound Separation Using Sound Classification

no code implementations · 18 Nov 2019 · Efthymios Tzinis, Scott Wisdom, John R. Hershey, Aren Jansen, Daniel P. W. Ellis

Deep learning approaches have recently achieved impressive performance on both audio source separation and sound classification.

Audio Source Separation · Classification +2

Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds

no code implementations · ICLR 2021 · Efthymios Tzinis, Scott Wisdom, Aren Jansen, Shawn Hershey, Tal Remez, Daniel P. W. Ellis, John R. Hershey

For evaluation and semi-supervised experiments, we collected human labels for the presence of on-screen and off-screen sounds on a small subset of clips.

Scene Understanding

Sparse, Efficient, and Semantic Mixture Invariant Training: Taming In-the-Wild Unsupervised Sound Separation

no code implementations · 1 Jun 2021 · Scott Wisdom, Aren Jansen, Ron J. Weiss, Hakan Erdogan, John R. Hershey

The best performance is achieved using larger numbers of output sources, enabled by our efficient MixIT loss, combined with sparsity losses to prevent over-separation.
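For reference, the basic MixIT objective remixes the model's estimated sources and scores the best assignment of sources to the two input mixtures. The brute-force numpy sketch below enumerates all 2^M assignments with an MSE loss; the paper's "efficient" variant avoids this enumeration and uses SNR-based losses, so treat this only as an illustration of the objective.

```python
import itertools
import numpy as np

def mixit_loss(est_sources, mix1, mix2):
    """Mixture invariant training (MixIT) loss: partition the M estimated
    sources between the two reference mixtures and keep the best
    (minimum-error) partition."""
    m = len(est_sources)
    best = np.inf
    # Enumerate all 2^M binary assignments of sources to mixture 1 / mixture 2.
    for mask in itertools.product([0, 1], repeat=m):
        mask = np.array(mask, dtype=bool)
        remix1 = est_sources[mask].sum(axis=0)
        remix2 = est_sources[~mask].sum(axis=0)
        err = np.mean((remix1 - mix1) ** 2) + np.mean((remix2 - mix2) ** 2)
        best = min(best, err)
    return best

rng = np.random.default_rng(0)
s = rng.normal(size=(4, 16000))          # 4 estimated sources
mix1, mix2 = s[0] + s[2], s[1] + s[3]    # references; no isolated sources needed
print(mixit_loss(s, mix1, mix2))         # a perfect partition exists -> ~0 loss
```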

Universal Paralinguistic Speech Representations Using Self-Supervised Conformers

no code implementations · 9 Oct 2021 · Joel Shor, Aren Jansen, Wei Han, Daniel Park, Yu Zhang

Many speech applications require understanding aspects beyond the words being spoken, such as recognizing emotion, detecting whether the speaker is wearing a mask, or distinguishing real from synthetic speech.

Text-Driven Separation of Arbitrary Sounds

no code implementations · 12 Apr 2022 · Kevin Kilgour, Beat Gfeller, Qingqing Huang, Aren Jansen, Scott Wisdom, Marco Tagliasacchi

The second model, SoundFilter, takes a mixed source audio clip as an input and separates it based on a conditioning vector from the shared text-audio representation defined by SoundWords, making the model agnostic to the conditioning modality.
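One plausible way to realize such conditioning, shown below as a hedged PyTorch sketch, is feature-wise modulation (FiLM) of a separator's hidden activations by the conditioning vector; the layer sizes and the FiLM mechanism are illustrative assumptions, not SoundFilter's published architecture.

```python
import torch
from torch import nn

class ConditionalSeparator(nn.Module):
    """Sketch of conditioned separation: an embedding from a shared
    text-audio space (as SoundWords would provide) modulates the
    separator via feature-wise scale and shift (FiLM), so text or
    example audio can select what to extract from the mixture."""

    def __init__(self, cond_dim=128, hidden=256):
        super().__init__()
        self.encode = nn.Conv1d(1, hidden, kernel_size=16, stride=8)
        self.film = nn.Linear(cond_dim, 2 * hidden)   # -> scale and shift
        self.decode = nn.ConvTranspose1d(hidden, 1, kernel_size=16, stride=8)

    def forward(self, mixture, cond):
        h = torch.relu(self.encode(mixture))
        scale, shift = self.film(cond).chunk(2, dim=-1)
        h = h * scale.unsqueeze(-1) + shift.unsqueeze(-1)
        return self.decode(h)

model = ConditionalSeparator()
mixture = torch.randn(2, 1, 16000)   # batch of mixed audio clips
cond = torch.randn(2, 128)           # embedding of a text query or example clip
print(model(mixture, cond).shape)    # separated audio, same shape as input
```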

MAQA: A Multimodal QA Benchmark for Negation

no code implementations · 9 Jan 2023 · Judith Yue Li, Aren Jansen, Qingqing Huang, Joonseok Lee, Ravi Ganti, Dima Kuzmin

Multimodal learning can benefit from the representation power of pretrained Large Language Models (LLMs).

Negation · Question Answering

V2Meow: Meowing to the Visual Beat via Video-to-Music Generation

no code implementations · 11 May 2023 · Kun Su, Judith Yue Li, Qingqing Huang, Dima Kuzmin, Joonseok Lee, Chris Donahue, Fei Sha, Aren Jansen, Yu Wang, Mauro Verzetti, Timo I. Denk

Video-to-music generation demands both a temporally localized high-quality listening experience and globally aligned video-acoustic signatures.

Music Generation

Dataset balancing can hurt model performance

no code implementations · 30 Jun 2023 · R. Channing Moore, Daniel P. W. Ellis, Eduardo Fonseca, Shawn Hershey, Aren Jansen, Manoj Plakal

We find, however, that while balancing improves performance on the public AudioSet evaluation data, it simultaneously hurts performance on an unpublished evaluation set collected under the same conditions.
