Search Results for author: Mark Hasegawa-Johnson

Found 44 papers, 18 papers with code

Improving Self-Supervised Speech Representations by Disentangling Speakers

no code implementations 20 Apr 2022 Kaizhi Qian, Yang Zhang, Heting Gao, Junrui Ni, Cheng-I Lai, David Cox, Mark Hasegawa-Johnson, Shiyu Chang

Self-supervised learning in speech involves training a speech representation network on a large-scale unannotated speech corpus, and then applying the learned representations to downstream tasks.

Disentanglement Self-Supervised Learning

Equivariance Discovery by Learned Parameter-Sharing

1 code implementation 7 Apr 2022 Raymond A. Yeh, Yuan-Ting Hu, Mark Hasegawa-Johnson, Alexander G. Schwing

Designing equivariance as an inductive bias into deep-nets has been a prominent approach to build effective models, e.g., a convolutional neural network incorporates translation equivariance.
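The translation equivariance mentioned in this excerpt can be checked numerically. The following is a minimal numpy sketch (illustrative only, not from the paper): shifting a 1-D signal and then convolving agrees with convolving and then shifting, away from the wrap-around boundary.

```python
import numpy as np

def conv1d_valid(x, k):
    """1-D 'valid' cross-correlation, the core op of a conv layer."""
    n = len(x) - len(k) + 1
    return np.array([np.dot(x[i:i + len(k)], k) for i in range(n)])

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
k = rng.standard_normal(3)

s = 4  # shift amount
shifted_then_conv = conv1d_valid(np.roll(x, s), k)
conv_then_shifted = np.roll(conv1d_valid(x, k), s)
# Interior outputs (away from the circular-shift boundary) agree exactly:
assert np.allclose(shifted_then_conv[s:-s], conv_then_shifted[s:-s])
```

The same commutation check fails for, say, a fully connected layer with arbitrary weights, which is what makes equivariance a nontrivial structural property to discover.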

Translation

Visualizations of Complex Sequences of Family-Infant Vocalizations Using Bag-of-Audio-Words Approach Based on Wav2vec 2.0 Features

no code implementations 29 Mar 2022 Jialu Li, Mark Hasegawa-Johnson, Nancy L. McElwain

We demonstrate that our high-quality visualizations capture major types of family vocalization interactions, in categories indicative of mental, behavioral, and developmental health, for both labeled and unlabeled LB audio.

Speaker Diarization

Unsupervised Text-to-Speech Synthesis by Unsupervised Automatic Speech Recognition

no code implementations 29 Mar 2022 Junrui Ni, Liming Wang, Heting Gao, Kaizhi Qian, Yang Zhang, Shiyu Chang, Mark Hasegawa-Johnson

An unsupervised text-to-speech synthesis (TTS) system learns to generate the speech waveform corresponding to any written sentence in a language by observing: 1) a collection of untranscribed speech waveforms in that language; 2) a collection of texts written in that language without access to any transcribed speech.

Automatic Speech Recognition Speech Synthesis +1

SpeechSplit 2.0: Unsupervised Speech Disentanglement for Voice Conversion without Tuning Autoencoder Bottlenecks

1 code implementation 26 Mar 2022 Chak Ho Chan, Kaizhi Qian, Yang Zhang, Mark Hasegawa-Johnson

SpeechSplit can perform aspect-specific voice conversion by disentangling speech into content, rhythm, pitch, and timbre using multiple autoencoders in an unsupervised manner.

Disentanglement Voice Conversion

Discovering Phonetic Inventories with Crosslingual Automatic Speech Recognition

1 code implementation 26 Jan 2022 Piotr Żelasko, Siyuan Feng, Laureano Moro Velazquez, Ali Abavisani, Saurabhchand Bhati, Odette Scharenborg, Mark Hasegawa-Johnson, Najim Dehak

In this paper, we 1) investigate the influence of different factors (i.e., model architecture, phonotactic model, type of speech representation) on phone recognition in an unknown language; 2) provide an analysis of which phones transfer well across languages and which do not in order to understand the limitations of and areas for further improvement for automatic phone inventory creation; and 3) present different methods to build a phone inventory of an unseen language in an unsupervised way.

Automatic Speech Recognition Transfer Learning +1

Fast and Efficient MMD-based Fair PCA via Optimization over Stiefel Manifold

1 code implementation 23 Sep 2021 Junghyun Lee, Gwangsu Kim, Matt Olfat, Mark Hasegawa-Johnson, Chang D. Yoo

This paper defines fair principal component analysis (PCA) as minimizing the maximum mean discrepancy (MMD) between dimensionality-reduced conditional distributions of different protected classes.
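As a rough illustration of the objective described in this excerpt (a minimal numpy sketch, not the paper's implementation or its Stiefel-manifold optimizer), the empirical RBF-kernel MMD² between two protected groups can be compared under different projection directions:

```python
import numpy as np

def rbf_mmd2(a, b, sigma=1.0):
    """Biased empirical MMD^2 between sample sets a, b with an RBF kernel."""
    def k(x, y):
        d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return k(a, a).mean() + k(b, b).mean() - 2 * k(a, b).mean()

rng = np.random.default_rng(0)
# Two synthetic protected groups whose distributions differ along axis 0.
g0 = rng.standard_normal((100, 5))
g1 = rng.standard_normal((100, 5)) + np.array([3.0, 0, 0, 0, 0])

# A projection that keeps the discriminative axis yields a large MMD^2;
# one that drops it (a "fairer" direction) yields a near-zero MMD^2.
v_unfair = np.eye(5)[:, :1]  # projects onto axis 0
v_fair = np.eye(5)[:, 1:2]   # projects onto axis 1
assert rbf_mmd2(g0 @ v_unfair, g1 @ v_unfair) > rbf_mmd2(g0 @ v_fair, g1 @ v_fair)
```

A fair PCA in this sense searches for projections that keep variance high while driving this MMD² term down.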

Fairness

Global Rhythm Style Transfer Without Text Transcriptions

no code implementations 16 Jun 2021 Kaizhi Qian, Yang Zhang, Shiyu Chang, JinJun Xiong, Chuang Gan, David Cox, Mark Hasegawa-Johnson

In this paper, we propose AutoPST, which can disentangle global prosody style from speech without relying on any text transcriptions.

Representation Learning Style Transfer

Worldly Wise (WoW) - Cross-Lingual Knowledge Fusion for Fact-based Visual Spoken-Question Answering

no code implementations NAACL 2021 Kiran Ramnath, Leda Sari, Mark Hasegawa-Johnson, Chang Yoo

Three sub-tasks are proposed: (1) speech-to-text based, (2) end-to-end, without speech-to-text as an intermediate component, and (3) cross-lingual, in which the question is spoken in a language different from that in which the KG is recorded.

Knowledge Graphs Question Answering +1

Seeing is Knowing! Fact-based Visual Question Answering using Knowledge Graph Embeddings

no code implementations 31 Dec 2020 Kiran Ramnath, Mark Hasegawa-Johnson

Therefore, being able to reason over incomplete KGs for QA is a critical requirement in real-world applications that has not been addressed extensively in the literature.

Common Sense Reasoning Knowledge Graph Embeddings +4

Multi-Decoder DPRNN: High Accuracy Source Counting and Separation

1 code implementation 24 Nov 2020 Junzhe Zhu, Raymond Yeh, Mark Hasegawa-Johnson

Beyond the model, we also propose a metric for evaluating source separation with a variable number of speakers.

Speech Separation

Show and Speak: Directly Synthesize Spoken Description of Images

1 code implementation 23 Oct 2020 Xinsheng Wang, Siyuan Feng, Jihua Zhu, Mark Hasegawa-Johnson, Odette Scharenborg

This paper proposes a new model, referred to as the show and speak (SAS) model that, for the first time, is able to directly synthesize spoken descriptions of images, bypassing the need for any text or phonemes.

How Phonotactics Affect Multilingual and Zero-shot ASR Performance

1 code implementation 22 Oct 2020 Siyuan Feng, Piotr Żelasko, Laureano Moro-Velázquez, Ali Abavisani, Mark Hasegawa-Johnson, Odette Scharenborg, Najim Dehak

Furthermore, we find that a multilingual LM hurts a multilingual ASR system's performance, and retaining only the target language's phonotactic data in LM training is preferable.

Automatic Speech Recognition

Deep F-measure Maximization for End-to-End Speech Understanding

no code implementations 8 Aug 2020 Leda Sari, Mark Hasegawa-Johnson

We propose a differentiable approximation to the F-measure and train the network with this objective using standard backpropagation.
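One common way to build such a differentiable surrogate (a sketch under assumptions; the paper's exact approximation may differ) is to replace the hard counts in the F-measure with expected counts under the predicted probabilities, so the loss is smooth in the network outputs:

```python
import numpy as np

def soft_f1_loss(probs, labels, eps=1e-8):
    """Differentiable surrogate of 1 - F1: hard true/false positive counts
    are replaced by expected counts under predicted probabilities."""
    tp = (probs * labels).sum()        # expected true positives
    fp = (probs * (1 - labels)).sum()  # expected false positives
    fn = ((1 - probs) * labels).sum()  # expected false negatives
    f1 = 2 * tp / (2 * tp + fp + fn + eps)
    return 1.0 - f1

labels = np.array([1, 1, 0, 0], dtype=float)
confident = np.array([0.9, 0.8, 0.1, 0.2])
uncertain = np.array([0.6, 0.5, 0.5, 0.4])
# Sharper, correct predictions give a lower loss:
assert soft_f1_loss(confident, labels) < soft_f1_loss(uncertain, labels)
```

Because every operation is smooth, the same expression can be dropped into an autograd framework and trained with standard backpropagation, as the abstract describes.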

Fairness Intent Detection +1

Evaluating Automatically Generated Phoneme Captions for Images

no code implementations 31 Jul 2020 Justin van der Hout, Zoltán D'Haese, Mark Hasegawa-Johnson, Odette Scharenborg

For this, first an Image2Speech system was implemented which generates image captions consisting of phoneme sequences.

Image Captioning

Identify Speakers in Cocktail Parties with End-to-End Attention

1 code implementation 22 May 2020 Junzhe Zhu, Mark Hasegawa-Johnson, Leda Sari

In scenarios where multiple speakers talk at the same time, it is important to be able to identify the talkers accurately.

Speaker Identification Speech Separation

That Sounds Familiar: an Analysis of Phonetic Representations Transfer Across Languages

no code implementations 16 May 2020 Piotr Żelasko, Laureano Moro-Velázquez, Mark Hasegawa-Johnson, Odette Scharenborg, Najim Dehak

Only a handful of the world's languages are abundant with the resources that enable practical applications of speech processing technologies.

Automatic Speech Recognition

Automatic Estimation of Intelligibility Measure for Consonants in Speech

no code implementations 12 May 2020 Ali Abavisani, Mark Hasegawa-Johnson

In this article, we provide a model to estimate a real-valued measure of the intelligibility of individual speech segments.

Automatic Speech Recognition

F0-consistent many-to-many non-parallel voice conversion via conditional autoencoder

1 code implementation 15 Apr 2020 Kaizhi Qian, Zeyu Jin, Mark Hasegawa-Johnson, Gautham J. Mysore

Recently, AutoVC, a conditional autoencoder (CAE) based method, achieved state-of-the-art results by disentangling speaker identity and speech content using information-constraining bottlenecks; it achieves zero-shot conversion by swapping in a different speaker's identity embedding to synthesize a new voice.

Style Transfer Voice Conversion

Continuous Convolutional Neural Network for Nonuniform Time Series

no code implementations 25 Sep 2019 Hui Shi, Yang Zhang, Hao Wu, Shiyu Chang, Kaizhi Qian, Mark Hasegawa-Johnson, Jishen Zhao

Convolutional neural network (CNN) for time series data implicitly assumes that the data are uniformly sampled, whereas many event-based and multi-modal data are nonuniform or have heterogeneous sampling rates.

Time Series

Fast transcription of speech in low-resource languages

1 code implementation 16 Sep 2019 Mark Hasegawa-Johnson, Camille Goudeseune, Gina-Anne Levow

We present software that, in only a few hours, transcribes forty hours of recorded speech in a surprise language, using only a few tens of megabytes of noisy text in that language, and a zero-resource grapheme to phoneme (G2P) table.

Language Modelling

AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss

10 code implementations 14 May 2019 Kaizhi Qian, Yang Zhang, Shiyu Chang, Xuesong Yang, Mark Hasegawa-Johnson

On the other hand, CVAE training is simple but does not come with the distribution-matching property of a GAN.

Style Transfer Voice Conversion

When CTC Training Meets Acoustic Landmarks

no code implementations 5 Nov 2018 Di He, Xuesong Yang, Boon Pang Lim, Yi Liang, Mark Hasegawa-Johnson, Deming Chen

In this paper, the convergence properties of CTC are improved by incorporating acoustic landmarks.

Automatic Speech Recognition

Improved ASR for Under-Resourced Languages Through Multi-Task Learning with Acoustic Landmarks

no code implementations 15 May 2018 Di He, Boon Pang Lim, Xuesong Yang, Mark Hasegawa-Johnson, Deming Chen

Furui first demonstrated that the identity of both consonant and vowel can be perceived from the C-V transition; later, Stevens proposed that acoustic landmarks are the primary cues for speech perception, and that steady-state regions are secondary or supplemental.

Automatic Speech Recognition Multi-Task Learning

Deep Learning Based Speech Beamforming

no code implementations 15 Feb 2018 Kaizhi Qian, Yang Zhang, Shiyu Chang, Xuesong Yang, Dinei Florencio, Mark Hasegawa-Johnson

On the other hand, deep learning based enhancement approaches are able to learn complicated speech distributions and perform efficient inference, but they are unable to deal with a variable number of input channels.

Speech Enhancement

Joint Modeling of Accents and Acoustics for Multi-Accent Speech Recognition

no code implementations 7 Feb 2018 Xuesong Yang, Kartik Audhkhasi, Andrew Rosenberg, Samuel Thomas, Bhuvana Ramabhadran, Mark Hasegawa-Johnson

The performance of automatic speech recognition systems degrades with increasing mismatch between the training and testing scenarios.

Automatic Speech Recognition

Dilated Recurrent Neural Networks

2 code implementations NeurIPS 2017 Shiyu Chang, Yang Zhang, Wei Han, Mo Yu, Xiaoxiao Guo, Wei Tan, Xiaodong Cui, Michael Witbrock, Mark Hasegawa-Johnson, Thomas S. Huang

To provide a theory-based quantification of the architecture's advantages, we introduce a memory capacity measure, the mean recurrent length, which is more suitable for RNNs with long skip connections than existing measures.

Sequential Image Classification

Performance Improvements of Probabilistic Transcript-adapted ASR with Recurrent Neural Network and Language-specific Constraints

no code implementations 13 Dec 2016 Xiang Kong, Preethi Jyothi, Mark Hasegawa-Johnson

Mismatched transcriptions have been proposed as a means to acquire probabilistic transcriptions from non-native speakers of a language. Prior work has demonstrated the value of these transcriptions by successfully adapting cross-lingual ASR systems for different target languages.

Cross-Lingual ASR

Clustering-based Phonetic Projection in Mismatched Crowdsourcing Channels for Low-resourced ASR

no code implementations WS 2016 Wenda Chen, Mark Hasegawa-Johnson, Nancy Chen, Preethi Jyothi, Lav Varshney

We evaluate our techniques using mismatched transcriptions for Cantonese speech acquired from native English and Mandarin speakers.

Landmark-based consonant voicing detection on multilingual corpora

no code implementations 10 Nov 2016 Xiang Kong, Xuesong Yang, Mark Hasegawa-Johnson, Jeung-Yoon Choi, Stefanie Shattuck-Hufnagel

Three consonant voicing classifiers were developed: (1) manually selected acoustic features anchored at a phonetic landmark, (2) MFCCs (either averaged across the segment or anchored at the landmark), and (3) acoustic features computed using a convolutional neural network (CNN).

Semantic Image Inpainting with Deep Generative Models

6 code implementations CVPR 2017 Raymond A. Yeh, Chen Chen, Teck Yian Lim, Alexander G. Schwing, Mark Hasegawa-Johnson, Minh N. Do

In this paper, we propose a novel method for semantic image inpainting, which generates the missing content by conditioning on the available data.

Image Inpainting

Joint Optimization of Masks and Deep Recurrent Neural Networks for Monaural Source Separation

2 code implementations 13 Feb 2015 Po-Sen Huang, Minje Kim, Mark Hasegawa-Johnson, Paris Smaragdis

In this paper, we explore joint optimization of masking functions and deep recurrent neural networks for monaural source separation tasks, including monaural speech separation, monaural singing voice separation, and speech denoising.
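The masking idea behind this line of work can be illustrated with a small numpy sketch (illustrative only; it shows the mask mechanics, not the paper's joint training of masks and recurrent networks): soft masks that sum to one per time-frequency bin multiply the mixture spectrogram to produce the source estimates.

```python
import numpy as np

def ratio_masks(mag1, mag2, eps=1e-8):
    """Soft masks from source magnitudes; they sum to ~1 per time-frequency
    bin, mirroring the normalized masking layer in mask-based separation."""
    total = mag1 + mag2 + eps
    return mag1 / total, mag2 / total

rng = np.random.default_rng(0)
# Stand-in magnitude spectrograms for two sources (freq x time).
s1 = np.abs(rng.standard_normal((64, 10)))
s2 = np.abs(rng.standard_normal((64, 10)))
mix = s1 + s2  # magnitudes treated as additive here (a simplification)

m1, m2 = ratio_masks(s1, s2)
est1, est2 = m1 * mix, m2 * mix
# Under the additivity assumption the masks recover each source closely:
assert np.allclose(est1, s1, atol=1e-5) and np.allclose(est2, s2, atol=1e-5)
```

In the actual systems the masks are predicted by a deep recurrent network from the mixture alone, and the network and masking function are optimized jointly against the clean sources.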

Denoising Speech Denoising +1

Automatic Long Audio Alignment and Confidence Scoring for Conversational Arabic Speech

no code implementations LREC 2014 Mohamed Elmahdy, Mark Hasegawa-Johnson, Eiman Mustafawi

In a second pass, a more restricted LM is generated for each audio segment, and unsupervised acoustic model adaptation is applied.

Speech Recognition
