Search Results for author: Mark Hasegawa-Johnson

Found 59 papers, 26 papers with code

Joint Optimization of Masks and Deep Recurrent Neural Networks for Monaural Source Separation

2 code implementations13 Feb 2015 Po-Sen Huang, Minje Kim, Mark Hasegawa-Johnson, Paris Smaragdis

In this paper, we explore joint optimization of masking functions and deep recurrent neural networks for monaural source separation tasks, including monaural speech separation, monaural singing voice separation, and speech denoising.

Denoising Speech Denoising +1
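
Below is a minimal PyTorch sketch of the masking idea described in the snippet above: an RNN predicts a soft time-frequency mask per source, and the masks are applied to the mixture spectrogram inside the network, so the masks and the recurrent weights are optimized jointly. Layer sizes, the softmax masking, and the MSE loss are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of joint mask + RNN optimization for two-source separation.
# Shapes, layer sizes, and the loss are illustrative assumptions.
import torch
import torch.nn as nn

class MaskingRNN(nn.Module):
    def __init__(self, n_freq=513, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(n_freq, hidden, num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden, 2 * n_freq)  # one mask per source

    def forward(self, mix_mag):                     # (batch, time, freq)
        h, _ = self.rnn(mix_mag)
        masks = self.proj(h).view(mix_mag.size(0), mix_mag.size(1), 2, -1)
        masks = torch.softmax(masks, dim=2)         # masks sum to 1 per T-F bin
        # Applying the masks inside the network keeps the masking layer in the
        # computation graph, so masks and RNN weights are optimized jointly.
        return masks * mix_mag.unsqueeze(2)         # (batch, time, 2, freq)

model = MaskingRNN()
mix = torch.rand(4, 100, 513)                       # fake mixture spectrograms
targets = torch.rand(4, 100, 2, 513)                # fake source spectrograms
loss = nn.functional.mse_loss(model(mix), targets)
loss.backward()
```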

Semantic Image Inpainting with Deep Generative Models

7 code implementations CVPR 2017 Raymond A. Yeh, Chen Chen, Teck Yian Lim, Alexander G. Schwing, Mark Hasegawa-Johnson, Minh N. Do

In this paper, we propose a novel method for semantic image inpainting, which generates the missing content by conditioning on the available data.

Image Inpainting
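
A hedged sketch of the general GAN-based inpainting recipe the abstract alludes to: search the latent space of a pretrained generator for a code whose output matches the known pixels (a context term) while staying realistic (a prior term from the discriminator), then blend. `G`, `D`, `latent_dim`, and the loss weights are placeholders, not the authors' implementation.

```python
# Hedged sketch: inpainting by optimizing a latent code of a pretrained GAN.
# `G` and `D` stand for a pretrained generator/discriminator (assumptions).
import torch

def inpaint(G, D, corrupted, mask, steps=1000, lam=0.1, lr=0.05):
    # corrupted: (1, 3, H, W) image with missing pixels; mask: 1 = known pixel
    z = torch.randn(1, G.latent_dim, requires_grad=True)  # latent_dim: assumed attribute
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        gen = G(z)
        context = ((gen - corrupted).abs() * mask).mean()  # match known pixels
        prior = -D(gen).mean()                             # keep the sample realistic
        loss = context + lam * prior
        opt.zero_grad()
        loss.backward()
        opt.step()
    # Blend: keep known pixels, fill the hole with generated content.
    return mask * corrupted + (1 - mask) * G(z).detach()
```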

Landmark-based consonant voicing detection on multilingual corpora

no code implementations10 Nov 2016 Xiang Kong, Xuesong Yang, Mark Hasegawa-Johnson, Jeung-Yoon Choi, Stefanie Shattuck-Hufnagel

Three consonant voicing classifiers were developed: (1) manually selected acoustic features anchored at a phonetic landmark, (2) MFCCs (either averaged across the segment or anchored at the landmark), and (3) acoustic features computed using a convolutional neural network (CNN).

Clustering-based Phonetic Projection in Mismatched Crowdsourcing Channels for Low-resourced ASR

no code implementations WS 2016 Wenda Chen, Mark Hasegawa-Johnson, Nancy Chen, Preethi Jyothi, Lav Varshney

We evaluate our techniques using mismatched transcriptions for Cantonese speech acquired from native English and Mandarin speakers.

Clustering

Performance Improvements of Probabilistic Transcript-adapted ASR with Recurrent Neural Network and Language-specific Constraints

no code implementations13 Dec 2016 Xiang Kong, Preethi Jyothi, Mark Hasegawa-Johnson

Mismatched transcriptions have been proposed as a means to acquire probabilistic transcriptions from non-native speakers of a language. Prior work has demonstrated the value of these transcriptions by successfully adapting cross-lingual ASR systems for different target languages.

Cross-Lingual ASR

Dilated Recurrent Neural Networks

2 code implementations NeurIPS 2017 Shiyu Chang, Yang Zhang, Wei Han, Mo Yu, Xiaoxiao Guo, Wei Tan, Xiaodong Cui, Michael Witbrock, Mark Hasegawa-Johnson, Thomas S. Huang

To provide a theory-based quantification of the architecture's advantages, we introduce a memory capacity measure, the mean recurrent length, which is more suitable for RNNs with long skip connections than existing measures.

Sequential Image Classification
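
For intuition, here is a toy single-layer version of a dilated recurrence: the hidden state at step t is updated from the hidden state d steps back, which is exactly the long recurrent skip connection that the mean-recurrent-length measure is designed to capture. The GRU cell and sizes are illustrative assumptions.

```python
# Toy sketch of one dilated-recurrent layer; cell choice and sizes are assumptions.
import torch
import torch.nn as nn

class DilatedRecurrentLayer(nn.Module):
    def __init__(self, input_size, hidden_size, dilation):
        super().__init__()
        self.cell = nn.GRUCell(input_size, hidden_size)
        self.dilation = dilation
        self.hidden_size = hidden_size

    def forward(self, x):                           # x: (batch, time, input_size)
        batch, time, _ = x.shape
        zeros = x.new_zeros(batch, self.hidden_size)
        states = []
        for t in range(time):
            # Recurrent skip connection of length `dilation`.
            prev = states[t - self.dilation] if t >= self.dilation else zeros
            states.append(self.cell(x[:, t], prev))
        return torch.stack(states, dim=1)           # (batch, time, hidden_size)

# Stacking layers with exponentially increasing dilations (1, 2, 4, ...) is
# what gives the architecture its long effective memory.
```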

Deep Learning Based Speech Beamforming

no code implementations15 Feb 2018 Kaizhi Qian, Yang Zhang, Shiyu Chang, Xuesong Yang, Dinei Florencio, Mark Hasegawa-Johnson

On the other hand, deep learning based enhancement approaches are able to learn complicated speech distributions and perform efficient inference, but they are unable to deal with a variable number of input channels.

Speech Enhancement

Improved ASR for Under-Resourced Languages Through Multi-Task Learning with Acoustic Landmarks

no code implementations15 May 2018 Di He, Boon Pang Lim, Xuesong Yang, Mark Hasegawa-Johnson, Deming Chen

Furui first demonstrated that the identity of both consonant and vowel can be perceived from the C-V transition; later, Stevens proposed that acoustic landmarks are the primary cues for speech perception, and that steady-state regions are secondary or supplemental.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

When CTC Training Meets Acoustic Landmarks

no code implementations5 Nov 2018 Di He, Xuesong Yang, Boon Pang Lim, Yi Liang, Mark Hasegawa-Johnson, Deming Chen

In this paper, the convergence properties of CTC are improved by incorporating acoustic landmarks.

Automatic Speech Recognition (ASR)

AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss

11 code implementations14 May 2019 Kaizhi Qian, Yang Zhang, Shiyu Chang, Xuesong Yang, Mark Hasegawa-Johnson

On the other hand, CVAE training is simple but does not come with the distribution-matching property of a GAN.

Style Transfer Voice Conversion

Fast transcription of speech in low-resource languages

1 code implementation16 Sep 2019 Mark Hasegawa-Johnson, Camille Goudeseune, Gina-Anne Levow

We present software that, in only a few hours, transcribes forty hours of recorded speech in a surprise language, using only a few tens of megabytes of noisy text in that language, and a zero-resource grapheme to phoneme (G2P) table.

Language Modelling

Continuous Convolutional Neural Network for Nonuniform Time Series

no code implementations25 Sep 2019 Hui Shi, Yang Zhang, Hao Wu, Shiyu Chang, Kaizhi Qian, Mark Hasegawa-Johnson, Jishen Zhao

Convolutional neural networks (CNNs) for time series data implicitly assume that the data are uniformly sampled, whereas many event-based and multi-modal data are nonuniform or have heterogeneous sampling rates.

Time Series Time Series Analysis
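
As a rough illustration of handling nonuniform sampling, the sketch below parameterizes the convolution kernel as a function of the real-valued time offset, so irregular timestamps can be convolved directly. This is a generic continuous-kernel formulation offered for intuition, not necessarily the paper's exact model.

```python
# Hedged sketch of a continuous convolution over irregularly sampled data.
import torch
import torch.nn as nn

class ContinuousConv(nn.Module):
    def __init__(self, in_ch, out_ch, hidden=32):
        super().__init__()
        self.kernel = nn.Sequential(               # maps a time offset to weights
            nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, in_ch * out_ch))
        self.in_ch, self.out_ch = in_ch, out_ch

    def forward(self, values, times, query_times):
        # values: (N, in_ch), times: (N,), query_times: (M,)
        offsets = (query_times[:, None] - times[None, :]).unsqueeze(-1)  # (M, N, 1)
        w = self.kernel(offsets).view(-1, values.size(0), self.out_ch, self.in_ch)
        # Weighted sum of the N observations for each query time.
        return torch.einsum('mnoi,ni->mo', w, values) / values.size(0)

layer = ContinuousConv(in_ch=3, out_ch=8)
t = torch.sort(torch.rand(50)).values               # irregular timestamps
y = layer(torch.randn(50, 3), t, torch.linspace(0, 1, 20))  # -> (20, 8)
```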

F0-consistent many-to-many non-parallel voice conversion via conditional autoencoder

1 code implementation15 Apr 2020 Kaizhi Qian, Zeyu Jin, Mark Hasegawa-Johnson, Gautham J. Mysore

Recently, AutoVC, a conditional autoencoder (CAE) based method, achieved state-of-the-art results by disentangling the speaker identity and speech content using information-constraining bottlenecks, and it achieves zero-shot conversion by swapping in a different speaker's identity embedding to synthesize a new voice.

Style Transfer Voice Conversion
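
The mechanism described above can be illustrated with a bare-bones conditional autoencoder: a narrow bottleneck encodes content, a speaker embedding is concatenated back in for decoding, and swapping that embedding at inference converts the voice. Dimensions, layers, and the GRU choice below are assumptions for illustration, not the AutoVC implementation.

```python
# Minimal sketch of a bottlenecked conditional autoencoder for conversion.
import torch
import torch.nn as nn

class ConditionalAutoencoder(nn.Module):
    def __init__(self, n_mel=80, spk_dim=256, bottleneck=32):
        super().__init__()
        self.encoder = nn.GRU(n_mel, bottleneck, batch_first=True)
        self.decoder = nn.GRU(bottleneck + spk_dim, 512, batch_first=True)
        self.out = nn.Linear(512, n_mel)

    def forward(self, mel, spk_emb):                # mel: (B, T, 80), spk_emb: (B, 256)
        content, _ = self.encoder(mel)              # narrow code: content, little identity
        cond = spk_emb.unsqueeze(1).expand(-1, mel.size(1), -1)
        dec, _ = self.decoder(torch.cat([content, cond], dim=-1))
        return self.out(dec)

# Training: reconstruct mel with the *source* speaker's embedding.
# Conversion: feed the same content code with a *target* speaker's embedding.
```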

Automatic Estimation of Intelligibility Measure for Consonants in Speech

no code implementations12 May 2020 Ali Abavisani, Mark Hasegawa-Johnson

In this article, we provide a model to estimate a real-valued measure of the intelligibility of individual speech segments.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Identify Speakers in Cocktail Parties with End-to-End Attention

1 code implementation22 May 2020 Junzhe Zhu, Mark Hasegawa-Johnson, Leda Sari

In scenarios where multiple speakers talk at the same time, it is important to be able to identify the talkers accurately.

Speaker Identification Speech Separation

Evaluating Automatically Generated Phoneme Captions for Images

no code implementations31 Jul 2020 Justin van der Hout, Zoltán D'Haese, Mark Hasegawa-Johnson, Odette Scharenborg

For this, first an Image2Speech system was implemented which generates image captions consisting of phoneme sequences.

Image Captioning

Deep F-measure Maximization for End-to-End Speech Understanding

no code implementations8 Aug 2020 Leda Sari, Mark Hasegawa-Johnson

We propose a differentiable approximation to the F-measure and train the network with this objective using standard backpropagation.

Fairness Intent Detection +1
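
One standard way to make the F-measure differentiable, consistent with the description above, is to replace hard decisions with predicted probabilities so that true and false positives become soft counts; the snippet below shows such a soft-F1 loss. It is a generic surrogate, not necessarily the paper's exact approximation.

```python
# Generic differentiable (soft) F1 surrogate, trainable by backpropagation.
import torch

def soft_f1_loss(probs, labels, eps=1e-8):
    # probs: (N,) predicted probabilities in [0, 1]; labels: (N,) in {0, 1}
    tp = (probs * labels).sum()                    # soft true positives
    fp = (probs * (1 - labels)).sum()              # soft false positives
    fn = ((1 - probs) * labels).sum()              # soft false negatives
    f1 = 2 * tp / (2 * tp + fp + fn + eps)
    return 1.0 - f1                                # minimize 1 - F1 == maximize F1

probs = torch.sigmoid(torch.randn(16, requires_grad=True))
labels = torch.randint(0, 2, (16,)).float()
soft_f1_loss(probs, labels).backward()
```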

How Phonotactics Affect Multilingual and Zero-shot ASR Performance

1 code implementation22 Oct 2020 Siyuan Feng, Piotr Żelasko, Laureano Moro-Velázquez, Ali Abavisani, Mark Hasegawa-Johnson, Odette Scharenborg, Najim Dehak

Furthermore, we find that a multilingual LM hurts a multilingual ASR system's performance, and retaining only the target language's phonotactic data in LM training is preferable.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Show and Speak: Directly Synthesize Spoken Description of Images

1 code implementation23 Oct 2020 Xinsheng Wang, Siyuan Feng, Jihua Zhu, Mark Hasegawa-Johnson, Odette Scharenborg

This paper proposes a new model, referred to as the show and speak (SAS) model that, for the first time, is able to directly synthesize spoken descriptions of images, bypassing the need for any text or phonemes.

Seeing is Knowing! Fact-based Visual Question Answering using Knowledge Graph Embeddings

no code implementations31 Dec 2020 Kiran Ramnath, Mark Hasegawa-Johnson

Therefore, being able to reason over incomplete KGs for QA is a critical requirement in real-world applications that has not been addressed extensively in the literature.

Common Sense Reasoning Knowledge Graph Embeddings +4

Worldly Wise (WoW) - Cross-Lingual Knowledge Fusion for Fact-based Visual Spoken-Question Answering

no code implementations NAACL 2021 Kiran Ramnath, Leda Sari, Mark Hasegawa-Johnson, Chang Yoo

Three sub-tasks are proposed: (1) speech-to-text based, (2) end-to-end, without speech-to-text as an intermediate component, and (3) cross-lingual, in which the question is spoken in a language different from that in which the KG is recorded.

Knowledge Graphs Question Answering +2

Global Rhythm Style Transfer Without Text Transcriptions

1 code implementation16 Jun 2021 Kaizhi Qian, Yang Zhang, Shiyu Chang, JinJun Xiong, Chuang Gan, David Cox, Mark Hasegawa-Johnson

In this paper, we propose AutoPST, which can disentangle global prosody style from speech without relying on any text transcriptions.

Representation Learning Style Transfer

Fast and Efficient MMD-based Fair PCA via Optimization over Stiefel Manifold

2 code implementations23 Sep 2021 Junghyun Lee, Gwangsu Kim, Matt Olfat, Mark Hasegawa-Johnson, Chang D. Yoo

This paper defines fair principal component analysis (PCA) as minimizing the maximum mean discrepancy (MMD) between dimensionality-reduced conditional distributions of different protected classes.

Fairness
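
The fairness criterion can be sketched as follows: project the data of two protected groups onto k orthonormal directions (a point on the Stiefel manifold) and measure the RBF-kernel MMD between the projected distributions; fair PCA then trades this discrepancy off against explained variance while optimizing over orthonormal projections. The kernel width and synthetic data below are illustrative assumptions.

```python
# Sketch: RBF-kernel MMD between two groups after an orthonormal projection.
import torch

def rbf_mmd2(x, y, sigma=1.0):
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

d, k_dims = 10, 2
V, _ = torch.linalg.qr(torch.randn(d, k_dims))      # orthonormal columns (Stiefel point)
group_a, group_b = torch.randn(100, d), torch.randn(80, d) + 0.5

# Fair PCA balances explained variance against this group discrepancy,
# keeping V orthonormal throughout the optimization.
mmd2 = rbf_mmd2(group_a @ V, group_b @ V)
print(float(mmd2))
```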

Discovering Phonetic Inventories with Crosslingual Automatic Speech Recognition

1 code implementation26 Jan 2022 Piotr Żelasko, Siyuan Feng, Laureano Moro Velazquez, Ali Abavisani, Saurabhchand Bhati, Odette Scharenborg, Mark Hasegawa-Johnson, Najim Dehak

In this paper, we 1) investigate the influence of different factors (i.e., model architecture, phonotactic model, type of speech representation) on phone recognition in an unknown language; 2) provide an analysis of which phones transfer well across languages and which do not in order to understand the limitations of and areas for further improvement for automatic phone inventory creation; and 3) present different methods to build a phone inventory of an unseen language in an unsupervised way.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

SpeechSplit 2.0: Unsupervised Speech Disentanglement for Voice Conversion Without Tuning Autoencoder Bottlenecks

1 code implementation26 Mar 2022 Chak Ho Chan, Kaizhi Qian, Yang Zhang, Mark Hasegawa-Johnson

SpeechSplit can perform aspect-specific voice conversion by disentangling speech into content, rhythm, pitch, and timbre using multiple autoencoders in an unsupervised manner.

Disentanglement Voice Conversion

Visualizations of Complex Sequences of Family-Infant Vocalizations Using Bag-of-Audio-Words Approach Based on Wav2vec 2.0 Features

1 code implementation29 Mar 2022 Jialu Li, Mark Hasegawa-Johnson, Nancy L. McElwain

We demonstrate that our high-quality visualizations capture major types of family vocalization interactions, in categories indicative of mental, behavioral, and developmental health, for both labeled and unlabeled LB audio.

speaker-diarization Speaker Diarization

Unsupervised Text-to-Speech Synthesis by Unsupervised Automatic Speech Recognition

1 code implementation29 Mar 2022 Junrui Ni, Liming Wang, Heting Gao, Kaizhi Qian, Yang Zhang, Shiyu Chang, Mark Hasegawa-Johnson

An unsupervised text-to-speech synthesis (TTS) system learns to generate speech waveforms corresponding to any written sentence in a language by observing: 1) a collection of untranscribed speech waveforms in that language; 2) a collection of texts written in that language without access to any transcribed speech.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

Equivariance Discovery by Learned Parameter-Sharing

1 code implementation7 Apr 2022 Raymond A. Yeh, Yuan-Ting Hu, Mark Hasegawa-Johnson, Alexander G. Schwing

Designing equivariance as an inductive bias into deep-nets has been a prominent approach to build effective models, e.g., a convolutional neural network incorporates translation equivariance.

Inductive Bias Translation

ContentVec: An Improved Self-Supervised Speech Representation by Disentangling Speakers

1 code implementation20 Apr 2022 Kaizhi Qian, Yang Zhang, Heting Gao, Junrui Ni, Cheng-I Lai, David Cox, Mark Hasegawa-Johnson, Shiyu Chang

Self-supervised learning in speech involves training a speech representation network on a large-scale unannotated speech corpus, and then applying the learned representations to downstream tasks.

Disentanglement Self-Supervised Learning

End-to-End Zero-Shot Voice Conversion with Location-Variable Convolutions

1 code implementation19 May 2022 Wonjune Kang, Mark Hasegawa-Johnson, Deb Roy

Zero-shot voice conversion is becoming an increasingly popular research topic, as it promises the ability to transform speech to sound like any speaker.

Speech Synthesis Style Transfer +1

Forget-free Continual Learning with Winning Subnetworks

1 code implementation International Conference on Machine Learning 2022 Haeyong Kang, Rusty John Lloyd Mina, Sultan Rizky Hikmawan Madjid, Jaehong Yoon, Mark Hasegawa-Johnson, Sung Ju Hwang, Chang D. Yoo

Inspired by the Lottery Ticket Hypothesis, which posits that competitive subnetworks exist within a dense network, we propose a continual learning method referred to as Winning SubNetworks (WSN), which sequentially learns and selects an optimal subnetwork for each task.

Continual Learning
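
A toy rendering of the subnetwork-selection idea: shared weights are frozen, and each task learns a score tensor whose top-scoring entries define a binary mask, i.e., that task's subnetwork, so earlier tasks are not overwritten. The hard top-k mask shown here would need a straight-through estimator to train the scores; all details are assumptions, not the paper's implementation.

```python
# Toy sketch: frozen shared weights with one learned binary mask per task.
import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    def __init__(self, in_f, out_f, sparsity=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_f, in_f), requires_grad=False)
        self.scores = nn.ParameterDict()            # one learnable score tensor per task
        self.sparsity = sparsity

    def add_task(self, task):
        self.scores[task] = nn.Parameter(torch.randn_like(self.weight))

    def forward(self, x, task):
        s = self.scores[task].flatten()
        k = int(s.numel() * self.sparsity)
        thresh = s.topk(k).values.min()             # keep the top-scoring weights
        # NOTE: training through this hard mask needs a straight-through estimator.
        mask = (self.scores[task] >= thresh).float()
        return nn.functional.linear(x, self.weight * mask)

layer = MaskedLinear(16, 8)
layer.add_task("task1")
out = layer(torch.randn(4, 16), "task1")            # -> (4, 8)
```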

Dual-Path Cross-Modal Attention for better Audio-Visual Speech Extraction

no code implementations9 Jul 2022 Zhongweiyang Xu, Xulin Fan, Mark Hasegawa-Johnson

Most current research upsamples the visual features along the time dimension so that audio and video features are able to align in time.

Speech Extraction
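
The temporal-alignment step mentioned above is straightforward to sketch: upsample the video feature sequence along time (e.g., with linear interpolation) to the audio frame rate before fusing the two streams. The frame rates and feature sizes below are assumptions for illustration.

```python
# Sketch: upsample video features along time to match the audio frame rate.
import torch
import torch.nn.functional as F

audio = torch.randn(1, 400, 256)                    # (batch, audio_frames, dim)
video = torch.randn(1, 100, 512)                    # (batch, video_frames, dim)

# interpolate expects (batch, channels, length), so move time to the last axis.
video_up = F.interpolate(video.transpose(1, 2), size=audio.size(1),
                         mode='linear', align_corners=False).transpose(1, 2)
print(video_up.shape)                               # torch.Size([1, 400, 512])

fused = torch.cat([audio, video_up], dim=-1)        # now aligned frame-by-frame
```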

SMSMix: Sense-Maintained Sentence Mixup for Word Sense Disambiguation

no code implementations14 Dec 2022 Hee Suk Yoon, Eunseop Yoon, John Harvill, Sunjae Yoon, Mark Hasegawa-Johnson, Chang D. Yoo

To the best of our knowledge, this is the first attempt to apply mixup in NLP while preserving the meaning of a specific word.

Data Augmentation Sentence +1

Towards Robust Family-Infant Audio Analysis Based on Unsupervised Pretraining of Wav2vec 2.0 on Large-Scale Unlabeled Family Audio

no code implementations21 May 2023 Jialu Li, Mark Hasegawa-Johnson, Nancy L. McElwain

To perform automatic family audio analysis, past studies have collected recordings using phone, video, or audio-only recording devices like LENA, investigated supervised learning methods, and used or fine-tuned general-purpose embeddings learned from large pretrained models.

speaker-diarization Speaker Diarization

INTapt: Information-Theoretic Adversarial Prompt Tuning for Enhanced Non-Native Speech Recognition

no code implementations25 May 2023 Eunseop Yoon, Hee Suk Yoon, John Harvill, Mark Hasegawa-Johnson, Chang D. Yoo

INTapt is trained simultaneously in the following two manners: (1) adversarial training to reduce accent feature dependence between the original input and the prompt-concatenated input and (2) training to minimize CTC loss for improving ASR performance to a prompt-concatenated input.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

A Theory of Unsupervised Speech Recognition

1 code implementation9 Jun 2023 Liming Wang, Mark Hasegawa-Johnson, Chang D. Yoo

Unsupervised speech recognition (ASR-U) is the problem of learning automatic speech recognition (ASR) systems from unpaired speech-only and text-only corpora.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Enhancing Child Vocalization Classification in Multi-Channel Child-Adult Conversations Through Wav2vec2 Children ASR Features

no code implementations13 Sep 2023 Jialu Li, Mark Hasegawa-Johnson, Karrie Karahalios

In this study, we leverage the self-supervised learning model, Wav2Vec 2.0 (W2V2), pretrained on 4300h of home recordings of children under 5 years old, to build a unified system that performs both speaker diarization (SD) and vocalization classification (VC) tasks.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

Unsupervised Speech Recognition with N-Skipgram and Positional Unigram Matching

1 code implementation3 Oct 2023 Liming Wang, Mark Hasegawa-Johnson, Chang D. Yoo

Training unsupervised speech recognition systems presents challenges due to GAN-associated instability, misalignment between speech and text, and significant memory demands.

speech-recognition Speech Recognition +1

HiFi Tuner: High-Fidelity Subject-Driven Fine-Tuning for Diffusion Models

no code implementations30 Nov 2023 Zhonghao Wang, Wei Wei, Yang Zhao, Zhisheng Xiao, Mark Hasegawa-Johnson, Humphrey Shi, Tingbo Hou

We further extend our method to a novel image editing task: substituting the subject in an image through textual manipulations.

Denoising Image Generation

Analysis of Self-Supervised Speech Models on Children's Speech and Infant Vocalizations

no code implementations10 Feb 2024 Jialu Li, Mark Hasegawa-Johnson, Nancy L. McElwain

To understand why self-supervised learning (SSL) models have empirically achieved strong performances on several speech-processing downstream tasks, numerous studies have focused on analyzing the encoded information of the SSL layer representations in adult speech.

Self-Supervised Learning

C-TPT: Calibrated Test-Time Prompt Tuning for Vision-Language Models via Text Feature Dispersion

no code implementations21 Mar 2024 Hee Suk Yoon, Eunseop Yoon, Joshua Tian Jin Tee, Mark Hasegawa-Johnson, Yingzhen Li, Chang D. Yoo

Through a series of observations, we find that the prompt choice significantly affects the calibration in CLIP, where the prompts leading to higher text feature dispersion result in better-calibrated predictions.

Test-time Adaptation

Syn2Vec: Synset Colexification Graphs for Lexical Semantic Similarity

1 code implementation NAACL 2022 John Harvill, Roxana Girju, Mark Hasegawa-Johnson

In this paper we focus on patterns of colexification (co-expressions of form-meaning mapping in the lexicon) as an aspect of lexical-semantic organization, and use them to build large scale synset graphs across BabelNet’s typologically diverse set of 499 world languages.

Semantic Similarity Semantic Textual Similarity
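
A small illustration of how a colexification graph can be assembled: senses (synsets) are nodes, and an edge links two senses whenever some language expresses both with the same word form. The toy two-language lexicon below is invented purely for illustration; it is not BabelNet data.

```python
# Hedged sketch of building a colexification graph from a toy lexicon.
import itertools
import networkx as nx

# language -> word form -> set of sense IDs it covers (invented example data)
lexicon = {
    "lang_a": {"hand": {"HAND", "ARM"}},             # colexifies HAND and ARM
    "lang_b": {"mano": {"HAND"}, "brazo": {"ARM"}},
}

G = nx.Graph()
for lang, forms in lexicon.items():
    for form, senses in forms.items():
        G.add_nodes_from(senses)
        for u, v in itertools.combinations(sorted(senses), 2):
            # weight counts how many (language, form) pairs colexify u and v
            w = G.get_edge_data(u, v, {}).get("weight", 0)
            G.add_edge(u, v, weight=w + 1)

print(G.edges(data=True))                            # [('ARM', 'HAND', {'weight': 1})]
```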
