Search Results for author: David Harwath

Found 44 papers, 23 papers with code

Deep Multimodal Semantic Embeddings for Speech and Images

no code implementations11 Nov 2015 David Harwath, James Glass

In this paper, we present a model which takes as input a corpus of images with relevant spoken captions and finds a correspondence between the two modalities.

Image Retrieval
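
A minimal sketch of the kind of cross-modal embedding objective described in the abstract above: an image encoder and a spoken-caption encoder mapped into a shared space and trained with a margin ranking loss over a batch. The encoder architectures, feature dimensions, and margin are illustrative assumptions, not the paper's exact configuration.

```python
# Illustrative sketch (not the paper's exact model): two encoders embed images
# and spoken captions into one space; matching pairs must outscore mismatched
# pairs by a margin. Dimensions and architectures are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    def __init__(self, image_dim=2048, audio_dim=40, embed_dim=512):
        super().__init__()
        self.image_proj = nn.Linear(image_dim, embed_dim)                 # assumes precomputed image features
        self.audio_rnn = nn.GRU(audio_dim, embed_dim, batch_first=True)   # assumes spectrogram frames

    def forward(self, image_feats, audio_frames):
        img = F.normalize(self.image_proj(image_feats), dim=-1)
        _, h = self.audio_rnn(audio_frames)
        aud = F.normalize(h[-1], dim=-1)
        return img, aud

def margin_ranking_loss(img, aud, margin=0.2):
    sim = img @ aud.t()               # batch x batch similarity matrix
    pos = sim.diag().unsqueeze(1)     # similarity of true image/caption pairs
    mask = 1.0 - torch.eye(sim.size(0), device=sim.device)
    # Penalize any mismatched pair that comes within `margin` of the true pair.
    cost_caption = (F.relu(margin + sim - pos) * mask).mean()
    cost_image = (F.relu(margin + sim - pos.t()) * mask).mean()
    return cost_caption + cost_image
```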

Learning Word-Like Units from Joint Audio-Visual Analysis

no code implementations ACL 2017 David Harwath, James R. Glass

Given a collection of images and spoken audio captions, we present a method for discovering word-like acoustic units in the continuous speech signal and grounding them to semantically relevant image regions.

Automatic Speech Recognition (ASR) +2

Learning Modality-Invariant Representations for Speech and Images

no code implementations11 Dec 2017 Kenneth Leidal, David Harwath, James Glass

In this paper, we explore the unsupervised learning of a semantic embedding space for co-occurring sensory inputs.

Information Retrieval Retrieval +3

Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input

no code implementations ECCV 2018 David Harwath, Adrià Recasens, Dídac Surís, Galen Chuang, Antonio Torralba, James Glass

In this paper, we explore neural network models that learn to associate segments of spoken audio captions with the semantically relevant portions of natural images that they refer to.

Retrieval

Vision as an Interlingua: Learning Multilingual Semantic Embeddings of Untranscribed Speech

no code implementations9 Apr 2018 David Harwath, Galen Chuang, James Glass

In this paper, we explore the learning of neural network embeddings for natural images and speech waveforms describing the content of those images.

Retrieval Speech Recognition +1

Towards Visually Grounded Sub-Word Speech Unit Discovery

no code implementations21 Feb 2019 David Harwath, James Glass

In this paper, we investigate the manner in which interpretable sub-word speech units emerge within a convolutional neural network model trained to associate raw speech waveforms with semantically related natural image scenes.

Transfer Learning from Audio-Visual Grounding to Speech Recognition

no code implementations9 Jul 2019 Wei-Ning Hsu, David Harwath, James Glass

Transfer learning aims to reduce the amount of data required to excel at a new task by re-using the knowledge acquired from learning other related tasks.

Speech Recognition +2

Text-Free Image-to-Speech Synthesis Using Learned Segmental Units

no code implementations ACL 2021 Wei-Ning Hsu, David Harwath, Christopher Song, James Glass

In this paper we present the first model for directly synthesizing fluent, natural-sounding spoken audio captions for images that does not require natural language text as an intermediate representation or source of supervision.

Image Captioning Speech Synthesis +1

Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions

no code implementations CVPR 2021 Mathew Monfort, SouYoung Jin, Alexander Liu, David Harwath, Rogerio Feris, James Glass, Aude Oliva

With this in mind, the descriptions people generate for videos of different dynamic events can greatly improve our understanding of the key information of interest in each video.

Contrastive Learning Retrieval +1

Learning Audio-Visual Dereverberation

1 code implementation14 Jun 2021 Changan Chen, Wei Sun, David Harwath, Kristen Grauman

We introduce Visually-Informed Dereverberation of Audio (VIDA), an end-to-end approach that learns to remove reverberation based on both the observed monaural sound and visual scene.

Automatic Speech Recognition (ASR) +3

Fast-Slow Transformer for Visually Grounding Speech

1 code implementation16 Sep 2021 Puyuan Peng, David Harwath

We present Fast-Slow Transformer for Visually Grounding Speech, or FaST-VGS.

Image Retrieval Retrieval

Routing with Self-Attention for Multimodal Capsule Networks

no code implementations1 Dec 2021 Kevin Duarte, Brian Chen, Nina Shvetsova, Andrew Rouditchenko, Samuel Thomas, Alexander Liu, David Harwath, James Glass, Hilde Kuehne, Mubarak Shah

We present a new multimodal capsule network that allows us to leverage the strength of capsules in the context of a multimodal learning framework on large amounts of video data.

Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval

1 code implementation8 Dec 2021 Nina Shvetsova, Brian Chen, Andrew Rouditchenko, Samuel Thomas, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Hilde Kuehne

Multi-modal learning from video data has seen increased attention recently, as it allows training semantically meaningful embeddings without human annotation, enabling tasks like zero-shot retrieval and classification.

Action Localization Retrieval +2

Everything at Once - Multi-Modal Fusion Transformer for Video Retrieval

1 code implementation CVPR 2022 Nina Shvetsova, Brian Chen, Andrew Rouditchenko, Samuel Thomas, Brian Kingsbury, Rogerio S. Feris, David Harwath, James Glass, Hilde Kuehne

In this work, we present a multi-modal, modality-agnostic fusion transformer that learns to exchange information between multiple modalities, such as video, audio, and text, and integrate them into a fused representation in a joint multi-modal embedding space.

Action Localization Retrieval +2
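
A rough sketch of the modality-agnostic fusion idea described in the abstract above: token sequences from any subset of modalities are concatenated and processed by one shared transformer encoder, then pooled into a single fused embedding. The dimensions, layer counts, and mean pooling are assumptions for illustration only.

```python
# Illustrative sketch of a modality-agnostic fusion transformer: tokens from
# any combination of modalities share one encoder and are pooled into a joint
# embedding. All dimensions here are placeholder assumptions.
import torch
import torch.nn as nn

class FusionTransformer(nn.Module):
    def __init__(self, dim=512, heads=8, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, *modality_tokens):
        # Each argument: (batch, seq_len_m, dim) tokens from one modality.
        fused = self.encoder(torch.cat(modality_tokens, dim=1))
        return fused.mean(dim=1)  # pooled joint embedding

# Example: fuse video, audio, and text token sequences.
video = torch.randn(4, 16, 512)
audio = torch.randn(4, 32, 512)
text = torch.randn(4, 12, 512)
joint = FusionTransformer()(video, audio, text)  # shape (4, 512)
```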

MAE-AST: Masked Autoencoding Audio Spectrogram Transformer

2 code implementations30 Mar 2022 Alan Baade, Puyuan Peng, David Harwath

In this paper, we propose a simple yet powerful improvement over the recent Self-Supervised Audio Spectrogram Transformer (SSAST) model for speech and audio classification.

Audio Classification

Contrastive Audio-Visual Masked Autoencoder

1 code implementation2 Oct 2022 Yuan Gong, Andrew Rouditchenko, Alexander H. Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, James Glass

In this paper, we first extend the recent Masked Auto-Encoder (MAE) model from a single modality to audio-visual multi-modalities.

Ranked #1 on Audio Tagging on AudioSet (using extra training data)

Audio Classification Audio Tagging +4
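
At a high level, the abstract above describes pairing MAE-style masked reconstruction with an audio-visual contrastive objective. The sketch below shows only how two such losses might be combined; the encoders, masking strategy, and loss weight are placeholder assumptions rather than the paper's recipe.

```python
# Illustrative sketch: masked-reconstruction loss plus an audio-visual
# contrastive loss, combined with an assumed weighting factor.
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, video_emb, temperature=0.07):
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = a @ v.t() / temperature
    targets = torch.arange(len(a), device=a.device)
    # Symmetric InfoNCE over the batch of audio/video pairs.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def masked_av_loss(recon_a, target_a, recon_v, target_v, a_cls, v_cls, lam=0.01):
    # MAE-style reconstruction of masked audio and video patches,
    # plus cross-modal contrast between pooled audio and video embeddings.
    recon = F.mse_loss(recon_a, target_a) + F.mse_loss(recon_v, target_v)
    return recon + lam * contrastive_loss(a_cls, v_cls)
```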

SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model

1 code implementation3 Oct 2022 Yi-Jen Shih, Hsuan-Fu Wang, Heng-Jui Chang, Layne Berry, Hung-Yi Lee, David Harwath

Data-driven speech processing models usually perform well with a large amount of text supervision, but collecting transcribed speech data is costly.

Language Modelling Retrieval +1

C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval

1 code implementation7 Oct 2022 Andrew Rouditchenko, Yung-Sung Chuang, Nina Shvetsova, Samuel Thomas, Rogerio Feris, Brian Kingsbury, Leonid Karlinsky, David Harwath, Hilde Kuehne, James Glass

Inspired by the fact that English text-video retrieval outperforms other languages, we train a student model using input text in different languages to match the cross-modal predictions from teacher models using input text in English.

Knowledge Distillation Retrieval +2
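
A minimal sketch of the cross-lingual, cross-modal distillation setup described in the abstract above: a student encoder fed non-English text is trained to match the text-to-video similarity distribution produced by a teacher fed the parallel English text. The temperature and the KL formulation are assumptions for illustration.

```python
# Illustrative sketch: the student's text-video retrieval scores are pushed
# toward the teacher's (English-text) scores with a KL-divergence loss.
# Encoders producing the embeddings are assumed to exist elsewhere.
import torch
import torch.nn.functional as F

def cross_lingual_distillation_loss(student_text_emb, teacher_text_emb, video_emb, tau=0.05):
    # Text-to-video similarity distributions over the batch of videos.
    teacher_probs = F.softmax(teacher_text_emb @ video_emb.t() / tau, dim=-1)
    student_logp = F.log_softmax(student_text_emb @ video_emb.t() / tau, dim=-1)
    return F.kl_div(student_logp, teacher_probs, reduction="batchmean")
```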

Why is Winoground Hard? Investigating Failures in Visuolinguistic Compositionality

1 code implementation1 Nov 2022 Anuj Diwan, Layne Berry, Eunsol Choi, David Harwath, Kyle Mahowald

Recent visuolinguistic pre-trained models show promising progress on various end tasks such as image retrieval and video captioning.

Data Augmentation Image Retrieval +2

M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech to Image Retrieval

no code implementations2 Nov 2022 Layne Berry, Yi-Jen Shih, Hsuan-Fu Wang, Heng-Jui Chang, Hung-Yi Lee, David Harwath

This work investigates the use of large-scale, English-only pre-trained models (CLIP and HuBERT) for multilingual image-speech retrieval.

Image Retrieval Retrieval +1

Phoneme Segmentation Using Self-Supervised Speech Models

1 code implementation2 Nov 2022 Luke Strgar, David Harwath

We apply transfer learning to the task of phoneme segmentation and demonstrate the utility of representations learned in self-supervised pre-training for the task.

Segmentation Transfer Learning
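
A minimal sketch of transferring self-supervised speech representations to boundary prediction, in the spirit of the abstract above: a small frame-level classifier on top of frozen features from a wav2vec 2.0 / HuBERT-style encoder. The feature dimension, head architecture, and binary boundary targets are assumptions, not the paper's exact recipe.

```python
# Illustrative sketch: frame-level phoneme-boundary classification on top of
# frozen self-supervised speech features. Dimensions are assumptions.
import torch
import torch.nn as nn

class BoundaryHead(nn.Module):
    def __init__(self, feat_dim=768):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def forward(self, ssl_features):
        # ssl_features: (batch, frames, feat_dim) from a frozen pre-trained model.
        return self.classifier(ssl_features).squeeze(-1)  # per-frame boundary logits

head = BoundaryHead()
logits = head(torch.randn(2, 300, 768))
loss = nn.BCEWithLogitsLoss()(logits, torch.zeros(2, 300))  # 0/1 boundary targets
```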

Unsupervised Fine-Tuning Data Selection for ASR Using Self-Supervised Speech Models

no code implementations3 Dec 2022 Reem Gody, David Harwath

Self-supervised learning (SSL) has been able to leverage unlabeled data to boost the performance of automatic speech recognition (ASR) models when we have access to only a small amount of transcribed speech data.

Automatic Speech Recognition (ASR) +2

Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization

1 code implementation18 May 2023 Puyuan Peng, Brian Yan, Shinji Watanabe, David Harwath

We investigate the emergent abilities of the recently proposed web-scale speech model Whisper, by adapting it to unseen tasks with prompt engineering.

Audio-Visual Speech Recognition Prompt Engineering +2

Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model

2 code implementations19 May 2023 Puyuan Peng, Shang-Wen Li, Okko Räsänen, Abdelrahman Mohamed, David Harwath

In this paper, we show that representations capturing syllabic units emerge when training a self-supervised speech model with a visually-grounded training objective.

Language Modelling Masked Language Modeling +3

Comparison of Multilingual Self-Supervised and Weakly-Supervised Speech Pre-Training for Adaptation to Unseen Languages

no code implementations21 May 2023 Andrew Rouditchenko, Sameer Khurana, Samuel Thomas, Rogerio Feris, Leonid Karlinsky, Hilde Kuehne, David Harwath, Brian Kingsbury, James Glass

Recent models such as XLS-R and Whisper have made multilingual speech technologies more accessible by pre-training on audio from around 100 spoken languages each.

Textless Low-Resource Speech-to-Speech Translation With Unit Language Models

1 code implementation24 May 2023 Anuj Diwan, Anirudh Srinivasan, David Harwath, Eunsol Choi

We train and evaluate our models for English-to-German, German-to-English and Marathi-to-English translation on three different domains (European Parliament, Common Voice, and All India Radio) with single-speaker synthesized speech data.

Automatic Speech Recognition Denoising +6

When to Use Efficient Self Attention? Profiling Text, Speech and Image Transformer Variants

1 code implementation14 Jun 2023 Anuj Diwan, Eunsol Choi, David Harwath

We present the first unified study of the efficiency of self-attention-based Transformer variants spanning text, speech and vision.

Style-transfer based Speech and Audio-visual Scene Understanding for Robot Action Sequence Acquisition from Videos

no code implementations27 Jun 2023 Chiori Hori, Puyuan Peng, David Harwath, Xinyu Liu, Kei Ota, Siddarth Jain, Radu Corcodel, Devesh Jha, Diego Romeres, Jonathan Le Roux

This paper introduces a method for generating robot action sequences from instruction videos using (1) an audio-visual Transformer that converts audio-visual features and instruction speech into a sequence of robot actions called dynamic movement primitives (DMPs), and (2) style-transfer-based training that employs multi-task learning with video captioning and weakly-supervised learning with a semantic classifier to exploit unpaired video-action data.

Multi-Task Learning Scene Understanding +3

BAT: Learning to Reason about Spatial Sounds with Large Language Models

no code implementations2 Feb 2024 Zhisheng Zheng, Puyuan Peng, Ziyang Ma, Xie Chen, Eunsol Choi, David Harwath

By integrating Spatial-AST with the LLaMA-2 7B model, BAT transcends standard Sound Event Localization and Detection (SELD) tasks, enabling the model to reason about the relationships between the sounds in its environment.

Event Detection Language Modelling +5

VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild

2 code implementations25 Mar 2024 Puyuan Peng, Po-Yao Huang, Daniel Li, Abdelrahman Mohamed, David Harwath

We introduce VoiceCraft, a token-infilling neural codec language model that achieves state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on audiobooks, internet videos, and podcasts.

Language Modelling
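
A toy sketch of the token-infilling idea named in the abstract above: a span of codec tokens is cut out, marked with a mask token, and moved to the end of the sequence so a causal language model can predict the missing span conditioned on the surrounding context. The token ids and the single-span setup are simplifying assumptions, not the model's actual vocabulary or masking scheme.

```python
# Illustrative sketch of token infilling for a causal language model:
# the masked span is relocated to the end of the sequence so it can be
# generated left-to-right given the surrounding context.
MASK, EOS = 1024, 1025  # hypothetical special token ids

def rearrange_for_infilling(tokens, span_start, span_end):
    prefix, span, suffix = tokens[:span_start], tokens[span_start:span_end], tokens[span_end:]
    # Model input: context with a mask marker, then the span to be generated.
    return prefix + [MASK] + suffix + [EOS] + span

print(rearrange_for_infilling([10, 11, 12, 13, 14], 1, 3))
# [10, 1024, 13, 14, 1025, 11, 12]
```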

SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos

no code implementations8 Apr 2024 Changan Chen, Kumar Ashutosh, Rohit Girdhar, David Harwath, Kristen Grauman

We propose a novel self-supervised embedding to learn how actions sound from narrated in-the-wild egocentric videos.

Speak: A Toolkit Using Amazon Mechanical Turk to Collect and Validate Speech Audio Recordings

no code implementations LREC 2022 Christopher Song, David Harwath, Tuka Alhanai, James Glass

We present Speak, a toolkit that allows researchers to crowdsource speech audio recordings using Amazon Mechanical Turk (MTurk).
