Search Results for author: David Harwath

Found 44 papers, 23 papers with code

Deep Multimodal Semantic Embeddings for Speech and Images

no code implementations11 Nov 2015 David Harwath, James Glass

In this paper, we present a model which takes as input a corpus of images with relevant spoken captions and finds a correspondence between the two modalities.

Image Retrieval
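
A minimal sketch of the kind of cross-modal embedding objective described in the abstract above: an image encoder and a spoken-caption encoder mapped into a shared space and trained with a margin ranking loss over a batch. The encoder architectures, feature dimensions, and margin are illustrative assumptions, not the paper's exact configuration.

```python
# Illustrative sketch (not the paper's exact model): two encoders embed images
# and spoken captions into one space; matching pairs must outscore mismatched
# pairs by a margin. Dimensions and architectures are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    def __init__(self, image_dim=2048, audio_dim=40, embed_dim=512):
        super().__init__()
        self.image_proj = nn.Linear(image_dim, embed_dim)                 # assumes precomputed image features
        self.audio_rnn = nn.GRU(audio_dim, embed_dim, batch_first=True)   # assumes spectrogram frames

    def forward(self, image_feats, audio_frames):
        img = F.normalize(self.image_proj(image_feats), dim=-1)
        _, h = self.audio_rnn(audio_frames)
        aud = F.normalize(h[-1], dim=-1)
        return img, aud

def margin_ranking_loss(img, aud, margin=0.2):
    sim = img @ aud.t()               # batch x batch similarity matrix
    pos = sim.diag().unsqueeze(1)     # similarity of true image/caption pairs
    mask = 1.0 - torch.eye(sim.size(0), device=sim.device)
    # Penalize any mismatched pair that comes within `margin` of the true pair.
    cost_caption = (F.relu(margin + sim - pos) * mask).mean()
    cost_image = (F.relu(margin + sim - pos.t()) * mask).mean()
    return cost_caption + cost_image
```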

Learning Word-Like Units from Joint Audio-Visual Analysis

no code implementations ACL 2017 David Harwath, James R. Glass

Given a collection of images and spoken audio captions, we present a method for discovering word-like acoustic units in the continuous speech signal and grounding them to semantically relevant image regions.

Automatic Speech Recognition (ASR) +2

Learning Modality-Invariant Representations for Speech and Images

no code implementations11 Dec 2017 Kenneth Leidal, David Harwath, James Glass

In this paper, we explore the unsupervised learning of a semantic embedding space for co-occurring sensory inputs.

Information Retrieval Retrieval +3

Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input

no code implementations ECCV 2018 David Harwath, Adrià Recasens, Dídac Surís, Galen Chuang, Antonio Torralba, James Glass

In this paper, we explore neural network models that learn to associate segments of spoken audio captions with the semantically relevant portions of natural images that they refer to.

Retrieval

Vision as an Interlingua: Learning Multilingual Semantic Embeddings of Untranscribed Speech

no code implementations9 Apr 2018 David Harwath, Galen Chuang, James Glass

In this paper, we explore the learning of neural network embeddings for natural images and speech waveforms describing the content of those images.

Retrieval Speech Recognition +1

Towards Visually Grounded Sub-Word Speech Unit Discovery

no code implementations21 Feb 2019 David Harwath, James Glass

In this paper, we investigate the manner in which interpretable sub-word speech units emerge within a convolutional neural network model trained to associate raw speech waveforms with semantically related natural image scenes.

Transfer Learning from Audio-Visual Grounding to Speech Recognition

no code implementations9 Jul 2019 Wei-Ning Hsu, David Harwath, James Glass

Transfer learning aims to reduce the amount of data required to excel at a new task by re-using the knowledge acquired from learning other related tasks.

Speech Recognition +2

Text-Free Image-to-Speech Synthesis Using Learned Segmental Units

no code implementations ACL 2021 Wei-Ning Hsu, David Harwath, Christopher Song, James Glass

In this paper we present the first model for directly synthesizing fluent, natural-sounding spoken audio captions for images that does not require natural language text as an intermediate representation or source of supervision.

Image Captioning Speech Synthesis +1

Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions

no code implementations CVPR 2021 Mathew Monfort, SouYoung Jin, Alexander Liu, David Harwath, Rogerio Feris, James Glass, Aude Oliva

With this in mind, the descriptions people generate for videos of different dynamic events can greatly improve our understanding of the key information of interest in each video.

Contrastive Learning Retrieval +1

Learning Audio-Visual Dereverberation

1 code implementation14 Jun 2021 Changan Chen, Wei Sun, David Harwath, Kristen Grauman

We introduce Visually-Informed Dereverberation of Audio (VIDA), an end-to-end approach that learns to remove reverberation based on both the observed monaural sound and visual scene.

Automatic Speech Recognition (ASR) +3

Fast-Slow Transformer for Visually Grounding Speech

1 code implementation16 Sep 2021 Puyuan Peng, David Harwath

We present Fast-Slow Transformer for Visually Grounding Speech, or FaST-VGS.

Image Retrieval Retrieval

Routing with Self-Attention for Multimodal Capsule Networks

no code implementations1 Dec 2021 Kevin Duarte, Brian Chen, Nina Shvetsova, Andrew Rouditchenko, Samuel Thomas, Alexander Liu, David Harwath, James Glass, Hilde Kuehne, Mubarak Shah

We present a new multimodal capsule network that allows us to leverage the strength of capsules in the context of a multimodal learning framework on large amounts of video data.

Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval

1 code implementation8 Dec 2021 Nina Shvetsova, Brian Chen, Andrew Rouditchenko, Samuel Thomas, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Hilde Kuehne

Multi-modal learning from video data has seen increased attention recently, as it allows training semantically meaningful embeddings without human annotation, enabling tasks like zero-shot retrieval and classification.

Action Localization Retrieval +2

Everything at Once - Multi-Modal Fusion Transformer for Video Retrieval

1 code implementation CVPR 2022 Nina Shvetsova, Brian Chen, Andrew Rouditchenko, Samuel Thomas, Brian Kingsbury, Rogerio S. Feris, David Harwath, James Glass, Hilde Kuehne

In this work, we present a multi-modal, modality-agnostic fusion transformer that learns to exchange information between multiple modalities, such as video, audio, and text, and integrate them into a fused representation in a joint multi-modal embedding space.

Action Localization Retrieval +2
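
A rough sketch of the modality-agnostic fusion idea described in the abstract above: token sequences from any subset of modalities are concatenated and processed by one shared transformer encoder, then pooled into a single fused embedding. The dimensions, layer counts, and mean pooling are assumptions for illustration only.

```python
# Illustrative sketch of a modality-agnostic fusion transformer: tokens from
# any combination of modalities share one encoder and are pooled into a joint
# embedding. All dimensions here are placeholder assumptions.
import torch
import torch.nn as nn

class FusionTransformer(nn.Module):
    def __init__(self, dim=512, heads=8, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, *modality_tokens):
        # Each argument: (batch, seq_len_m, dim) tokens from one modality.
        fused = self.encoder(torch.cat(modality_tokens, dim=1))
        return fused.mean(dim=1)  # pooled joint embedding

# Example: fuse video, audio, and text token sequences.
video = torch.randn(4, 16, 512)
audio = torch.randn(4, 32, 512)
text = torch.randn(4, 12, 512)
joint = FusionTransformer()(video, audio, text)  # shape (4, 512)
```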

MAE-AST: Masked Autoencoding Audio Spectrogram Transformer

2 code implementations30 Mar 2022 Alan Baade, Puyuan Peng, David Harwath

In this paper, we propose a simple yet powerful improvement over the recent Self-Supervised Audio Spectrogram Transformer (SSAST) model for speech and audio classification.

Audio Classification

Contrastive Audio-Visual Masked Autoencoder

1 code implementation2 Oct 2022 Yuan Gong, Andrew Rouditchenko, Alexander H. Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, James Glass

In this paper, we first extend the recent Masked Auto-Encoder (MAE) model from a single modality to audio-visual multi-modalities.

Ranked #1 on Audio Tagging on AudioSet (using extra training data)

Audio Classification Audio Tagging +4
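
At a high level, the abstract above describes pairing MAE-style masked reconstruction with an audio-visual contrastive objective. The sketch below shows only how two such losses might be combined; the encoders, masking strategy, and loss weight are placeholder assumptions rather than the paper's recipe.

```python
# Illustrative sketch: masked-reconstruction loss plus an audio-visual
# contrastive loss, combined with an assumed weighting factor.
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, video_emb, temperature=0.07):
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = a @ v.t() / temperature
    targets = torch.arange(len(a), device=a.device)
    # Symmetric InfoNCE over the batch of audio/video pairs.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def masked_av_loss(recon_a, target_a, recon_v, target_v, a_cls, v_cls, lam=0.01):
    # MAE-style reconstruction of masked audio and video patches,
    # plus cross-modal contrast between pooled audio and video embeddings.
    recon = F.mse_loss(recon_a, target_a) + F.mse_loss(recon_v, target_v)
    return recon + lam * contrastive_loss(a_cls, v_cls)
```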

SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model

1 code implementation3 Oct 2022 Yi-Jen Shih, Hsuan-Fu Wang, Heng-Jui Chang, Layne Berry, Hung-Yi Lee, David Harwath

Data-driven speech processing models usually perform well with a large amount of text supervision, but collecting transcribed speech data is costly.

Language Modelling Retrieval +1

C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval

1 code implementation7 Oct 2022 Andrew Rouditchenko, Yung-Sung Chuang, Nina Shvetsova, Samuel Thomas, Rogerio Feris, Brian Kingsbury, Leonid Karlinsky, David Harwath, Hilde Kuehne, James Glass

Inspired by the fact that English text-video retrieval outperforms other languages, we train a student model using input text in different languages to match the cross-modal predictions from teacher models using input text in English.

Knowledge Distillation Retrieval +2
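
A minimal sketch of the cross-lingual, cross-modal distillation setup described in the abstract above: a student encoder fed non-English text is trained to match the text-to-video similarity distribution produced by a teacher fed the parallel English text. The temperature and the KL formulation are assumptions for illustration.

```python
# Illustrative sketch: the student's text-video retrieval scores are pushed
# toward the teacher's (English-text) scores with a KL-divergence loss.
# Encoders producing the embeddings are assumed to exist elsewhere.
import torch
import torch.nn.functional as F

def cross_lingual_distillation_loss(student_text_emb, teacher_text_emb, video_emb, tau=0.05):
    # Text-to-video similarity distributions over the batch of videos.
    teacher_probs = F.softmax(teacher_text_emb @ video_emb.t() / tau, dim=-1)
    student_logp = F.log_softmax(student_text_emb @ video_emb.t() / tau, dim=-1)
    return F.kl_div(student_logp, teacher_probs, reduction="batchmean")
```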

Why is Winoground Hard? Investigating Failures in Visuolinguistic Compositionality

1 code implementation1 Nov 2022 Anuj Diwan, Layne Berry, Eunsol Choi, David Harwath, Kyle Mahowald

Recent visuolinguistic pre-trained models show promising progress on various end tasks such as image retrieval and video captioning.

Data Augmentation Image Retrieval +2

M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech to Image Retrieval

no code implementations2 Nov 2022 Layne Berry, Yi-Jen Shih, Hsuan-Fu Wang, Heng-Jui Chang, Hung-Yi Lee, David Harwath

This work investigates the use of large-scale, English-only pre-trained models (CLIP and HuBERT) for multilingual image-speech retrieval.

Image Retrieval Retrieval +1

Phoneme Segmentation Using Self-Supervised Speech Models

1 code implementation2 Nov 2022 Luke Strgar, David Harwath

We apply transfer learning to the task of phoneme segmentation and demonstrate the utility of representations learned in self-supervised pre-training for the task.

Segmentation Transfer Learning
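
A minimal sketch of transferring self-supervised speech representations to boundary prediction, in the spirit of the abstract above: a small frame-level classifier on top of frozen features from a wav2vec 2.0 / HuBERT-style encoder. The feature dimension, head architecture, and binary boundary targets are assumptions, not the paper's exact recipe.

```python
# Illustrative sketch: frame-level phoneme-boundary classification on top of
# frozen self-supervised speech features. Dimensions are assumptions.
import torch
import torch.nn as nn

class BoundaryHead(nn.Module):
    def __init__(self, feat_dim=768):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def forward(self, ssl_features):
        # ssl_features: (batch, frames, feat_dim) from a frozen pre-trained model.
        return self.classifier(ssl_features).squeeze(-1)  # per-frame boundary logits

head = BoundaryHead()
logits = head(torch.randn(2, 300, 768))
loss = nn.BCEWithLogitsLoss()(logits, torch.zeros(2, 300))  # 0/1 boundary targets
```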

Unsupervised Fine-Tuning Data Selection for ASR Using Self-Supervised Speech Models

no code implementations3 Dec 2022 Reem Gody, David Harwath

Self-supervised learning (SSL) has been able to leverage unlabeled data to boost the performance of automatic speech recognition (ASR) models when we have access to only a small amount of transcribed speech data.

Automatic Speech Recognition (ASR) +2

Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization

1 code implementation18 May 2023 Puyuan Peng, Brian Yan, Shinji Watanabe, David Harwath

We investigate the emergent abilities of the recently proposed web-scale speech model Whisper, by adapting it to unseen tasks with prompt engineering.

Audio-Visual Speech Recognition Prompt Engineering +2

Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model

2 code implementations19 May 2023 Puyuan Peng, Shang-Wen Li, Okko Räsänen, Abdelrahman Mohamed, David Harwath

In this paper, we show that representations capturing syllabic units emerge when training a self-supervised speech model with a visually-grounded training objective.

Language Modelling Masked Language Modeling +3

Comparison of Multilingual Self-Supervised and Weakly-Supervised Speech Pre-Training for Adaptation to Unseen Languages

no code implementations21 May 2023 Andrew Rouditchenko, Sameer Khurana, Samuel Thomas, Rogerio Feris, Leonid Karlinsky, Hilde Kuehne, David Harwath, Brian Kingsbury, James Glass

Recent models such as XLS-R and Whisper have made multilingual speech technologies more accessible by pre-training on audio from around 100 spoken languages each.

Textless Low-Resource Speech-to-Speech Translation With Unit Language Models

1 code implementation24 May 2023 Anuj Diwan, Anirudh Srinivasan, David Harwath, Eunsol Choi

We train and evaluate our models for English-to-German, German-to-English and Marathi-to-English translation on three different domains (European Parliament, Common Voice, and All India Radio) with single-speaker synthesized speech data.

Automatic Speech Recognition Denoising +6

When to Use Efficient Self Attention? Profiling Text, Speech and Image Transformer Variants

1 code implementation14 Jun 2023 Anuj Diwan, Eunsol Choi, David Harwath

We present the first unified study of the efficiency of self-attention-based Transformer variants spanning text, speech and vision.

Style-transfer based Speech and Audio-visual Scene Understanding for Robot Action Sequence Acquisition from Videos

no code implementations27 Jun 2023 Chiori Hori, Puyuan Peng, David Harwath, Xinyu Liu, Kei Ota, Siddarth Jain, Radu Corcodel, Devesh Jha, Diego Romeres, Jonathan Le Roux

This paper introduces a method for generating robot action sequences from instruction videos using (1) an audio-visual Transformer that converts audio-visual features and instruction speech into a sequence of robot actions called dynamic movement primitives (DMPs), and (2) style-transfer-based training that employs multi-task learning with video captioning and weakly-supervised learning with a semantic classifier to exploit unpaired video-action data.

Multi-Task Learning Scene Understanding +3

BAT: Learning to Reason about Spatial Sounds with Large Language Models

no code implementations2 Feb 2024 Zhisheng Zheng, Puyuan Peng, Ziyang Ma, Xie Chen, Eunsol Choi, David Harwath

By integrating Spatial-AST with the LLaMA-2 7B model, BAT transcends standard Sound Event Localization and Detection (SELD) tasks, enabling the model to reason about the relationships between the sounds in its environment.

Event Detection Language Modelling +5

VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild

2 code implementations25 Mar 2024 Puyuan Peng, Po-Yao Huang, Daniel Li, Abdelrahman Mohamed, David Harwath

We introduce VoiceCraft, a token-infilling neural codec language model that achieves state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on audiobooks, internet videos, and podcasts.

Language Modelling
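
A toy sketch of the token-infilling idea named in the abstract above: a span of codec tokens is cut out, marked with a mask token, and moved to the end of the sequence so a causal language model can predict the missing span conditioned on the surrounding context. The token ids and the single-span setup are simplifying assumptions, not the model's actual vocabulary or masking scheme.

```python
# Illustrative sketch of token infilling for a causal language model:
# the masked span is relocated to the end of the sequence so it can be
# generated left-to-right given the surrounding context.
MASK, EOS = 1024, 1025  # hypothetical special token ids

def rearrange_for_infilling(tokens, span_start, span_end):
    prefix, span, suffix = tokens[:span_start], tokens[span_start:span_end], tokens[span_end:]
    # Model input: context with a mask marker, then the span to be generated.
    return prefix + [MASK] + suffix + [EOS] + span

print(rearrange_for_infilling([10, 11, 12, 13, 14], 1, 3))
# [10, 1024, 13, 14, 1025, 11, 12]
```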

SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos

no code implementations8 Apr 2024 Changan Chen, Kumar Ashutosh, Rohit Girdhar, David Harwath, Kristen Grauman

We propose a novel self-supervised embedding to learn how actions sound from narrated in-the-wild egocentric videos.

Speak: A Toolkit Using Amazon Mechanical Turk to Collect and Validate Speech Audio Recordings

no code implementations LREC 2022 Christopher Song, David Harwath, Tuka Alhanai, James Glass

We present Speak, a toolkit that allows researchers to crowdsource speech audio recordings using Amazon Mechanical Turk (MTurk).
