no code implementations • LREC 2022 • Christopher Song, David Harwath, Tuka Alhanai, James Glass
We present Speak, a toolkit that allows researchers to crowdsource speech audio recordings using Amazon Mechanical Turk (MTurk).
1 code implementation • 19 Sep 2023 • Yuan Tseng, Layne Berry, Yi-Ting Chen, I-Hsiang Chiu, Hsuan-Hao Lin, Max Liu, Puyuan Peng, Yi-Jen Shih, Hung-Yu Wang, Haibin Wu, Po-Yao Huang, Chun-Mao Lai, Shang-Wen Li, David Harwath, Yu Tsao, Shinji Watanabe, Abdelrahman Mohamed, Chi-Luen Feng, Hung-Yi Lee
Audio-visual representation learning aims to develop systems with human-like perception by utilizing the correlation between auditory and visual information.
no code implementations • 27 Jun 2023 • Chiori Hori, Puyuan Peng, David Harwath, Xinyu Liu, Kei Ota, Siddarth Jain, Radu Corcodel, Devesh Jha, Diego Romeres, Jonathan Le Roux
This paper introduces a method for robot action sequence generation from instruction videos using (1) an audio-visual Transformer that converts audio-visual features and instruction speech to a sequence of robot actions called dynamic movement primitives (DMPs) and (2) style-transfer-based training that employs multi-task learning with video captioning and weakly-supervised learning with a semantic classifier to exploit unpaired video-action data.
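Since the action representation named above is the dynamic movement primitive (DMP), a brief sketch of the standard discrete DMP formulation may help; this is the textbook (Ijspeert-style) version, not necessarily the exact variant used in the paper, and the gains, basis functions, and example parameters below are illustrative only.

```python
# A hedged sketch of a single 1-D dynamic movement primitive (DMP). In the paper,
# a transformer predicts the parameters of such primitives from audio-visual
# input; that prediction step is not shown here.
import numpy as np

def rollout_dmp(y0, goal, weights, centers, widths, tau=1.0, dt=0.01,
                alpha_z=25.0, beta_z=6.25, alpha_x=8.0):
    """Integrate one 1-D DMP with Euler steps and return the trajectory."""
    y, z, x = y0, 0.0, 1.0
    traj = []
    for _ in range(int(tau / dt)):
        psi = np.exp(-widths * (x - centers) ** 2)            # RBF basis activations
        forcing = (psi @ weights) / (psi.sum() + 1e-8) * x * (goal - y0)
        z += dt / tau * (alpha_z * (beta_z * (goal - y) - z) + forcing)
        y += dt / tau * z                                      # position update
        x += dt / tau * (-alpha_x * x)                         # canonical phase decay
        traj.append(y)
    return np.array(traj)

# Example: a primitive moving from 0 to 1 with 10 (untrained) basis functions.
traj = rollout_dmp(0.0, 1.0, np.zeros(10), np.linspace(1, 0, 10), np.full(10, 25.0))
```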
1 code implementation • 14 Jun 2023 • Anuj Diwan, Eunsol Choi, David Harwath
We present the first unified study of the efficiency of self-attention-based Transformer variants spanning text, speech and vision.
1 code implementation • 24 May 2023 • Anuj Diwan, Anirudh Srinivasan, David Harwath, Eunsol Choi
We propose an unsupervised speech-to-speech translation (S2ST) system that does not rely on parallel data between the source and target languages.
no code implementations • 21 May 2023 • Andrew Rouditchenko, Sameer Khurana, Samuel Thomas, Rogerio Feris, Leonid Karlinsky, Hilde Kuehne, David Harwath, Brian Kingsbury, James Glass
Recent models such as XLS-R and Whisper have made multilingual speech technologies more accessible by pre-training on audio from around 100 spoken languages each.
1 code implementation • 19 May 2023 • Puyuan Peng, Shang-Wen Li, Okko Räsänen, Abdelrahman Mohamed, David Harwath
In this paper, we show that representations capturing syllabic units emerge when training a self-supervised speech model with a visually-grounded training objective.
1 code implementation • 18 May 2023 • Puyuan Peng, Brian Yan, Shinji Watanabe, David Harwath
We investigate the emergent abilities of the recently proposed web-scale speech model Whisper by adapting it to unseen tasks with prompt engineering.
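For readers unfamiliar with prompting Whisper, below is a minimal sketch using the open-source openai-whisper package; it illustrates the general mechanism (decoder prompts plus task and language options) rather than the paper's specific prompt designs, and the file names and prompt text are placeholders.

```python
# A minimal sketch of prompt-based adaptation with the openai-whisper package.
import whisper

model = whisper.load_model("large-v2")

# Bias transcription toward domain vocabulary via the decoder's initial prompt.
result = model.transcribe(
    "meeting_recording.wav",          # hypothetical input file
    language="en",
    task="transcribe",
    initial_prompt="Topics: HuBERT, wav2vec 2.0, self-supervised learning.",
)
print(result["text"])

# The same interface exposes built-in speech translation (X -> English).
translated = model.transcribe("german_talk.wav", task="translate")
print(translated["text"])
```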
no code implementations • 3 Dec 2022 • Reem Gody, David Harwath
Self-supervised learning (SSL) has been able to leverage unlabeled data to boost the performance of automatic speech recognition (ASR) models when we have access to only a small amount of transcribed speech data.
Tasks: Automatic Speech Recognition (ASR), +2 more
no code implementations • 2 Dec 2022 • Anuj Diwan, Ching-Feng Yeh, Wei-Ning Hsu, Paden Tomasello, Eunsol Choi, David Harwath, Abdelrahman Mohamed
Additionally, current speech recognition models and continual learning algorithms are not optimized to be compute-efficient.
Tasks: Automatic Speech Recognition (ASR), +3 more
no code implementations • 2 Nov 2022 • Layne Berry, Yi-Jen Shih, Hsuan-Fu Wang, Heng-Jui Chang, Hung-Yi Lee, David Harwath
This work investigates the use of large-scale, English-only pre-trained models (CLIP and HuBERT) for multilingual image-speech retrieval.
1 code implementation • 2 Nov 2022 • Luke Strgar, David Harwath
We apply transfer learning to the task of phoneme segmentation and demonstrate the utility of representations learned in self-supervised pre-training for the task.
1 code implementation • 1 Nov 2022 • Anuj Diwan, Layne Berry, Eunsol Choi, David Harwath, Kyle Mahowald
Recent visuolinguistic pre-trained models show promising progress on various end tasks such as image retrieval and video captioning.
1 code implementation • 7 Oct 2022 • Andrew Rouditchenko, Yung-Sung Chuang, Nina Shvetsova, Samuel Thomas, Rogerio Feris, Brian Kingsbury, Leonid Karlinsky, David Harwath, Hilde Kuehne, James Glass
Inspired by the fact that English text-video retrieval outperforms other languages, we train a student model using input text in different languages to match the cross-modal predictions from teacher models using input text in English.
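A hedged sketch of this teacher-student objective: the student's text-video similarities for non-English queries are trained to match the teacher's similarities computed from parallel English text. The encoders, temperature, and exact loss used in the paper may differ from this minimal version.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_text_emb, teacher_text_emb, video_emb, tau=0.07):
    """KL between student and teacher text-to-video similarity distributions.

    student_text_emb: (B, D) embeddings of non-English captions (student)
    teacher_text_emb: (B, D) embeddings of parallel English captions (teacher)
    video_emb:        (B, D) video embeddings shared by both models
    """
    s = F.normalize(student_text_emb, dim=-1) @ F.normalize(video_emb, dim=-1).T
    t = F.normalize(teacher_text_emb, dim=-1) @ F.normalize(video_emb, dim=-1).T
    # The teacher provides soft targets; gradients flow only through the student.
    return F.kl_div(
        F.log_softmax(s / tau, dim=-1),
        F.softmax(t.detach() / tau, dim=-1),
        reduction="batchmean",
    )
```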
1 code implementation • 3 Oct 2022 • Yi-Jen Shih, Hsuan-Fu Wang, Heng-Jui Chang, Layne Berry, Hung-Yi Lee, David Harwath
Data-driven speech processing models usually perform well with a large amount of text supervision, but collecting transcribed speech data is costly.
1 code implementation • 2 Oct 2022 • Yuan Gong, Andrew Rouditchenko, Alexander H. Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, James Glass
In this paper, we first extend the recent Masked Auto-Encoder (MAE) model from a single modality to audio-visual multi-modalities.
Ranked #1 on Audio Tagging on AudioSet (using extra training data)
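To make the masked auto-encoder ingredient above concrete, here is a minimal sketch of random patch masking applied independently to audio and visual tokens before a shared encoder; the paper's encoder/decoder design and its contrastive branch are omitted, and all shapes are illustrative.

```python
import torch

def random_mask(tokens, mask_ratio=0.75):
    """tokens: (B, N, D) patch embeddings; returns kept tokens and their indices."""
    B, N, D = tokens.shape
    n_keep = int(N * (1.0 - mask_ratio))
    noise = torch.rand(B, N)                      # random score per token
    keep_idx = noise.argsort(dim=1)[:, :n_keep]   # lowest-noise tokens are kept
    kept = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    return kept, keep_idx

audio_tokens = torch.randn(2, 512, 768)   # e.g. spectrogram patches
video_tokens = torch.randn(2, 196, 768)   # e.g. image patches
audio_kept, _ = random_mask(audio_tokens)
video_kept, _ = random_mask(video_tokens)
joint_input = torch.cat([audio_kept, video_kept], dim=1)  # fed to a shared encoder
```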
2 code implementations • 30 Mar 2022 • Alan Baade, Puyuan Peng, David Harwath
In this paper, we propose a simple yet powerful improvement over the recent Self-Supervised Audio Spectrogram Transformer (SSAST) model for speech and audio classification.
3 code implementations • 28 Mar 2022 • Puyuan Peng, David Harwath
We present a method for visually-grounded spoken term discovery.
1 code implementation • 7 Feb 2022 • Puyuan Peng, David Harwath
In this paper, we describe our submissions to the ZeroSpeech 2021 Challenge and SUPERB benchmark.
1 code implementation • CVPR 2022 • Nina Shvetsova, Brian Chen, Andrew Rouditchenko, Samuel Thomas, Brian Kingsbury, Rogerio S. Feris, David Harwath, James Glass, Hilde Kuehne
In this work, we present a multi-modal, modality-agnostic fusion transformer that learns to exchange information between multiple modalities, such as video, audio, and text, and integrate them into a fused representation in a joint multi-modal embedding space.
1 code implementation • 8 Dec 2021 • Nina Shvetsova, Brian Chen, Andrew Rouditchenko, Samuel Thomas, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Hilde Kuehne
Multi-modal learning from video data has seen increased attention recently, as it allows training semantically meaningful embeddings without human annotation, enabling tasks like zero-shot retrieval and classification.
no code implementations • 1 Dec 2021 • Kevin Duarte, Brian Chen, Nina Shvetsova, Andrew Rouditchenko, Samuel Thomas, Alexander Liu, David Harwath, James Glass, Hilde Kuehne, Mubarak Shah
We present a new multimodal capsule network that allows us to leverage the strength of capsules in the context of a multimodal learning framework on large amounts of video data.
1 code implementation • 8 Nov 2021 • Andrew Rouditchenko, Angie Boggust, David Harwath, Samuel Thomas, Hilde Kuehne, Brian Chen, Rameswar Panda, Rogerio Feris, Brian Kingsbury, Michael Picheny, James Glass
In this paper, we explore self-supervised audio-visual models that learn from instructional videos.
1 code implementation • 16 Sep 2021 • Puyuan Peng, David Harwath
We present Fast-Slow Transformer for Visually Grounding Speech, or FaST-VGS.
1 code implementation • 14 Jun 2021 • Changan Chen, Wei Sun, David Harwath, Kristen Grauman
We introduce Visually-Informed Dereverberation of Audio (VIDA), an end-to-end approach that learns to remove reverberation based on both the observed monaural sound and visual scene.
Tasks: Automatic Speech Recognition (ASR), +3 more
no code implementations • CVPR 2021 • Mathew Monfort, SouYoung Jin, Alexander Liu, David Harwath, Rogerio Feris, James Glass, Aude Oliva
With this in mind, the descriptions people generate for videos of different dynamic events can greatly improve our understanding of the key information of interest in each video.
1 code implementation • ICCV 2021 • Brian Chen, Andrew Rouditchenko, Kevin Duarte, Hilde Kuehne, Samuel Thomas, Angie Boggust, Rameswar Panda, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Michael Picheny, Shih-Fu Chang
Multimodal self-supervised learning is receiving increasing attention, as it makes it possible not only to train large networks without human supervision but also to search and retrieve data across various modalities.
no code implementations • ACL 2021 • Wei-Ning Hsu, David Harwath, Christopher Song, James Glass
In this paper we present the first model for directly synthesizing fluent, natural-sounding spoken audio captions for images that does not require natural language text as an intermediate representation or source of supervision.
1 code implementation • 16 Jun 2020 • Andrew Rouditchenko, Angie Boggust, David Harwath, Brian Chen, Dhiraj Joshi, Samuel Thomas, Kartik Audhkhasi, Hilde Kuehne, Rameswar Panda, Rogerio Feris, Brian Kingsbury, Michael Picheny, Antonio Torralba, James Glass
Further, we propose a tri-modal model that jointly processes raw audio, video, and text captions from videos to learn a multi-modal semantic embedding space useful for text-video retrieval.
Tasks: Automatic Speech Recognition (ASR), +5 more
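A hedged sketch of such a shared tri-modal embedding space: each modality is projected into a common dimension and the pairs are aligned with symmetric contrastive (InfoNCE-style) losses. The actual encoders and the specific loss variant used in the paper may differ; the embeddings below are random stand-ins.

```python
import torch
import torch.nn.functional as F

def info_nce(x, y, tau=0.05):
    """Symmetric contrastive loss between two batches of paired embeddings."""
    x, y = F.normalize(x, dim=-1), F.normalize(y, dim=-1)
    logits = x @ y.T / tau
    targets = torch.arange(x.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

# Hypothetical pre-projected embeddings for a batch of 8 clips.
audio, video, text = (torch.randn(8, 256) for _ in range(3))
loss = info_nce(audio, video) + info_nce(video, text) + info_nce(audio, text)
```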
1 code implementation • ICLR 2020 • David Harwath, Wei-Ning Hsu, James Glass
What differentiates this paper from prior work on speech unit learning is the choice of training objective.
no code implementations • 9 Jul 2019 • Wei-Ning Hsu, David Harwath, James Glass
Transfer learning aims to reduce the amount of data required to excel at a new task by re-using the knowledge acquired from learning other related tasks.
no code implementations • 21 Feb 2019 • David Harwath, James Glass
In this paper, we investigate the manner in which interpretable sub-word speech units emerge within a convolutional neural network model trained to associate raw speech waveforms with semantically related natural image scenes.
no code implementations • 9 Apr 2018 • David Harwath, Galen Chuang, James Glass
In this paper, we explore the learning of neural network embeddings for natural images and speech waveforms describing the content of those images.
no code implementations • ECCV 2018 • David Harwath, Adrià Recasens, Dídac Surís, Galen Chuang, Antonio Torralba, James Glass
In this paper, we explore neural network models that learn to associate segments of spoken audio captions with the semantically relevant portions of natural images that they refer to.
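A minimal sketch of the matchmap idea from this line of work: a similarity tensor between every image region and every audio frame, pooled into a single image-caption score. The feature extractors are stubbed out with random tensors, and the pooling choice (e.g. MISA vs. SISA) varies across the papers.

```python
import torch

image_feats = torch.randn(14, 14, 512)   # (H, W, D) conv feature map of the image
audio_feats = torch.randn(128, 512)      # (T, D) frame-level caption features

# matchmap[h, w, t] = dot product between image region (h, w) and audio frame t
matchmap = torch.einsum("hwd,td->hwt", image_feats, audio_feats)

# MISA-style pooling: best image region per frame, then average over frames.
score = matchmap.amax(dim=(0, 1)).mean()
```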
no code implementations • 11 Dec 2017 • Kenneth Leidal, David Harwath, James Glass
In this paper, we explore the unsupervised learning of a semantic embedding space for co-occurring sensory inputs.
no code implementations • ACL 2017 • David Harwath, James R. Glass
Given a collection of images and spoken audio captions, we present a method for discovering word-like acoustic units in the continuous speech signal and grounding them to semantically relevant image regions.
Tasks: Automatic Speech Recognition (ASR), +2 more
no code implementations • NeurIPS 2016 • David Harwath, Antonio Torralba, James Glass
Humans learn to speak before they can read or write, so why can’t computers do the same?
no code implementations • 11 Nov 2015 • David Harwath, James Glass
In this paper, we present a model which takes as input a corpus of images with relevant spoken captions and finds a correspondence between the two modalities.